By taking repeated measurements over time we can study the dynamics of our environment, be it the mean temperature of the UK by month, the daily opening prices of stock markets, or the heart rate of a patient in intensive care. The resulting data consist of an ordered list of single measurements, which we will call a ‘time series’ from now on. Time series can be long (many measurements) and complex, so to make the gathered data easier to exploit we often want to summarise the captured sequences. For example, we might collapse the 12 monthly mean temperatures for each of the past 100 years into a yearly average. This removes the effects of the seasons, reduces the 12 measurements per year to one, and lets us quickly compare temperatures across many years without studying each monthly measurement. Taking the average value of a time series is a very simple example of a so-called ‘time-series feature’: an operation that takes an ordered series of measurements as input and returns a single number that quantifies one particular property of the data. By constructing a set of appropriate features, we can compare, distinguish and group many time series quickly, and even understand in which aspects (i.e., features) two time series are similar or different.
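To make the yearly-average example concrete, here is a minimal Python sketch. The function name `yearly_means` and the temperature values are made up for illustration; they are not taken from any real dataset or from the catch22 code.

```python
from statistics import mean


def yearly_means(monthly, months_per_year=12):
    """Collapse a flat list of monthly values into one mean per year.

    This is the simplest kind of time-series feature: each year's
    12 ordered measurements go in, a single number comes out.
    """
    if len(monthly) % months_per_year != 0:
        raise ValueError("expected a whole number of years")
    return [
        mean(monthly[start:start + months_per_year])
        for start in range(0, len(monthly), months_per_year)
    ]


# Made-up monthly mean temperatures (degrees C) for two years.
temps = [4, 5, 7, 9, 12, 15, 17, 17, 14, 11, 7, 5,
         5, 6, 8, 10, 13, 16, 18, 18, 15, 12, 8, 6]

print(yearly_means(temps))  # -> [10.25, 11.25]
```

The seasonal swing within each year disappears, and the two years can be compared with a single number each.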
Over the past decades, thousands of such time-series features have been developed across different scientific and industrial disciplines, many of which are much more sophisticated than an average over measurements. But which features should we choose from this wealth of options for a given data set of time series? Do features exist that can characterise and meaningfully distinguish sequences from a wide range of sources?
Here we propose a selection procedure that tailors feature sets to given collections of time-series datasets and that can identify features which are generally useful for many different sequence types. The selection starts from the rich collection of 7500+ diverse candidate features previously gathered in the comprehensive ‘highly comparative time-series analysis’ (hctsa) toolbox. From these candidates we automatically curate a small, minimally redundant feature subset based on single-feature performance on the given collection of time-series classification tasks.
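The idea of picking well-performing but minimally redundant features can be sketched in a few lines of Python. This is a simplified stand-in and not the procedure from the paper: it assumes we already have a table of per-task accuracies for each feature, ranks features by mean accuracy, and then greedily skips any feature whose accuracy profile across tasks is highly correlated with one already chosen. The names `select_features`, `top_k`, `max_corr` and all numbers are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def select_features(accuracy, top_k, max_corr):
    """Greedy sketch: keep strong features, skip redundant ones.

    accuracy: dict mapping feature name -> list of accuracies, one per task.
    top_k:    how many top-ranked features (by mean accuracy) to consider.
    max_corr: a candidate is skipped if its accuracy profile correlates
              above this value with any already-selected feature.
    """
    ranked = sorted(
        accuracy,
        key=lambda f: sum(accuracy[f]) / len(accuracy[f]),
        reverse=True,
    )[:top_k]

    selected = []
    for f in ranked:
        # Signed correlation on purpose: anti-correlated profiles mean the
        # features succeed on different tasks, so they complement each other.
        if all(pearson(accuracy[f], accuracy[g]) <= max_corr for g in selected):
            selected.append(f)
    return selected


# Made-up accuracies of four hypothetical features on four tasks.
accuracy = {
    "feat_a": [0.90, 0.60, 0.80, 0.70],
    "feat_b": [0.88, 0.58, 0.79, 0.69],  # near-duplicate of feat_a
    "feat_c": [0.60, 0.85, 0.65, 0.80],  # good on the tasks feat_a is bad at
    "feat_d": [0.50, 0.52, 0.49, 0.51],  # weak everywhere
}

print(select_features(accuracy, top_k=3, max_corr=0.9))  # -> ['feat_a', 'feat_c']
```

Here `feat_d` never makes the top three, and `feat_b` is dropped because it succeeds and fails on the same tasks as `feat_a`, so it adds little new information.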
Figure 1: The selected 22 features perform only slightly worse than the full (pre-filtered) set of 4,791. A: Scatter of classification accuracy in each dataset; error bars signify standard deviation across folds. B: Mean execution times for time series of length 10,000. C: Near-linear scaling of computation time with time-series length.
By applying our pipeline to a standard library of 93 classification problems from the data-mining literature (UEA/UCR), we compiled a set of 22 features (catch22) that we then implemented in C and wrapped for R, Python, and Matlab. Each of the 22 resulting features possesses discriminative power on its own, and together they perform only ~10% worse than the full hctsa feature set on the considered data, at a 1000-fold reduction in computation time (see Fig. 1).
As the UEA/UCR datasets mainly consist of short, aligned, and normalised time series, the features are especially suited to data with these characteristics. The selection pipeline may be applied to other collections of time-series datasets with different properties to generate new, different feature sets. It can also be adapted to performance metrics other than classification accuracy, to select features for analyses such as clustering or regression.
For all the details, see the full paper, "catch22: CAnonical Time-series CHaracteristics Selected through highly comparative time-series analysis", available for free in the journal Data Mining and Knowledge Discovery (http://link.springer.com/article/10.1007/s10618-019-00647-x). The catch22 feature set is on GitHub (https://github.com/chlubba/catch22). Carl, Ben, Nick