Thursday, 29 October 2020

catch22 features of signals

By taking repeated measurements over time, we can study the dynamics of our environment – be it the monthly mean temperature of the UK, the daily opening prices of stock markets, or the heart rate of a patient in intensive care. The resulting data consist of an ordered list of single measurements, which we will call a ‘time series’ from now on. Time series can be long (many measurements) and complex, so to make the gathered data easier to work with we often want to summarise the captured sequences. For example, we might collapse the 12 monthly mean temperatures of each of the past 100 years into a yearly average. This removes the effects of the seasons, reduces 12 measurements per year to 1, and lets us quickly compare temperatures across many years without studying each monthly value. Taking the average of a time series is a very simple example of a so-called ‘time-series feature’: an operation that takes an ordered series of measurements as input and returns a single number quantifying one particular property of the data. By constructing a set of appropriate features, we can compare, distinguish, and group many time series quickly, and even understand in what aspects (i.e., features) two time series are similar or different.
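To make this concrete, here is a minimal sketch in Python of what such one-number features look like; the function names and the toy temperature values are ours, purely for illustration:

import numpy as np

def feature_mean(ts):
    # The simplest possible feature: collapse the series to its average.
    return np.mean(ts)

def feature_spread(ts):
    # Another single-number summary: how variable the measurements are.
    return np.std(ts)

# Twelve monthly mean temperatures (made-up numbers) for one year:
monthly = np.array([4.1, 4.5, 6.8, 9.2, 12.5, 15.3,
                    17.1, 16.9, 14.2, 10.6, 7.0, 4.8])

print(feature_mean(monthly))    # one figure summarising the whole year
print(feature_spread(monthly))  # one figure capturing the seasonal swing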


Over recent decades, thousands of such time-series features have been developed across different scientific and industrial disciplines, many of them far more sophisticated than a simple average. But which features should we choose from this wealth of options for a given dataset of time series? And do features exist that can characterise and meaningfully distinguish sequences from a wide range of sources?


Here we propose a selection procedure that tailors feature sets to a given collection of time-series datasets and that can identify features which are generally useful across many different sequence types. The selection draws on the rich pool of 7,500+ diverse candidate features previously gathered in the comprehensive ‘highly comparative time-series analysis’ (hctsa) toolbox (paper here), from which we automatically curate a small, minimally redundant feature subset based on the performance of each single feature on the given collection of time-series classification tasks.
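A rough caricature of such a selection loop in Python (the actual pipeline in the paper additionally includes statistical pre-filtering and clustering of features by their performance profiles; all names and thresholds below are ours):

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def single_feature_scores(X, y):
    # X: (n_series, n_features) matrix of precomputed feature values
    # for one classification task; returns one accuracy per feature.
    return np.array([
        cross_val_score(DecisionTreeClassifier(max_depth=3),
                        X[:, [j]], y, cv=5).mean()
        for j in range(X.shape[1])
    ])

def prune_redundant(profiles, r_max=0.8):
    # profiles: (n_features, n_tasks) matrix of per-task accuracies.
    # Walk through features from best to worst mean score and drop any
    # whose performance profile correlates strongly with one already kept.
    order = np.argsort(-profiles.mean(axis=1))
    kept = []
    for j in order:
        if all(abs(np.corrcoef(profiles[j], profiles[k])[0, 1]) < r_max
               for k in kept):
            kept.append(j)
    return kept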

Figure 1: The selected 22 features perform only slightly worse than the full (pre-filtered) set of 4,791. (A) Classification accuracy on each dataset; error bars show the standard deviation across folds. (B) Mean execution times for time series of length 10,000. (C) Near-linear scaling of computation time with time-series length.


By applying our pipeline to a standard library of 93 classification problems from the data-mining literature (the UEA/UCR repository), we compiled a set of 22 features (catch22), which we then implemented in C and wrapped for R, Python, and Matlab. Each of the 22 selected features is individually discriminative, and together they perform only ~10% worse than the full hctsa feature set on the considered data, at a roughly 1,000-fold reduction in computation time; see Fig. 1.
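In Python, computing the features for a series then looks roughly like this (assuming the catch22_all entry point exposed by the wrapper in the GitHub repository; see its README for the exact interface):

import numpy as np
import catch22  # Python wrapper from github.com/chlubba/catch22

rng = np.random.default_rng(0)
ts = np.cumsum(rng.standard_normal(1000)).tolist()  # a toy random walk

res = catch22.catch22_all(ts)  # all 22 features in one call (assumed API)
for name, value in zip(res['names'], res['values']):
    print(f'{name}: {value:.4f}')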


As the UEA/UCR datasets mainly consist of short, aligned, and normalised time series, the selected features are especially suited to data with these characteristics. The selection pipeline can be applied to other collections of time-series datasets with different properties to generate new feature sets, and it can be adapted to performance metrics other than classification accuracy to select features for analyses such as clustering or regression.


The full paper is freely available here (http://link.springer.com/article/10.1007/s10618-019-00647-x) under the title "catch22: CAnonical Time-series CHaracteristics Selected through highly comparative time-series analysis" in the journal Data Mining and Knowledge Discovery. The catch22 feature set is on GitHub (https://github.com/chlubba/catch22). Carl, Ben, Nick


Universal approaches to measuring social distances and segregation

How people form connections is a fundamental question in the social sciences. Peter Blau offered a powerful explanation: people connect based on their positions in a social space. Yet a principled measure of social distance remains elusive. Based on a social network model, we develop a family of intuitive segregation measures formalising the notion of distance in social space.

The Blau space metric we learn from connections between individuals offers an intuitive explanation for how people form friendships: the larger the distance, the less likely they are to share a bond. It can also be employed to visualise the relative positions of individuals in the social space: a map of society.
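The core idea can be sketched in a few lines of Python (the exponential decay and all parameter values below are our illustrative choices, not the kernel inferred in the paper):

import numpy as np

rng = np.random.default_rng(1)

n = 200
positions = rng.normal(size=(n, 2))  # coordinates in a 2-d 'Blau space'

def tie_probability(xi, xj, scale=1.0):
    # Any decreasing function of distance works as a connectivity kernel;
    # here, a simple exponential decay stands in for the learned one.
    d = np.linalg.norm(xi - xj)
    return np.exp(-d / scale)

# Sample a network: the further apart two people sit, the rarer the bond.
edges = [(i, j)
         for i in range(n) for j in range(i + 1, n)
         if rng.random() < tie_probability(positions[i], positions[j])]
print(len(edges), 'ties sampled')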

Using US and UK survey data, we show that the social fabric is relatively stable over time. Physical separation and age have the largest effects on social distance, with implications for intergenerational mixing and isolation in later life. You can read our paper "Inference of a universal social scale and segregation measures using social connectivity kernels" for free here and in the Journal of the Royal Society Interface here. Till and Nick.

Tuesday, 13 October 2020

Predicting and controlling rat feeding

We made a rather surprising discovery: a stochastic model of rat feeding turned out to be quite predictive. Using a slowly and smoothly varying quantity (stomach fullness), we were able to make good predictions of when the next feeding bout would occur. Having fit the model's parameters to diverse rats, we could then study the effects of various drugs and explore in silico approaches to lowering food intake.
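As an illustration of this model class, here is a toy simulation in Python; the functional forms and every parameter value are ours, not the fitted model from the paper:

import numpy as np

rng = np.random.default_rng(2)

dt = 0.01            # time step (arbitrary units)
k_digest = 0.05      # smooth decay of stomach fullness through digestion
bout_size = 1.0      # food added per feeding bout
h0, beta = 0.5, 3.0  # baseline bout hazard and its suppression by fullness

fullness, t, bouts = 0.0, 0.0, []
while t < 500.0:
    # Hazard of initiating a bout falls as the stomach fills up.
    hazard = h0 * np.exp(-beta * fullness)
    if rng.random() < hazard * dt:
        fullness += bout_size
        bouts.append(t)
    fullness = max(0.0, fullness - k_digest * fullness * dt)  # digestion
    t += dt

print(len(bouts), 'feeding bouts; first few times:', np.round(bouts[:5], 1))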

You can read a more extensive article discussing this paper on the Imperial news website. The paper itself is available for free in the journal PLoS Biology here. Tom, Kevin and Nick


Finding communities in time series

In many complex systems, the exact relationship between entities is unknown and unobservable. Instead, we may observe interdependent signals from the nodes, such as time series. Current methods for detecting communities (i.e. groups of nodes that are more closely related to one another than to the rest of the network) when edges are unobservable typically involve a complicated three-stage process: choose a measure of similarity between pairs of time series, convert the similarity matrix into a (weighted) network, and finally infer community structure. This approach is computationally expensive, and each stage computes only point estimates, making it difficult to distinguish genuine structure from noise.
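To make the contrast concrete, here is a bare-bones Python version of that conventional three-stage pipeline (similarity measure, threshold, and community-detection routine are our illustrative choices):

import numpy as np
import networkx as nx
from networkx.algorithms import community

rng = np.random.default_rng(3)

# Toy data: 30 time series of length 500 with two planted groups.
drivers = rng.standard_normal((2, 500))
X = np.vstack([drivers[i % 2] + 0.8 * rng.standard_normal(500)
               for i in range(30)])

# Stage 1: choose a pairwise similarity measure (here: Pearson correlation).
S = np.corrcoef(X)

# Stage 2: convert the similarity matrix into a network (here: threshold it).
A = (np.abs(S) > 0.3).astype(int)
np.fill_diagonal(A, 0)
G = nx.from_numpy_array(A)

# Stage 3: infer community structure on the resulting graph.
comms = community.greedy_modularity_communities(G)
print([sorted(c) for c in comms])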

Discovering clusters in financial time series 

We propose a Bayesian hierarchical model for multivariate time-series data that provides an end-to-end community detection algorithm and propagates uncertainties directly from the raw data to the community labels. Our approach naturally supports multiscale community detection and enables community detection even for short time series. We uncover salient communities both in financial return time series of S&P100 stocks and in climate data from the United States. You can read the article for free here in Science Advances under the title "Community detection in networks without observing edges". This was fun work with our collaborators Leto Peel and Renaud Lambiotte. Till and Nick


Stochastic Survival of the Densest: viewing defective mitochondria as altruistic helps us understand their expansion

With age, our skeletal muscles (e.g. the muscles of our legs and arms) work less well. In some people, there is a substantial loss of strength an...