Thursday, 29 October 2020

catch22 features of signals

By taking repeated measurements over time we can study the dynamics of our environment – be it the mean temperature of the UK by month, the daily opening prices of stock markets, or the heart rate of a patient in intensive care. The resulting data consist of an ordered list of single measurements, which we will call a ‘time series’ from now on. Time series can be long (many measurements) and complex, so to make the most of the gathered data we often want to summarise the captured sequences. For example, we might collapse the 12 monthly mean temperatures for each of the past 100 years to a yearly average. This would remove the effects of the seasons, reduce 12 measurements per year to 1, and let us quickly compare temperatures across many years without studying each monthly value. Taking the average of a time series is a very simple example of a so-called ‘time-series feature’: an operation that takes an ordered series of measurements as input and gives back a single number quantifying one particular property of the data. By constructing a set of appropriate features, we can compare, distinguish and group many time series quickly, and even understand in which aspects (i.e., features) two time series are similar or different.
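To make this concrete, here is a minimal sketch in Python of what a time-series feature looks like as code. The particular features below (mean, spread, longest run above the mean) are illustrative examples only, not drawn from any specific feature set:

```python
import statistics

def feature_mean(ts):
    """Average value of the series -- the simplest possible feature."""
    return sum(ts) / len(ts)

def feature_std(ts):
    """Spread of the series around its mean."""
    return statistics.pstdev(ts)

def feature_longest_run_above_mean(ts):
    """Length of the longest consecutive stretch above the series mean."""
    m = feature_mean(ts)
    longest = current = 0
    for x in ts:
        current = current + 1 if x > m else 0
        longest = max(longest, current)
    return longest

# Twelve monthly mean temperatures collapsed to single numbers:
monthly = [4.1, 4.5, 6.8, 9.2, 12.4, 15.3, 17.5, 17.2, 14.6, 11.0, 7.2, 4.8]
print(feature_mean(monthly))                   # the yearly average
print(feature_longest_run_above_mean(monthly)) # length of the 'warm season'
```

Each function collapses a whole series to one number; a feature set is simply a list of such functions applied in turn.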


Over the past decades, thousands of such time-series features have been developed across different scientific and industrial disciplines, many of which are much more sophisticated than an average over measurements. But which features should we choose from this wealth of options for a given data set of time series? Do features exist that can characterise and meaningfully distinguish sequences from a wide range of sources?


We here propose a selection procedure that tailors feature-sets to given collections of time-series datasets and that can identify features which are generally useful for many different sequence types. The selection is based on the rich collection of 7500+ diverse candidate features previously gathered in the comprehensive ‘highly comparative time-series analysis’ (hctsa) toolbox (paper here) from which we automatically curate a small, minimally redundant feature subset based on single-feature performances on the given collection of time-series classification tasks.

Figure 1: The selected 22 features perform only slightly worse than the full (pre-filtered) set of 4,791. A: Scatter of classification accuracy on each dataset; error bars signify standard deviation across folds. B: Mean execution times for time series of length 10,000. C: Near-linear scaling of computation time with time-series length.



By applying our pipeline to a standard library of 93 classification problems from the data-mining literature (UEA/UCR), we compiled a set of 22 features (catch22) that we then implemented in C and wrapped for R, Python, and Matlab. The 22 resulting features individually possess discriminative power and perform only ~10% worse than the full hctsa feature set on the considered data, at a roughly 1,000-fold reduction in computation time; see Fig. 1.


As the UEA/UCR datasets mainly consist of short, aligned, and normalised time series, the features are especially suited to data with these characteristics. The selection pipeline can be applied to other collections of time-series datasets with different properties to generate new, different feature sets, and can further be adapted to performance metrics other than classification accuracy to select features for analyses such as clustering, regression, etc.


See the full paper for all the details, for free, under the title "catch22: CAnonical Time-series CHaracteristics Selected through highly comparative time-series analysis" in the journal Data Mining and Knowledge Discovery. The catch22 feature set is on GitHub. Carl, Ben, Nick



Universal approaches to measuring social distances and segregation

How people form connections is a fundamental question in the social sciences. Peter Blau offered a powerful explanation: people connect based on their positions in a social space. Yet a principled measure of social distance remains elusive. Based on a social network model, we develop a family of intuitive segregation measures formalising the notion of distance in social space.

The Blau space metric we learn from connections between individuals offers an intuitive explanation for how people form friendships: the larger the distance, the less likely they are to share a bond. It can also be employed to visualise the relative positions of individuals in the social space: a map of society.

Using US and UK survey data, we show that the social fabric is relatively stable across time. Physical separation and age have the largest effect on social distance, with implications for intergenerational mixing and isolation in later stages of life. You can read about our work "Inference of a universal social scale and segregation measures using social connectivity kernels" for free here and in the Journal of the Royal Society Interface here. Till and Nick.

Tuesday, 13 October 2020

Predicting and controlling rat feeding

We made a rather surprising discovery: a stochastic model of rat feeding was quite predictive. Using a slowly and smoothly varying quantity (stomach fullness), we were able to make good predictions of when the next feeding bout would occur. With the parameters of this model fitted to diverse rats, we could then study the effects of various drugs and explore in silico approaches to lowering food intake.

You can read a more extensive article discussing this paper on the Imperial news website. The article is available for free in the journal PLoS Biology here. Tom, Kevin and Nick

Finding communities in time-series

In many complex systems, the exact relationship between entities is unknown and unobservable. Instead, we may observe interdependent signals from the nodes, such as time series. Current methods for detecting communities (i.e. nodes that are more closely related to one another than to the rest of the network) when edges are unobservable typically involve a complicated process: choose a measure to assess the similarity between pairs of time series, convert the similarity matrix to a (weighted) network, and, finally, infer community structure. This approach is computationally expensive and each step of the three-stage process computes point estimates, making it difficult to distinguish genuine structure from noise.
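As a rough illustration of that conventional three-stage pipeline (not our proposed method), here is a sketch in plain Python, using Pearson correlation as the similarity measure, a hard threshold to build an unweighted graph, and connected components as a crude stand-in for a community-detection step. Note that every stage commits to a point estimate, and the threshold in particular is arbitrary:

```python
from itertools import combinations

def pearson(x, y):
    """Stage 1: a similarity measure between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

def three_stage_communities(series, threshold=0.8):
    """Stage 2: threshold similarities into an unweighted graph.
    Stage 3: report connected components as 'communities'."""
    n = len(series)
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if pearson(series[i], series[j]) > threshold:
            adj[i].add(j)
            adj[j].add(i)
    seen, comms = set(), []
    for s in range(n):
        if s in seen:
            continue
        stack, comp = [s], set()
        while stack:                 # depth-first search for the component
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj[u] - comp)
        seen |= comp
        comms.append(sorted(comp))
    return comms

# Two noisy copies of the same trend, plus one unrelated series:
a = [0, 1, 2, 3, 4, 5]
b = [0.1, 1.0, 2.2, 2.9, 4.1, 5.0]
c = [5, 0, 4, 1, 3, 2]
print(three_stage_communities([a, b, c]))  # [[0, 1], [2]]
```

The point of the paper is to replace this chain of point estimates with a single Bayesian model, so that uncertainty flows from the raw data through to the community labels.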

Discovering clusters in financial time series 

We propose a Bayesian hierarchical model for multivariate time series data that provides an end-to-end community detection algorithm and propagates uncertainties directly from the raw data to the community labels. Our approach naturally supports multiscale community detection and enables community detection even for short time series. We uncover salient communities in both financial returns time series of S&P100 stocks and climate data in the United States. You can read the article for free here in Science Advances under the title Community detection in networks without observing edges. This was fun work with our collaborators Leto Peel and Renaud Lambiotte. Till and Nick

Monday, 25 May 2020

Why are fungal networks so widespread?

You can read about our recent paper on fungal networks in this blog article in Nature Ecology and Evolution. Confusingly, the article itself is available for free in Nature Communications.

Luke, Mark and Nick

Thursday, 25 July 2019

Mitochondrial networks and Ageing in the Variance

Mitochondrial DNA (mtDNA) populations within our cells encode vital energetic machinery. MtDNA is housed within mitochondria, cellular compartments lined by two membranes, that lead a very dynamic life. Individual mitochondria can fuse when they meet, and fused mitochondria can fragment to become individual smaller mitochondria, all the while moving throughout the cell. The reasons for this dynamic activity remain unclear (we’ve compared hypotheses about them before here and here, with blog articles here). But what influence do these physical mitochondrial dynamics have on the genetic composition of mtDNA populations?

MtDNA populations can, naturally or as a result of gene therapies, consist of a mixture of different mtDNA types. Typically, different cells will have different proportions of, say, type A and type B. For example, one cell may be 20% type A, another cell may be 40% type A, and a third may be 70% type A. This variability matters because when a certain threshold (often around 60%) is crossed for some mtDNA types, we get devastating diseases.

We previously showed mathematically (blog) and experimentally (blog) that this cell-to-cell variability in mtDNA proportions (often called “heteroplasmy variance” and sometimes referred to via the “mtDNA bottleneck”) is expected to increase linearly over time. However, that analysis pictured mtDNAs as individual molecules, outside their mitochondrial compartments. When mitochondria fuse to form larger compartments, their mtDNA is more protected: smaller mitochondria (and their internal mtDNA) are subject to greater degradation. More degradation means more replication, and more opportunities for the fraction of a particular type of mtDNA to change per unit time. In a new paper here in Genetics, we show (using a mathematical tour de force by Juvid) that this protection can dramatically influence cell-to-cell mtDNA variability. Specifically, the rate of heteroplasmy variance increase is scaled by the proportion of mitochondria that exist in a fragmented state. (It turns out that it's the proportion of mitochondria that are fragmented that matters -- not whether the rate of fission-fusion is fast or slow.)
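The flavour of this result can be captured with a toy simulation (an illustrative sketch only, not the model from the paper): each cell holds a fixed-size population of mtDNA molecules, and each time step a turnover event -- degrade one random molecule, replicate another -- hits a cell with probability equal to the fragmented fraction f, since only fragmented mitochondria are exposed to degradation. Cell-to-cell heteroplasmy variance then grows roughly in proportion to f:

```python
import random

def heteroplasmy_variance(f, n_cells=500, n_mtdna=100, steps=200, seed=1):
    """Moran-type turnover where a fraction f of mitochondria is fragmented
    and hence exposed to degradation/replication each step. Returns the final
    cell-to-cell variance of heteroplasmy (fraction of type-A molecules)."""
    rng = random.Random(seed)
    het = [0.5] * n_cells            # all cells start at 50% type A
    for _ in range(steps):
        for c in range(n_cells):
            if rng.random() < f:     # a turnover event hits this cell
                h = het[c]
                # degrade one random molecule, then replicate a random survivor
                lost_a = rng.random() < h
                gained_a = rng.random() < (n_mtdna * h - lost_a) / (n_mtdna - 1)
                het[c] = h + (gained_a - lost_a) / n_mtdna
    m = sum(het) / n_cells
    return sum((h - m) ** 2 for h in het) / n_cells

# More fragmentation -> faster spread of heteroplasmy across cells:
print(heteroplasmy_variance(0.2), heteroplasmy_variance(0.8))
```

In this caricature only the exposure to turnover matters, which echoes the paper's finding that it is the fragmented proportion, not the fission-fusion rate, that sets the variance growth rate.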

This has knock-on effects for how the cell can best get rid of low-quality mutant mtDNA. In particular, if mitochondria are allowed to fuse based on their quality (“selective fusion”), we show that intermediate rates of fusion are best for removing mutants. Too much fusion, and all mtDNA is protected; too little, and good mtDNA cannot be sorted from bad mtDNA using the mitochondrial network. This mechanism could help explain why we see different levels of mitochondrial fusion in different conditions. More broadly, this link between mitochondrial physics and genetics (which we’ve also speculated about here (blog) and here) suggests one way that selective pressures and tradeoffs could influence mitochondrial dynamics, giving rise to the wide variety of behaviours that remain unexplained. Juvid, Nick, and Iain

Friday, 19 July 2019

The cell's power station policy

Our cells are filled with populations of mitochondrial DNA (mtDNA) molecules, which encode vital cellular machinery that supports our energy requirements. The cell invests energy in maintaining its mtDNA population, like us using electricity-powered tools to help maintain our power stations. Our cellular power stations can vary in quality (for example, mutations can damage mtDNA), and are subject to random influences. How should the cell best invest energy in controlling and maintaining its power stations? And can we use this answer to design better therapies to address damaged mtDNA?

In a new paper "Energetic costs of cellular and therapeutic control of stochastic mitochondrial DNA populations" free here in PLoS Computational Biology, we attempt to answer this question using mathematical modelling, linking with genetic experiments done by our excellent collaborators at Cambridge (Payam Gammage, Lindsey Van Haute and Michal Minczuk). We first expand a mathematical model for how diverse mtDNA populations within cells change over time – building new power stations and decommissioning old ones, under the “governance” of the cell. We then produce an “energy budget” for the cellular “society” – describing the costs of building, decommissioning, and maintaining different power stations, and the corresponding profits of energy generation.

We find some surprising results. First, it can get harder to maintain a good energy budget in a tissue (a collection of individual cellular “societies”) over time, even if demands stay the same and average mtDNA quality doesn’t change. This is because the cell-to-cell variability in mtDNA quality does increase, carrying with it an added energetic challenge. This increased challenge could be a contributing factor to the collection of problems involved in ageing.

An overview of our approach. A mathematical model for the processes and "budget" involved in controlling mtDNA populations makes a general set of biological predictions and explains gene-therapy observations

Next, we found that cells with only low-quality mtDNA can perform worse than cells with a mix of low- and high-quality mtDNA. This is because low-quality mtDNA may consume less cellular resource, although global efficiency is decreased. Linked to this, removal of low-quality mtDNA (decommissioning bad power stations) alone is not always the best strategy to improve performance. Instead, jointly elevating low- and high-quality mtDNA levels, avoiding this detrimental mixed regime, is the best strategy for some situations. These insights may help explain some of the negative effects recently observed in cells with mixed mtDNA populations.

Our theory suggests that mixed mtDNA populations may do worse than pure ones, even if the pure population is a low-functionality mutant. Image from Hanne's post here 

We identified how best to control cellular mtDNA populations across the full range of possible populations, and used this insight to link with exciting gene therapies where low-quality mtDNA is preferentially removed through an experimental intervention (using so-called “endonucleases” to cut particular mtDNA molecules). We found that strong, single treatments will be outperformed by weaker, longer-term treatments, and identified how the mtDNA variability we know is present can practically affect the outcome of these therapies. We hope that the principles found in this work both add to our basic understanding of ageing and mixed (“heteroplasmic”) mitochondrial populations, and may inform more efficient therapeutic approaches in the future. Iain, Hanne, Nick
(Hanne's also written a post about this paper, you can read it here)

Wednesday, 27 February 2019

Guessing the spreading time of rumours

Social scientists are fascinated by social influence. That is, how people's beliefs, opinions and actions are influenced by others. This is relevant for understanding voting, health behaviour or opinions on issues like vaccination and climate change (topics our group is interested in). Mathematically inclined social scientists often interpret social influence using network theory.  Networks or graphs are used to represent systems consisting of many individual units, known as nodes, and the interactions between them, which are referred to as edges or links. In social networks the nodes represent people and the links represent social ties such as friendships.

Given a particular graph there are tools for modelling how opinions and beliefs can spread through it. However, in practice we often don’t know the structure of the social network itself. This could be because: (i) the data we would like is unavailable; (ii) privacy concerns about social network data mean we can't share it even if we have it; or (iii) the data exists but is full of errors or omissions. Fortunately, we know a lot about the structure of social networks from decades of past research by social scientists and statisticians. For example, many social networks are known to be homophilous - this means that people who are similar to each other are more likely to share a social connection (e.g. many of your friends are probably a similar age to you).

Inspired by this, we consider a simple mathematical model for homophilous networks known as a Random Geometric Graph (RGG). In an RGG the nodes are assigned random positions in a (unit) box. Nodes are connected to all the nodes which are within a set distance (see figure), which we refer to as the connection radius. Positions of nodes may represent the positions of individuals in geographic space or in some “social space” where the coordinate axis might represent attributes such as age, income and education level. Since social networks are homophilous we will expect those who are closer together in “social space” to share a social tie.
Example of a Random Geometric Graph with 100 nodes and a connection radius of 0.2.
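The construction just described takes only a few lines of code. A sketch in plain Python, with parameters as in the figure:

```python
import random
from itertools import combinations

def random_geometric_graph(n, radius, seed=None):
    """Scatter n nodes uniformly in the unit square and link every pair
    whose Euclidean distance is at most `radius`."""
    rng = random.Random(seed)
    pos = [(rng.random(), rng.random()) for _ in range(n)]
    edges = [
        (i, j) for i, j in combinations(range(n), 2)
        if (pos[i][0] - pos[j][0]) ** 2 + (pos[i][1] - pos[j][1]) ** 2
           <= radius ** 2
    ]
    return pos, edges

pos, edges = random_geometric_graph(100, 0.2, seed=42)
print(len(edges))  # the edge count varies from draw to draw
```

In the social-space reading, the two coordinates would stand for attributes such as age and income, and an edge is a social tie between similar individuals.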
One basic question we can ask about a network is: “how long does it take something to spread across it?” We refer to this as the diffusion timescale. The diffusion timescale in a graph is indicative of how well connected the graph is and governs how quickly we might expect a disease, rumour or the adoption of a new behaviour to spread through it (or even how long it will take a zombie apocalypse to take hold). In our recent research we focus on the question:

“If we do not know the network (but perhaps know some of its properties) how precisely can we know the diffusion timescale?”

We show that different RGGs drawn at random with the same number of nodes and connection radius can have very different diffusion timescales. This implies that if we don’t have a good grasp of the graph structure then it could be difficult to predict the outcomes of processes such as the spread of an opinion through a social network. Or alternatively we can gain lots of extra information about diffusion timescales if we happen to know the social co-ordinates of individuals. On the other hand, we do find some classes of RGGs where the diffusion time scale is very predictable given only knowledge of the number of nodes and the connection radius.

Our work helps put limitations on how accurately we can forecast the outcome of processes on networks given the available data (which is always imperfect). Future work may involve asking the same questions for real world datasets. In addition, most of our new results were obtained through computer simulations, meaning that there is also scope for more theory.

You can read about our research in the paper “Large algebraic connectivity fluctuations in spatial network ensembles imply a predictive advantage from node location information” for free here or for not-free here in Physical Review E. Matt and Nick.

How mitochondria can vary, and consequences for human health

Mitochondria are components of the cell which are involved in generating “energy currency” molecules called ATP across much of complex life. Since many mitochondria exist within single cells (often hundreds or thousands), it is possible for the characteristics of individual mitochondria to vary within cells, and within tissues. This variation of mitochondrial characteristics can affect biological function and human health.

Since mitochondria possess their own, small, circular, DNA molecules (mtDNA), we can split mitochondrial characteristics into two categories: genetic and non-genetic. In our review, we discuss a number of aspects in which mitochondria vary, from both genetic and non-genetic perspectives. 

In terms of mitochondrial genetics, the amount of mtDNA per cell is variable. When a cell divides, each daughter receives a share of the parent's mtDNA, but the split isn't precisely 50/50, so cell division can cause variability in the number of mtDNAs per cell. As mtDNAs are replicated and degraded over time, errors in the copying process may give rise to mtDNA mutations, which may spread throughout a cell. Factors such as the total amount of mtDNA, the rate of degradation/replication, the mean fraction of mutants, and the extent of fragmentation in the mitochondrial network can all influence how variable the fraction of mutated mtDNAs becomes through time (see here for a preview of some upcoming work on this topic). The total amount, and mutated fraction, of mtDNAs are implicated in diseases such as neurodegeneration, as well as the ageing process.
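A common modelling assumption for this unequal split (used here purely for illustration; it is not the only possibility) is binomial partitioning: each of the parent's molecules goes to a given daughter independently with probability 1/2, so a daughter receives n/2 copies on average, but with variance n/4 rather than zero:

```python
import random

def partition_variance(n_mtdna=1000, n_divisions=2000, seed=0):
    """Simulate many divisions of a parent with n_mtdna molecules; each
    molecule lands in daughter 1 with probability 1/2. Returns the mean and
    variance of daughter 1's copy number across divisions."""
    rng = random.Random(seed)
    counts = [
        sum(rng.random() < 0.5 for _ in range(n_mtdna))
        for _ in range(n_divisions)
    ]
    m = sum(counts) / n_divisions
    v = sum((c - m) ** 2 for c in counts) / n_divisions
    return m, v

m, v = partition_variance()
# Binomial(n, 1/2): mean n/2, variance n/4 -- the split is never exactly 50/50
print(m, v)
```

Even this perfectly unbiased mechanism therefore injects copy-number variability into a cell population at every division.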

Apart from genetic variations, there are many non-genetic features of mitochondria which also vary within and between cells. Changes in mtDNA sequence can change the amino-acid sequence of the proteins encoded by mtDNA, causing structural changes in the molecular machines which generate ATP. The shape of mitochondrial membranes is also highly variable, and responds to mitochondrial activity through quantities such as pH, where mitochondrial activity itself may depend on mtDNA sequence. These two examples (mitochondrial protein and membrane structure) demonstrate how the genetic state of mitochondria may influence their non-genetic characteristics. Non-genetic characteristics may also influence the genetic state: for instance, mitochondrial membrane potential can influence the probability of a mitochondrion being degraded, along with its mtDNA.

The inter-dependence of genetic and non-genetic characteristics demonstrates the complex feedback loops linking these two aspects of mitochondrial physiology. We suggest here that, since changes in mitochondrial genetics occur more slowly than most physical aspects of mitochondrial physiology, understanding mitochondrial genetics may be especially important in explaining phenomena such as ageing, which appears to be closely related to mitochondrial heterogeneity. You can freely access our work, recently published in Frontiers in Genetics, as “Mitochondrial Heterogeneity”. Juvid, Iain and Nick

Saturday, 20 October 2018

Mutated islands of brain matter from development might be common in the human population

You are made up of a lot of cells, and so is your brain. You were also derived from a single cell: the union of a sperm and an egg. In order for your body to grow from a single cell into an adult human, a massive amount of cell division must occur, which means that the DNA inside your cells must also be replicated intensively. In copying all of this DNA, “spelling mistakes” can sometimes be made. If that mistake occurs early enough in development, all of the subsequent cells which are copied from the mutant parent also receive this mistake, which potentially gives rise to large islands of mutated cells (called “somatic mosaicism”). If a copying error occurs at a particularly important base of DNA, this could potentially cause disease in the tissue once you have fully developed into an adult. 

Inherited mutations in certain genes are known, in rare cases, to cause neurodegenerative disease (such as Alzheimer's and Parkinson's disease). We wondered whether non-heritable “spelling mistakes” in these disease-causing genes are common enough in the human population to potentially explain the more common forms of neurodegenerative disease.

Our experimental collaborators at the University of Cambridge went searching for mutated chunks of brain matter in post-mortem samples of brains from 54 human individuals. Using genetic sequencing technology, they found evidence for these mutated islands of grey matter. However, none of these samples were pathological themselves, since only a small fraction of the brain per individual was sampled. This provided an opportunity for mathematical modelling of how the brain develops, so that we may predict the prevalence of pathological mutations in human brains, given the experimental data. 

Our mathematical model is incredibly simple (and crude -- others have developed much more sophisticated approaches): we assume that, in order to grow a brain, you take the initial cell from which you were derived, and double it repeatedly until the mass of cells corresponds to the number of cells in the brain (this is called a binary tree). Each copying event is called a “generation”, and corresponds to a row of the tree below. If a mutation occurs whilst copying the DNA of a particular cell at a particular generation, then a fixed fraction of all the subsequent daughters will also be mutated, generating a mutant region. Repeatedly simulating brain development using a computer allows us to gather statistics about the probability of an individual harbouring pathological islands of brain matter. 
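The bookkeeping of this doubling model is simple enough to write out directly. In the sketch below, a copying error in a daughter cell at generation g is inherited by all 2^(G−g) of its final descendants; the mutation rate mu and generation count G are illustrative assumptions, not measured values:

```python
# Assumptions (illustrative only): mu = probability that a single daughter-cell
# DNA copy acquires a pathological mutation; G doublings grow one cell into
# 2**G cells (2**37 ~ 1.4e11, roughly a whole brain).
mu = 1e-6
G = 37

# At generation g there are 2**g daughter-cell copies, each a chance for a
# copying error; a mutation arising there founds an island of 2**(G - g) cells.
for g in range(18, 24):
    expected_mutations = 2 ** g * mu
    island_size = 2 ** (G - g)
    print(f"generation {g}: ~{expected_mutations:.2f} mutations expected, "
          f"island size {island_size} cells")
```

With these (hypothetical) numbers, roughly one mutation is expected among the ~10^6 copies made at generation 20, founding an island of about 10^5 cells; earlier generations give larger but rarer islands, later ones smaller but more common.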

Mathematical modelling of brain development reveals that islands of pathologically mutated cells are potentially common in the population. Left: We modelled neurodevelopment as a simple binary tree, where an initial cell doubles repeatedly until the final adult brain is created. DNA copying errors are carried forward into daughter cells. Right: A typical simulated individual. Coloured circles represent islands of pathologically mutated cells in the adult human brain. Whole brain area (black circle) is not to scale with the mutated regions (coloured circles). The mutated regions are really tiny proportionately. 

Neurodevelopment is, of course, much more complicated than a series of doubling events. Amongst other effects, regions spatially re-arrange themselves, cells die, and cell division isn't always symmetric (i.e. daughter cells may not always be capable of dividing themselves). We explored several modifications to the simple model above, and found that our extrapolations were surprisingly robust. We argue that, once the developing human brain consists of about 1 million cells, as long as each daughter cell gives rise to roughly the same number of daughter cells in subsequent divisions, and that spatial mixing of the brain isn't too strong, every individual is expected to harbour about 1 pathologically mutated island of cells consisting of about 10,000 to 100,000 cells. The basic idea is that if those 1 million cells replicate once then they are really likely to have a pathological mutation crop up in one of those 1 million divisions. Larger regions may also occur, but are rarer, and conversely, smaller regions are more common (see the right panel of the figure above). This kind of argument suggests that for a whole range of possible ways in which our brain develops we’re likely to have islands of mutation. 

We also discuss an observation which emerges from the tree-structure of neurodevelopment which may allow us to directly estimate the mutation rate from a simple back-of-the-envelope calculation. Any particular experiment will have a certain detection sensitivity, in that it will be able to detect mutations common to a minimum number of cells in a sample, and no fewer (in our case this was ~0.5% of cells in a sample). Because of the tree structure of neurodevelopment, the most common mutations observed will occur at exactly the detection sensitivity: larger mutated islands become exponentially rarer, whereas smaller mutations are too small to be measurable. 

Now consider cutting a whole brain up into a number of equally sized chunks. As the size of the chunks increases (where we are able to detect mutations affecting 0.5% of each chunk), each chunk picks up mutation events that affect more cells, and that therefore sit higher up in the neurodevelopmental tree. But the number of mutated cells arising from any particular generation of the tree is a constant: mutations high up in the tree produce larger islands, but are also rarer, and these two effects precisely balance each other. Therefore, regardless of how large each chunk is, the total number of cells in which you expect to be able to detect a mutation is independent of chunk size. (It does, of course, depend on the mutation rate and the total number of bases that are sequenced.) Furthermore, the total number of detectably mutated cells equals (number of mutated chunks) x (number of mutated cells per mutated chunk), and this argument is also independent of the size of each chunk, depending only on the detection sensitivity and the fraction of detectably mutated chunks across the whole experiment. We can therefore equate the total number of mutations from any given generation with the total number of detectably mutated cells, and write down the mutation rate entirely independently of the size of each brain chunk. Another way of putting this is that (for the very simple models of the brain we describe) the quantity (fraction of chunks containing a mutation) x (sensitivity of the detection technique) is an invariant directly linked to the mutation rate (specifically, to the number of mutations expected in a single replication of the sequences studied). This doesn't depend on the size of the chunks or the size of the brain.
As such, if our experiment had half the sensitivity to mutated cells per brain chunk (so 1% instead of 0.5%), we'd have had to measure twice as many bits of brain to obtain a similar number of detectably mutated brain chunks. It's obviously crude, but helpful for insight -- and we're order-of-magnitude enthusiasts (see this, this and this).
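Plugging hypothetical numbers into this invariant shows how the pieces fit together (these values are made up for illustration; they are not the measured ones):

```python
# Hypothetical experiment: mutations detectable down to 0.5% of a chunk,
# and 8% of chunks found to carry a detectable mutation.
sensitivity = 0.005
frac_mutated_chunks = 0.08

# Invariant from the tree argument: (fraction of mutated chunks) x
# (sensitivity) ~ mutations expected per replication of the sequenced bases,
# independent of chunk size.
mu_estimate = frac_mutated_chunks * sensitivity
print(mu_estimate)  # 0.0004

# Halving the sensitivity (1% instead of 0.5%) should halve the fraction of
# detectably mutated chunks, leaving the product -- the estimate -- unchanged:
assert abs(0.04 * 0.01 - mu_estimate) < 1e-12
```

The estimate depends only on the product, which is why the argument survives any choice of chunking.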

Overall, our results suggest that pathologically mutated islands of brain matter are potentially possessed by all of us. These islands may potentially be sources of protein aggregates, which could spread in the brain and cause neurodegeneration; perhaps they’re regions which could be thought of as randomly triggering pathology sometime over our lives with the rate of triggering proportional to the size of the region. Future work is required to verify this, by direct observation of pathologically mutated islands, and mechanistic studies to quantify how large an island is “large enough” to have a high chance of inducing disease within a human lifespan. 

You can freely access our work, which has recently been published in Nature Communications, as "High prevalence of focal and multi-focal somatic genetic variants in the human brain" Juvid, Nick and our friends in the Department of Clinical Neurosciences at the University of Cambridge especially Mike, Wei and Patrick.
