Thursday, 10 April 2014

What's the difference? Telling apart two sets of signals

We are constantly observing ordered patterns all around us, from the shapes of different types of objects (think of different leaf shapes, yoga poses), to the structured patterns of sound waves entering our ears and the fluctuations of wind on our faces. Understanding the structure in observations like these have much practical utility: For example, how do we make sense of the ordered patterns of heart beat intervals for medical diagnosis, or the measurements of some industrial process for quality checking? We have recently published an article that automatically learns the discriminating structure in labeled datasets of ordered measurements (or time series or signals)---that is, what is it about production-line sensor measurements that predict a faulty process, or what is it about the shape of Eucalyptus leaves that distinguish them from other types of leaves?

Conventional methods for comparing time series (within the area of time-series data mining) involve comparing their measurements through time, often using sophisticated methods (with science fiction names like "dynamic time warping") that squeeze together pairs of time series patterns to find the best match. This approach can be extremely powerful, allowing new time series to be classified (e.g., in the case of a heart beat measurement, labelling it as a "healthy" heart beat or a "congestive heart failure"; or in the case of leaf shapes, labelling it as "Eucalyptus", "Oak", etc.), by matching them to a database of known time series and their classifications. While this approach can be good at telling you whether your leaf is a "Eucalyptus", it does not provide much insight into what it is about Eucalyptus leaves that is so distinctive. It also requires one to compare a new leaf to all other leaves in your database, which can be an intensive process. 

A) Comparing time series by alignment B) Comparing time series by their structural features: in this we probe many structural features of the time series simultaneously (ii) and then distil out the relevant ones (iii).
Our method learns the properties of a given class of time series (e.g., the distinguishing characteristics of Eucalyptus leaves) and classifies new time series according to these learned properties. It does so by comparing thousands of different time-series properties simultaneously, that we developed in previous work that we blogged about here. Although there is a one-time cost to learn the distinguishing properties, this investment provides interpretable insights into the properties of a given dataset (this kind of task is very useful for scientists when they want to understand the difference between their control data and the data from their experimental interventions) and can allow new time series to be classified rapidly. The result is a general framework for understanding the differences in structure between sets of time series. It can be used to understand differences between various types of leaves, heart beat intervals, industrial sensors, yoga poses, rainfall patterns, etc. and is a contribution to helping the data science/ big-data/ time-series data mining literature deal with...bigger data.
Each of the dots corresponds to a time series. The colours correspond to (computer generated) time series of six different types. We identify features that allow us to do a good job of distinguishing these six types.

Our work will be appearing with the name "Highly comparative, feature based, time-series classification" in the acronymically titled IEEE TKDE and you can find a free version of it here. Ben and Nick.

Wednesday, 9 April 2014

Polyominoes: mapping genotypes to phenotypes

Biological evolution sculpts the natural world and relies on the conversion of genetic information (stored as sequences, usually of DNA, called genotypes) into functional physical forms (called phenotypes). The complicated nature of this conversion, which is called a genotype-phenotype (or GP) map, makes the theoretical study of evolution very difficult. It is hard to say how a population of individuals may evolve without understanding the underlying GP map.

This is due to the two fundamental forces of evolution -- mutations and natural selection -- acting on different aspects of an organism. Mutations occur to genotypes (G), while natural selection, the ultimate adjudicator of the fate of mutations in the population, acts on the phenotype (P). Without understanding the link between these two -- the GP map -- we can't easily say, for example, how many mutations we expect important proteins within a virus strain to undergo with time, and thus how quickly the virus will evolve to be unrecognised by our immune systems.

Simple models for the mapping of genotype to phenotype have helped answer important questions for some model biological systems, such as RNA molecules and a coarse-grained model of protein folding. One important class of biological structure which has not yet been modelled in this way are protein complexes: structures formed through proteins binding together, fulfilling vital biological functions in living organisms. In this work, we introduce the "polyomino" model, based on the self-assembly of interacting square tiles to form polyomino structures. The square tiles that make up a polyomino are assigned different "sticky patches", modelling the interactions between different proteins that form a complex. A huge range of structures can be formed by varying the details of these patches, mimicking the range of protein complexes that exist in biology (though there are some obvious differences in the shapes of structures that can be formed).
Our simple model explores the interactions between protein subunits, and how these interactions shape a surface that evolution explores. (top) Sickle-cell anemia involves a mutation that changes the way proteins interact, making normally independent units form a dangerous extended structure. (bottom) Our polyomino model models this effect. The resultant dramatic effects on structure, fitness, and evolution can then be explored.
Despite its abstraction we show that the polyomino model displays several important features which make it a potentially useful model for the GP map underlying protein complex evolution. On top of this, we demonstrate that our model possesses similar properties to RNA and protein folding models, interestingly suggesting that universal features may be present in biological GP maps and that the "landscapes" upon which evolution searches may thus have general properties in common. You can find the paper free here and you can read about polominoes here and play a game here. Iain

Tuesday, 1 April 2014

Fast inference about noisy biology

Biology is a random and noisy world -- as we've written about several times before! (e.g. here and here) This often means that when we try to measure something in biology -- for example, the number of a particular type of proteins in a cell, or the size of a cell -- we'll get rather different results in each cell we look at, because random differences between cells mean that the exact numbers are different in each case. How can we find a "true" picture? This is rather like working out if a coin is biased by looking at lots of coin-flip results.

Measuring these random differences between cells can actually tell us more about the underlying mechanisms for things like (to use the examples above) the cellular population of proteins, or cellular growth. However, it's not always straightforward to see how to use these measurements to fill out the details in models of these mechanisms. A model of a biological process (or any other process in the world) may have several "parameters" -- important numbers which determine how the model behaves (the bias of a coin, is an example, telling us what proportion of times we'll see heads). These parameters may include, for example, rates with which proteins are produced and degraded. The task of using measurements to determine the values of these parameters in a model is generally called "parametric inference". In a new paper, I describe a new and efficient way of performing this parametric inference given measurements of the mean and variance of biological quantities. This allows us to find a suitable model for a system describing both the average behaviour and typical departures from this average: the amount of randomness in the system. The algorithm I propose is an example of approximate Bayesian computation (ABC) which allows us to deal with rather "messy" data: I also describe a fast (analytic) approach that can be used when the data is less messy (Normally distributed).

Parametric inference often consists of picking a trial set of parameters for a model and seeing if the model with those parameters does a good job of matching experimental data. If so, those parameters are recorded as a "good" set, otherwise, they're discarded as a "bad" set. The increase in efficiency in my proposed approach is due to the fact that we can perform a quick, preliminary check to see if a particular parameterisation is "bad", before spending more computer time on rigorously showing that it is "good". I show a couple of examples in which this preliminary checking (based on fast computation of mean results before using stochastic simulation to compute variances) speeds up the process by 20-50% on model biological problems -- hopefully allowing some scientists to grab a little more coffee time! This work will be coming out in the journal Statistical Applications in Genetics and Molecular Biology with the title `Efficient parametric inference for stochastic biological systems with measured variability' and you'll find the article (free) here. Iain