Thursday, 10 April 2014

What's the difference? Telling apart two sets of signals

We constantly observe ordered patterns all around us, from the shapes of different types of objects (think of leaf shapes or yoga poses) to the structured patterns of sound waves entering our ears and the fluctuations of wind on our faces. Understanding the structure in observations like these has much practical utility: for example, how do we make sense of the ordered patterns of heart beat intervals for medical diagnosis, or the measurements of some industrial process for quality checking? We have recently published an article describing a method that automatically learns the discriminating structure in labeled datasets of ordered measurements (or time series, or signals). That is, what is it about production-line sensor measurements that predicts a faulty process, or what is it about the shape of Eucalyptus leaves that distinguishes them from other types of leaves?

Conventional methods for comparing time series (within the area of time-series data mining) involve comparing their measurements through time, often using sophisticated methods (with science-fiction names like "dynamic time warping") that stretch and squeeze pairs of time series along the time axis to find the best match. This approach can be extremely powerful, allowing new time series to be classified (e.g., in the case of a heart beat measurement, labelling it as a "healthy" heart beat or a "congestive heart failure"; or in the case of leaf shapes, labelling it as "Eucalyptus", "Oak", etc.) by matching them to a database of known time series and their classifications. While this approach can be good at telling you whether your leaf is a "Eucalyptus", it does not provide much insight into what it is about Eucalyptus leaves that is so distinctive. It also requires you to compare a new leaf to all other leaves in your database, which can be computationally intensive.
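To make the alignment idea concrete, here is a minimal sketch of dynamic time warping combined with nearest-neighbour matching against a labeled database. The function names (`dtw_distance`, `classify_1nn`) and the toy "peak"/"trough" data are made up for illustration, and the point-wise cost (absolute difference) is just one common choice; this is not code from the article.

```python
import math

def dtw_distance(a, b):
    """Textbook dynamic time warping distance between two numeric sequences.

    Runs in O(len(a) * len(b)) time, with absolute difference as the
    point-wise cost (an illustrative choice).
    """
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of the first i points of a
    # with the first j points of b
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Either advance in a, advance in b, or advance in both
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]

def classify_1nn(query, database):
    """Return the label of the database entry closest to `query` under DTW.

    `database` is a list of (series, label) pairs. Note that every entry
    has to be compared to the query, which is the cost mentioned above.
    """
    _, label = min(database, key=lambda item: dtw_distance(query, item[0]))
    return label

# Hypothetical toy database of two labeled shapes
database = [
    ([0, 1, 2, 1, 0], "peak"),
    ([2, 1, 0, 1, 2], "trough"),
]

# A stretched version of the peak still matches it after warping
print(classify_1nn([0, 0, 1, 2, 2, 1, 0], database))
```

The classifier returns a label, but nothing in the computation says which aspect of the shape made the match, which is the interpretability gap described above.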


A) Comparing time series by alignment. B) Comparing time series by their structural features: in this approach we probe many structural features of the time series simultaneously (ii) and then distil out the relevant ones (iii).

Our method learns the properties of a given class of time series (e.g., the distinguishing characteristics of Eucalyptus leaves) and classifies new time series according to these learned properties. It does so by comparing thousands of different time-series properties simultaneously, drawing on a collection of properties that we developed in previous work and blogged about here. Although there is a one-time cost to learn the distinguishing properties, this investment provides interpretable insights into the properties of a given dataset (very useful for scientists who want to understand the difference between their control data and the data from their experimental interventions) and can allow new time series to be classified rapidly. The result is a general framework for understanding the differences in structure between sets of time series. It can be used to understand differences between various types of leaves, heart beat intervals, industrial sensors, yoga poses, rainfall patterns, etc., and is a contribution to helping the data science / big-data / time-series data mining literature deal with... bigger data.

Each of the dots corresponds to a time series. The colours correspond to (computer generated) time series of six different types. We identify features that allow us to do a good job of distinguishing these six types.
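The feature-based idea can be sketched in a few lines. This toy version uses three hand-picked features rather than thousands, and it ranks them with a simple separation score (gap between class means divided by within-class spread). Both the features and the scoring rule are assumptions chosen for illustration, as are all of the names below; they are not the selection procedure from the article.

```python
import statistics

def lag1_autocorr(x):
    """Lag-1 autocorrelation: how similar each point is to the next one."""
    mu = statistics.fmean(x)
    num = sum((x[i] - mu) * (x[i + 1] - mu) for i in range(len(x) - 1))
    den = sum((v - mu) ** 2 for v in x)
    return num / den if den else 0.0

# A tiny, hypothetical feature library (the real one has thousands)
FEATURES = {
    "mean": statistics.fmean,
    "std": statistics.pstdev,
    "lag1_autocorr": lag1_autocorr,
}

def separation(values_a, values_b):
    """Gap between class means, scaled by the within-class spread."""
    gap = abs(statistics.fmean(values_a) - statistics.fmean(values_b))
    spread = statistics.pstdev(values_a) + statistics.pstdev(values_b)
    return gap / (spread + 1e-12)  # small constant avoids division by zero

def best_feature(class_a, class_b):
    """One-time learning step: score every feature, keep the best one."""
    scores = {
        name: separation([f(s) for s in class_a], [f(s) for s in class_b])
        for name, f in FEATURES.items()
    }
    return max(scores, key=scores.get), scores

def make_classifier(class_a, class_b, label_a, label_b):
    """Build a classifier that assigns a series to the nearer class mean
    of the single selected feature."""
    name, _ = best_feature(class_a, class_b)
    f = FEATURES[name]
    mean_a = statistics.fmean([f(s) for s in class_a])
    mean_b = statistics.fmean([f(s) for s in class_b])

    def classify(series):
        value = f(series)
        return label_a if abs(value - mean_a) <= abs(value - mean_b) else label_b

    return name, classify

# Made-up training data: slowly varying versus rapidly alternating series
smooth = [[0, 1, 2, 3, 4, 3, 2, 1], [0, 2, 4, 6, 8, 6, 4, 2], [5, 4, 3, 2, 1, 2, 3, 4]]
jagged = [[0, 4, 0, 4, 0, 4, 0, 4], [8, 0, 8, 0, 8, 0, 8, 0], [1, 3, 1, 3, 1, 3, 1, 3]]

chosen, classify = make_classifier(smooth, jagged, "smooth", "jagged")
print(chosen, classify([2, 3, 4, 5, 4, 3, 2, 1]))
```

Once the feature has been selected, classifying a new series only requires computing that one feature, with no comparison against the whole database, and the name of the selected feature is itself the interpretable answer to "what's the difference?".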

Our work will appear under the title "Highly comparative feature-based time-series classification" in the acronymically titled IEEE TKDE, and you can find a free version of it here. Ben and Nick.
