While short time series datasets such as the one presented here are becoming more common, there are still few clustering methods tailored to this type of data. Here, we examine the data using two nonparametric clustering algorithms. The first is the Short Time-series Expression Miner (STEM) algorithm and software developed by Ernst et al., in which each gene is assigned to one of a set of predefined patterns based on a transformation of gene profiles into units of change. Clusters are then assigned significance levels using a permutation-based test. Second, we apply a previously proposed clustering method that uses the Partitioning Around Medoids (PAM) algorithm, which we have called the Feature Based PAM Algorithm.
It employs an innovative set of features of gene expression over time, so that the unit of analysis changes from gene expression at given time points to profile curves over the entire time horizon. Unlike alternative approaches, it does not prespecify patterns of expression and does not cluster point values using a distance measure or a model. Instead, the algorithm extracts biologically relevant features, or curve summarization measures, from each short time series and feeds these features into the PAM algorithm. PAM is very similar to the k-means algorithm; it was chosen here because it represents each cluster by a medoid, an actual data point that is most central to its cluster, rather than by the mean, which makes it more robust to outliers. This approach is designed to be both statistically powerful and biologically valid.
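To make the clustering step concrete, the following is a minimal sketch of PAM applied to feature vectors, not the implementation used in this study. It assumes Euclidean distance between feature vectors and a greedy initialization followed by swap-based refinement; the function names `euclid`, `total_cost`, and `pam` are illustrative only.

```python
from itertools import product
from typing import List, Sequence, Tuple

Point = Sequence[float]


def euclid(a: Point, b: Point) -> float:
    """Euclidean distance between two feature vectors (an assumed choice)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def total_cost(points: Sequence[Point], medoids: List[int]) -> float:
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(euclid(p, points[m]) for m in medoids) for p in points)


def pam(points: Sequence[Point], k: int) -> Tuple[List[int], List[int]]:
    """Return (medoid indices, cluster label per point)."""
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")

    # BUILD: greedily add the point that lowers the total cost the most.
    medoids: List[int] = []
    while len(medoids) < k:
        best = min(
            (i for i in range(n) if i not in medoids),
            key=lambda i: total_cost(points, medoids + [i]),
        )
        medoids.append(best)

    # SWAP: exchange a medoid for a non-medoid while the cost keeps dropping.
    improved = True
    while improved:
        improved = False
        current = total_cost(points, medoids)
        for pos, cand in product(range(k), range(n)):
            if cand in medoids:
                continue
            trial = medoids[:pos] + [cand] + medoids[pos + 1:]
            cost = total_cost(points, trial)
            if cost < current - 1e-12:
                medoids, current, improved = trial, cost, True

    # Assign each point to the cluster of its nearest medoid.
    labels = [
        min(range(k), key=lambda j: euclid(p, points[medoids[j]]))
        for p in points
    ]
    return medoids, labels


# Toy feature vectors forming two well-separated groups.
toy = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
       (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
toy_medoids, toy_labels = pam(toy, k=2)
```

Because every medoid is an index into the input, each cluster center is an observed gene profile rather than an average, which is the property that limits the influence of outliers. In practice the features would typically be put on a common scale before distances are computed; that step is omitted here.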
The idea of feature selection was first used in the context of clustering large time series data for dimension reduction, where the term dimension refers to the number of time points that describe the series. In these cases, a few well-chosen statistics describing the dynamics of the series, such as serial correlation, skewness, and kurtosis, were used to summarize the data. We also used feature selection, but in the sparse data context, as a dimension augmentation technique intended to describe each curve appropriately and provide the most complete description of the time series possible. The clustering features we propose here are based on the structural characteristics of the time course data and reflect a clear link with subject matter considerations and the questions under study. The features we used were the vector of slopes between adjacent time points, the maximum and minimum expression, the times of maximum and minimum expression, and the steepest positive and negative slopes. In a sense, they capture the global picture of an admittedly short time horizon of expression and summarize the dynamic structure of the curves.
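The feature set listed above can be sketched for a single gene profile as follows. This is an illustrative example rather than the code used in the study: the function name `extract_features` and the dictionary keys are hypothetical, ties for the maximum or minimum are resolved by taking the earliest time point, and a profile with no rising (or no falling) segment is assigned a steepest positive (or negative) slope of zero. The last two conventions are assumptions, since the text does not specify them.

```python
from typing import Dict, List, Sequence


def extract_features(times: Sequence[float],
                     expr: Sequence[float]) -> Dict[str, object]:
    """Summarize one short expression time series as curve features."""
    if len(times) != len(expr) or len(expr) < 2:
        raise ValueError("need at least two matching time points")

    # Slopes between adjacent time points.
    slopes: List[float] = [
        (expr[i + 1] - expr[i]) / (times[i + 1] - times[i])
        for i in range(len(expr) - 1)
    ]

    # Indices of the extremes; the earliest index wins on ties (assumption).
    i_max = max(range(len(expr)), key=lambda i: expr[i])
    i_min = min(range(len(expr)), key=lambda i: expr[i])

    return {
        "slopes": slopes,
        "max_expr": expr[i_max],
        "min_expr": expr[i_min],
        "time_of_max": times[i_max],
        "time_of_min": times[i_min],
        # Zero when the profile never rises / never falls (assumption).
        "steepest_pos_slope": max(max(slopes), 0.0),
        "steepest_neg_slope": min(min(slopes), 0.0),
    }


def as_vector(features: Dict[str, object]) -> List[float]:
    """Flatten the features into one numeric vector for clustering."""
    scalars = ["max_expr", "min_expr", "time_of_max", "time_of_min",
               "steepest_pos_slope", "steepest_neg_slope"]
    return list(features["slopes"]) + [float(features[k]) for k in scalars]


# A made-up four-point profile, used only to illustrate the output shape.
example = extract_features([0, 1, 2, 4], [1.0, 3.0, 2.0, 0.0])
example_vector = as_vector(example)
```

For a series with T time points this yields T - 1 slopes plus six scalar summaries, so the feature vector is longer than the original series, which is the sense in which the approach augments rather than reduces dimension.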