New, more effective option for gene data mining identified

Penn State

A new approach to identifying patterns in gene expression analysis has been shown to be more effective than the most popular method in a joint Penn State and University at Buffalo study.

Using two published gene expression data sets as test cases, the research team found that the KL clustering method, which uses a novel measure of similarity not previously used for gene expression analysis, was superior to the most popular method, hierarchical clustering, in separating the data into dense clusters with similar patterns.

In gene expression analysis, the identification of groups of genes with similar temporal patterns of expression is usually a critical step because it provides insights into gene-gene interactions and the underlying biological processes. Experiments suggest that genes with similar function may exhibit similar temporal patterns of co-regulation.

Dr. Raj Acharya, professor and head of the Department of Computer Science and Engineering at Penn State, says that, although the study was conducted with gene data, KL clustering could be applied to any large set of temporal data.

The team published their findings in a paper, "An information theoretic approach for analyzing temporal patterns of gene expression," in the March issue of the journal Bioinformatics. The authors are Jyotsna Kasturi, Penn State doctoral candidate; Acharya; and Dr. Murali Ramanathan, Department of Pharmaceutical Sciences, University at Buffalo, The State University of New York.

Kasturi explains, "We wanted gene expression data with similar patterns to be put in the same cluster with as little variation as possible, which implies dense clusters."

The team also used the Davies-Bouldin cluster validity index as the primary measure of cluster quality, along with a chi-square test to assess the similarity between the clusters obtained by the different methods.
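
For readers unfamiliar with the Davies-Bouldin index, the following is a minimal sketch of its standard definition, not the team's own code. For each cluster it takes the worst-case ratio of within-cluster scatter to between-centroid distance, then averages over clusters, so lower values indicate denser, better-separated clusters. The Euclidean distance and the toy two-dimensional points are assumptions made for illustration only.

```python
import math

def dist(a, b):
    """Euclidean distance between two points (an assumed choice of metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters (each a list of points)."""
    cents = [centroid(c) for c in clusters]
    # Scatter: average distance from each member to its cluster centroid.
    scatter = [sum(dist(p, m) for p in c) / len(c)
               for c, m in zip(clusters, cents)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst-case similarity of cluster i to any other cluster.
        total += max((scatter[i] + scatter[j]) / dist(cents[i], cents[j])
                     for j in range(k) if j != i)
    return total / k

# Hypothetical toy data: one tight, well-separated pair of clusters
# and one loose, poorly separated pair.
tight = [[[0, 0], [0, 1]], [[10, 10], [10, 11]]]
loose = [[[0, 0], [0, 6]], [[4, 4], [10, 11]]]
```

On this toy data, `davies_bouldin(tight)` is smaller than `davies_bouldin(loose)`, matching the intuition that dense, well-separated clusters score better.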

Acharya notes, "Even simple visual inspection showed that the KL method created clusters that were better separated compared to the most popular method. The evaluation with quantitative measures confirmed the visual observations."

The KL method uses the KL divergence to measure the similarity between two gene expression profiles and a self-organizing map algorithm for clustering. The clustering can be compared to creating a series of bins, each containing a differently colored ball.
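
As a rough illustration of how KL divergence can compare two expression profiles, here is a minimal sketch. It assumes each profile is non-negative and is rescaled to sum to 1 so it can be treated as a probability distribution, and it uses a symmetrized form of the divergence. Both choices are assumptions made for this example; the release does not describe how the published method preprocesses the data.

```python
import math

def normalize(profile, eps=1e-12):
    """Rescale a non-negative profile to sum to 1 (assumed preprocessing)."""
    total = sum(profile)
    if total <= 0:
        raise ValueError("profile must have a positive sum")
    # Clamp to a tiny positive value so the logarithm is always defined.
    return [max(x / total, eps) for x in profile]

def kl_divergence(p, q):
    """KL divergence D(p || q) between two normalized profiles."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def symmetric_kl(profile_a, profile_b):
    """Average of D(p || q) and D(q || p), so argument order does not matter."""
    p, q = normalize(profile_a), normalize(profile_b)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))

# Hypothetical temporal profiles (expression level at four time points).
rising = [1.0, 2.0, 4.0, 8.0]
rising_scaled = [2.0, 4.0, 8.0, 16.0]   # same shape, twice the magnitude
falling = [8.0, 4.0, 2.0, 1.0]
```

Under these assumptions, `symmetric_kl(rising, rising_scaled)` is essentially zero because the two profiles have the same shape, while `symmetric_kl(rising, falling)` is clearly larger.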

The algorithm sorts data into each bin according to how closely its "color" matches the ball already in the bin. The result is a set of bins or clusters densely packed with genes that exhibit similar patterns of expression and that appear well separated on visual inspection.
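
The bin analogy can be sketched as follows. This is a simplified winner-take-all update, not a full self-organizing map (it omits the neighborhood function that a true SOM uses), and it is not the authors' implementation. The prototypes play the role of the colored balls; the learning rate, epoch count, and toy profiles are hypothetical.

```python
import math

def to_distribution(profile):
    """Rescale a positive profile so its values sum to 1."""
    total = sum(profile)
    return [x / total for x in profile]

def kl(p, q, eps=1e-12):
    """KL divergence D(p || q), with a small eps guarding the logarithm."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def closest_bin(profile, prototypes):
    """Index of the prototype with the smallest KL divergence from profile."""
    return min(range(len(prototypes)), key=lambda k: kl(profile, prototypes[k]))

def train(profiles, prototypes, epochs=20, rate=0.3):
    """Repeatedly pull each winning prototype toward the profile it won."""
    protos = [list(p) for p in prototypes]
    for _ in range(epochs):
        for prof in profiles:
            k = closest_bin(prof, protos)
            # Convex combination keeps the prototype a valid distribution.
            protos[k] = [w + rate * (x - w) for w, x in zip(protos[k], prof)]
    return protos

# Hypothetical data: two rising and two falling temporal profiles.
profiles = [to_distribution(p) for p in [
    [1, 2, 4, 8], [1, 2, 5, 9], [8, 4, 2, 1], [9, 5, 2, 1],
]]
initial = [to_distribution([1, 1, 2, 2]), to_distribution([2, 2, 1, 1])]
prototypes = train(profiles, initial)
labels = [closest_bin(p, prototypes) for p in profiles]
```

In this toy run the two rising profiles land in one bin and the two falling profiles in the other.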

In the hierarchical method, the algorithm looks at all the data points and puts the two closest in one bin. It forms more bins by considering the remaining data two points at a time. This approach creates more bins than KL clustering, and several of those bins contain only a few data points.
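
For comparison, here is a minimal sketch of agglomerative hierarchical clustering, which repeatedly merges the two closest clusters until a target number remains. The release does not state which distance or linkage rule was used in the study, so the Euclidean distance, single linkage, and toy points below are assumptions for illustration.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Euclidean distance between two points (an assumed choice of metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(c1, c2, points):
    """Distance between clusters: the closest pair of members (assumed rule)."""
    return min(euclidean(points[i], points[j]) for i in c1 for j in c2)

def agglomerate(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # start with one point per bin
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: single_linkage(
                clusters[pair[0]], clusters[pair[1]], points),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]
    return clusters

# Hypothetical points: two tight pairs and one distant outlier.
points = [[0, 0], [0, 1], [5, 5], [5, 6], [20, 20]]
result = agglomerate(points, 3)
```

With these points the outlier ends up alone in its own bin, a small-scale version of the sparsely populated bins described above.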

###

The study was supported by grants to the Ramanathan laboratory from the National Multiple Sclerosis Society and to the Acharya laboratory from the National Science Foundation.
