A new study demonstrates how text mining of electronic health records can be used to create medical term profiles of patients. These profiles serve both to identify co-occurrence of diseases and to cluster patients into groups with highly similar clinical features. The study, carried out in Denmark by a multi-disciplinary group of bioinformaticians, systems biologists and clinicians, will be published in the open-access journal PLoS Computational Biology on 25th August 2011.
Health records contain detailed phenotypic information on the clinical profile of each individual patient; however, many of the clinical features are described in free text written by hospital staff, often covering many years of hospitalization.
"Using our text mining approach on the free text in the records, we identified roughly ten times as many medical terms characterizing each patient as were manually included by the hospital staff. Worldwide, the manually inserted medical terms in medical records are heavily biased by local practice and billing purposes. Using our method we obtained a much more fine-grained clinical characterization of each patient, which ultimately also may be very valuable for choosing personalized treatment regimes", says Professor Søren Brunak from the Technical University of Denmark and the University of Copenhagen who led the team behind the research project.
The team used the "International Classification of Diseases" (ICD) terminology, maintained by the WHO as a controlled vocabulary, as the basis for the analysis. "The fact that terminologies like ICD have been translated word by word between languages makes it possible in principle to use the same term profiles across language barriers and combine cohorts across countries", says author Professor Lars Juhl Jensen from the University of Copenhagen.
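As a rough illustration of what a medical term profile could look like, the Python sketch below matches free text against a small controlled vocabulary. It is not the text mining pipeline used in the study: the function name term_profile, the three-entry vocabulary and the sample note are all hypothetical, and the ICD-10-style codes are included for illustration only.

```python
import re

# Hypothetical toy vocabulary mapping a code to a term.
# The codes are ICD-10-style examples, not the vocabulary used in the study.
VOCAB = {
    "E11": "type 2 diabetes",
    "I10": "hypertension",
    "F32": "depression",
}


def term_profile(note, vocab=VOCAB):
    """Return the set of vocabulary codes whose term appears in the note."""
    text = note.lower()
    found = set()
    for code, term in vocab.items():
        # Whole-word match so that a term is not found inside a longer word.
        if re.search(r"\b" + re.escape(term) + r"\b", text):
            found.add(code)
    return found


# Hypothetical free-text note written for this example.
profile = term_profile("Patient with Hypertension and type 2 diabetes.")
```

A real system would also have to handle negation ("no sign of hypertension"), spelling variants and abbreviations, which simple dictionary matching ignores.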
The research group identified a large number of diseases and symptoms that co-occur much more often than would be expected from the individual frequencies of the diseases. The group subsequently mapped these correlations to the genetic level by investigating gene overlaps in protein interaction networks already linked to the individual diseases. "The aim here is to discover a possible genetic cause behind the disease correlations observed, thus interfacing the electronic patient record data directly to the DNA sequencing of human individuals", says Brunak.
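The idea of co-occurrence beyond what the individual frequencies predict can be sketched as a simple observed-to-expected ratio. The Python example below is an assumption-laden illustration, not the statistic used by the authors: it takes each patient's profile as a set of terms, assumes independence to compute the expected pair count, and uses a hypothetical function name, cooccurrence_ratios, with made-up input.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_ratios(profiles):
    """Map each co-occurring term pair to its observed/expected ratio.

    profiles: list of sets, one set of terms per patient.
    Expected count under independence: count(a) * count(b) / n_patients.
    """
    n = len(profiles)
    single = Counter()
    pair = Counter()
    for terms in profiles:
        single.update(terms)                         # patients per term
        pair.update(combinations(sorted(terms), 2))  # patients per term pair
    ratios = {}
    for (a, b), observed in pair.items():
        expected = single[a] * single[b] / n
        ratios[(a, b)] = observed / expected
    return ratios


# Hypothetical toy cohort of four patients.
cohort = [{"A", "B"}, {"A", "B"}, {"A"}, {"C"}]
ratios = cooccurrence_ratios(cohort)
```

A ratio well above 1 indicates that two terms appear together more often than chance would suggest. With real data, a significance test and a correction for multiple comparisons would be needed before drawing conclusions.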
FINANCIAL DISCLOSURE: The work carried out in this study was supported by the Villum Kann Rasmussen Foundation: www.vkr-fondene.dk, the Novo Nordisk Foundation: http://www.novonordiskfonden.dk/ and the Danish Strategic Research Council: www.fi.dk/raad-og-udvalg/det-strategiske-forskningsraad. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
COMPETING INTERESTS: The authors have declared that no competing interests exist.
CITATION: Roque FS, Jensen PB, Schmock H, Dalgaard M, Andreatta M, et al. (2011) Using Electronic Patient Records to Discover Disease Correlations and Stratify Patient Cohorts. PLoS Comput Biol 7(8): e1002141. doi:10.1371/journal.pcbi.1002141
Center for Biological Sequence Analysis
Dept. of Systems Biology, Technical University of Denmark,
Lars Juhl Jensen
University of Copenhagen
This press release refers to an upcoming article in PLoS Computational Biology. The release is provided by journal staff, or by the article authors and/or their institutions. Any opinions expressed in this release or article are the personal views of the journal staff and/or article contributors, and do not necessarily represent the views or policies of PLoS. PLoS expressly disclaims any and all warranties and liability in connection with the information found in the releases and articles and your use of such information.
PLoS Journals publish under a Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.5/), which permits free reuse of all materials published with the article, so long as the work is cited (e.g., Brinkworth RSA, O'Carroll DC (2009) Robust Models for Optic Flow Coding in Natural Scenes Inspired by Insect Biology. PLoS Comput Biol 5(11): e1000555. doi:10.1371/journal.pcbi.1000555). No prior permission is required from the authors or publisher. For queries about the license, please contact the relevant journal contact indicated here: http://www.plos.org/journals/embargopolicy.php
About PLoS Computational Biology
PLoS Computational Biology (www.ploscompbiol.org) features works of exceptional significance that further our understanding of living systems at all scales through the application of computational methods. All works published in PLoS Computational Biology are open access. Everything is immediately available subject only to the condition that the original authorship and source are properly attributed. Copyright is retained.
About the Public Library of Science
The Public Library of Science (PLoS) is a non-profit organization of scientists and physicians committed to making the world's scientific and medical literature a freely available public resource. For more information, visit http://www.plos.org.