Reporting in the journal Nature Genetics (Nature Genetics 36, 664, 01 Jul 2004), the two scientists describe how iHOP, which was developed as part of the EU-funded ORIEL and TEMBLOR projects, converts the 14 million abstracts in the PubMed (National Library of Medicine) bibliographic database into a network of interlinked references to genes, proteins, mutations, diseases and (bio)chemical compounds. By using genes and proteins as hyperlinks between sentences and articles, iHOP makes the information stored in PubMed accessible as one navigable resource.
The technology behind iHOP is a combination of state-of-the-art components and novel in-house developments. Key features are the organisation of textual and genomic information in a relational database and the use of the latest text-mining technology for the detection of biomedical entities in natural text. Production of state data is based entirely on XML technology, and because complex front-end database queries are avoided, response times are extremely fast.
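The idea of using genes as hyperlinks between sentences can be illustrated with a minimal Python sketch. This is not iHOP's implementation: the gene list, the example sentences and the function names are all hypothetical, and the plain word matching merely stands in for the text-mining components, which are not detailed here.

```python
import re
from collections import defaultdict

# Hypothetical toy dictionary of gene symbols (an assumption for illustration).
GENES = {"TP53", "MDM2", "BRCA1", "ATM"}

# Made-up example sentences, not taken from PubMed.
SENTENCES = [
    "MDM2 binds TP53 and promotes its degradation.",
    "ATM phosphorylates TP53 after DNA damage.",
    "BRCA1 is phosphorylated by ATM.",
]


def find_genes(sentence):
    """Return the known gene symbols mentioned in a sentence (exact word match)."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    return {token for token in tokens if token in GENES}


def build_index(sentences):
    """Map each gene symbol to the positions of the sentences that mention it."""
    index = defaultdict(list)
    for position, sentence in enumerate(sentences):
        for gene in sorted(find_genes(sentence)):
            index[gene].append(position)
    return dict(index)


def linked_genes(index, sentences, gene):
    """Genes that share at least one sentence with `gene`: the 'hyperlinks'."""
    neighbours = set()
    for position in index.get(gene, []):
        neighbours |= find_genes(sentences[position])
    neighbours.discard(gene)
    return neighbours


index = build_index(SENTENCES)
```

In this toy data, looking up TP53 leads to the two sentences that mention it, and from those sentences on to MDM2 and ATM, which is the kind of gene-to-gene step the navigation relies on.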
Connecting biomedical concepts
While conventional keyword searches result in long and not always informative lists of abstracts, navigation along this gene-guided network allows for a stepwise and controlled exploration of the information space. The iHOP system shows that distant medical and biological concepts can be related by surprisingly few intermediate genes: the shortest path between any two genes involves, on average, only four steps.
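The notion of a shortest path between two genes can be made concrete with a breadth-first search over a small network. The sketch below is an illustration only: the four-gene network and the function name are hypothetical, and nothing here is claimed to reflect how iHOP itself computes path lengths.

```python
from collections import deque

# Hypothetical toy network: each gene maps to the genes it co-occurs with.
NETWORK = {
    "MDM2": {"TP53"},
    "TP53": {"MDM2", "ATM"},
    "ATM": {"TP53", "BRCA1"},
    "BRCA1": {"ATM"},
}


def shortest_path(network, start, goal):
    """Breadth-first search; returns the list of genes from start to goal, or None."""
    if start == goal:
        return [start]
    previous = {start: None}  # records how each visited gene was reached
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in sorted(network.get(current, ())):
            if neighbour in previous:
                continue  # already visited
            previous[neighbour] = current
            if neighbour == goal:
                # Walk back from the goal to the start, then reverse.
                path = [neighbour]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None  # no connection between the two genes


path = shortest_path(NETWORK, "MDM2", "BRCA1")
steps = len(path) - 1  # number of links followed along the path
```

Breadth-first search visits genes in order of increasing distance from the start, so the first route it finds to the goal is one with the fewest steps.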
Hoffmann and Valencia expect this highly connected network to trigger a revolution in new text-mining tools that will bring biomedicine within closer reach of both the scientific community and the wider public.
iHOP is available at: http://www.pdg.cnb.uam.es/UniPub/iHOP/.
iHOP was financed by the ORIEL (IST-2001-32688) and TEMBLOR (QLRT-2001-00015) projects of the European Community.
ORIEL (Online Research Information Environment for the Life Sciences) is a project focused on digital information management in the biosciences (Contract number IST-2001-32688). ORIEL responds to the needs of biologists to deal with a growing stream of information generated by genomics, imaging and other data-intensive technologies by providing tools to manage large, complex, multimedia datasets and to navigate through an increasingly intricate and potentially confusing information landscape.