For a quarter century, GenBank has helped advance scientific discovery worldwide. Established by the National Institutes of Health (NIH) in 1982, the database of nucleic acid sequences is one of the key tools that scientists use to conduct biomedical and biologic research. Since its creation, GenBank has grown at an exponential rate, doubling in size every 18 months. In celebration of this vital resource and its contribution to science over the last 25 years, the National Center for Biotechnology Information, National Library of Medicine (NLM), NIH, is holding a two-day conference on GenBank.
The conference will take place April 7-8, 2008, at the Natcher
Conference Center on the main NIH campus in Bethesda, Maryland. For
details on the meeting, see the conference Web site, at
The conference will bring together a slate of world-renowned scientists in molecular biology, genetics, bioinformatics and other areas to discuss GenBank's applications, the discoveries it has enabled, its history, and future directions. Speakers include Rich Roberts, Ph.D., a Nobel Prize winner for his discoveries of split genes, and currently Chief Scientific Officer at New England BioLabs; Sydney Brenner, Ph.D., a Nobel Prize winner for his work on genetic regulation of organ development and programmed cell death, and currently a professor at the Salk Institute; Francis Collins, M.D., Ph.D., who led the Human Genome Project and is Director of NIH's National Human Genome Research Institute; and Craig Venter, Ph.D., who led the private-sector effort to sequence the human genome and is President of the J. Craig Venter Institute. More than a dozen other eminent scientists will be speaking; the full list of presenters can be viewed at the GenBank conference Web site.
"Each day, researchers across the world submit tens of thousands of sequences to GenBank and collaborating databases in Europe and Japan," said Donald A. B. Lindberg, M.D., Director of the National Library of Medicine. "Because of these contributions, GenBank has become an essential tool for molecular biology. The National Library of Medicine is proud to partner with the research community in making this valuable resource available."
Rich Roberts, Ph.D., Chief Scientific Officer at New England BioLabs, commented, "GenBank has provided a foundation upon which much of contemporary biology is now based. It is becoming almost impossible to conceive of any serious biological study of a new organism that does not begin with the determination of its DNA sequence, which of course must be stored in GenBank." Roberts, one of the early proponents of the database, added, "the availability of this wealth of sequence information in a single repository is something we could only dream about in 1979 at the Rockefeller Conference that led to its creation and which we could not imagine being without today."
When scientists first began sequencing proteins and DNA, the process was expensive and time-consuming, so researchers usually limited their sequencing to the genes and proteins in which they had a particular interest. A small number of groups began collecting sequence data and sometimes made comparisons that led to serendipitous discoveries, for example that two proteins were evolutionarily related.
By the late 1970s, a consensus was emerging about the need for an international computer database of nucleic acid sequence data. In particular, a 1979 workshop sponsored by the National Science Foundation and held at Rockefeller University resulted in a call for such a database and for the development of analysis tools. NIH held a series of workshops over the following two years to define the project and subsequently issued a request for proposals. In 1982, NIH awarded a five-year contract for the nucleic acid sequence database to the private firm of Bolt, Beranek and Newman with a subcontract to Los Alamos National Laboratory, marking the official beginning of GenBank.
A significant leap forward came shortly thereafter in the area of analysis tools: In early 1983, two NIH researchers (John Wilbur, M.D., Ph.D., and David Lipman, M.D.) published an algorithm that allowed data banks to be searched for sequences similar to the queried sequence in a matter of 2 or 3 minutes. This markedly accelerated the science, making it easier for researchers to routinely do sequence comparisons. Further advances in analysis tools followed, such as the 1990 introduction of BLAST (Basic Local Alignment Search Tool), which can search GenBank for similar sequences in mere seconds.
Shortly after GenBank was established, discussions began with the European Molecular Biology Laboratory (EMBL), which had established its own data bank. Within a couple of years GenBank and EMBL were collaborating, and by the mid-1980s, the DNA Data Bank of Japan (DDBJ) joined in. The three groups now exchange data daily under what is known as the International Nucleotide Sequence Database Collaboration (INSDC).
Growth of the databases was further stimulated by scientific journals, which began requiring authors to get accession numbers from GenBank, EMBL or DDBJ for articles that included sequences. In 1987, NIH issued a second five-year contract, this time to the firm of IntelliGenetics with a subcontract to Los Alamos National Laboratory. When the contract ended in 1992, GenBank was moved to the National Center for Biotechnology Information (NCBI), a division of NIH's National Library of Medicine that was established in 1988 under the leadership of BLAST co-developer David Lipman.
Today, GenBank continues to be operated by NCBI, which has integrated it with dozens of other biological databases - such as genome maps and protein structures - as well as the scientific literature (via its PubMed and PubMed Central databases) and tools for analysis. Improvements in sequencing technologies and reduced sequencing costs are resulting in massive increases in the quantity of data produced, in turn driving exponential growth in GenBank, which currently contains data on about 110 million sequences and 200 billion base pairs.
"GenBank has been a critical research tool, enabling much of the progress that has been made over the last two decades in understanding biological function and genetics," said Lipman, Director of NCBI and a speaker at the GenBank conference. "The value of the database will only expand as it continues to grow, new computational tools are introduced, and the data are further integrated with other relevant data."
What does it contain? GenBank is a comprehensive database of publicly available annotated nucleotide sequences for more than 240,000 named organisms. The sequences include messenger RNA segments with coding regions, segments of genomic DNA with a single gene or multiple genes, and entire genomes. The number of base pairs in GenBank doubles about every 18 months; currently the database includes approximately 110 million sequences and 200 billion base pairs.
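As a rough illustration of what an 18-month doubling time implies, the short Python sketch below projects the base-pair count forward from the roughly 200 billion cited above. The function name and the assumption of perfectly steady doubling are illustrative only; actual growth varies from release to release.

```python
def projected_base_pairs(current_bp: float, years: float,
                         doubling_months: float = 18.0) -> float:
    """Project database size after `years`.

    Assumes (hypothetically) that size keeps doubling exactly every
    `doubling_months` months; this is a simplification, not a forecast.
    """
    doublings = years * 12.0 / doubling_months
    return current_bp * 2.0 ** doublings


if __name__ == "__main__":
    current = 200e9  # about 200 billion base pairs, the figure cited in the text
    for years in (1.5, 3.0, 6.0):
        size = projected_base_pairs(current, years)
        print(f"after {years} years: about {size / 1e9:.0f} billion base pairs")
```

Under this assumption the count would double after 18 months, quadruple after three years, and grow sixteenfold after six years.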
Where do the data come from? GenBank is an archive of primary sequence data that has been provided by those who conduct the sequencing, mostly individual labs and large-scale sequencing projects. GenBank exchanges data daily with its two partners in the International Nucleotide Sequence Database Collaboration (INSDC): the European Bioinformatics Institute (EBI) and the DNA Data Bank of Japan (DDBJ).
What is GenBank's relationship to the Human Genome Project? Initiated in 1990, the Human Genome Project was a 13-year effort coordinated by the U.S. Department of Energy and NIH that aimed, among other things, to determine the sequences of the 3 billion chemical base pairs that make up human DNA. The sequence data were submitted to GenBank as they were generated.
What sorts of discoveries have been made using GenBank? Analyses of GenBank sequences are an indispensable and regular part of the process of characterizing gene functions, which underlies many important advances in science and medicine. GenBank has also proven invaluable in identifying disease. One example, in November 2005, involved identification of the first polio case in the U.S. since 1999. A state health laboratory in Minnesota had isolated an unknown virus from a child from an Amish community who was thought to be suffering from an intestinal virus. When the laboratory determined the virus's DNA code, it searched against the sequences in GenBank and found not only that it was a polio virus, but that it specifically matched the strain of the virus used in the Sabin oral vaccine. More recently, scientists investigating the die-off of honeybees (colony collapse disorder) ran the sequences from diseased bee hives through GenBank and found a strong correlation with Israeli acute paralysis virus.
Where can I learn more? A good place to start is the homepage for
GenBank, at http://www.
Established in 1988 as a national resource for molecular biology
information, NCBI creates public databases, conducts research in
computational biology, develops software tools for analyzing
molecular and genomic data, and disseminates biomedical information,
all for the better understanding of processes affecting human health
and disease. NCBI is a division of the National Library of Medicine
at the NIH. For more information, visit
The National Library of Medicine is the world's largest library of
the health sciences. It is located on the NIH campus in Bethesda,
Maryland. For more information, visit the Web site at
The National Human Genome Research Institute (NHGRI) led the National Institutes of Health's (NIH) contribution to the International Human Genome Project, which had as its primary goal the sequencing of the human genome. This project was successfully completed in April 2003. Now, the NHGRI's mission has expanded to encompass a broad range of studies aimed at understanding the structure and function of the human genome and its role in health and disease. Additional information about NHGRI can be found at its Web site, www.genome.gov.