The Vertebrate Genomes Project (VGP) today announces their flagship study and associated publications focused on genome assembly quality and standardization for the field of genomics. This study includes 16 diploid high-quality, near error-free, and near complete vertebrate reference genome assemblies for species across all taxa with backbones (i.e., mammals, amphibians, birds, reptiles, and fishes) from five years of piloting the first phase of the VGP project.
In a special issue of Nature, with companion papers simultaneously published in other scientific journals, the VGP details numerous technological improvements based on these 16 genome assemblies. In the flagship study, the VGP demonstrates the feasibility of setting and achieving high quality metrics for reference genomes using their state-of-the-art automated approach, which combines long-read sequencing and long-range chromosome scaffolding with novel algorithms that put the pieces of the genome assembly puzzle together.
Growing out of the decade-old mission of the Genome 10K Community of Scientists (G10K) to sequence the genomes of 10,000 vertebrate species and other comparative genomics efforts, the VGP is taking advantage of dramatic improvements in sequencing technologies in the last few years to begin production of high-quality reference genome assemblies for all ~70,000 living vertebrates. The current VGP pipelines have led to the submission of 129 diploid assemblies, representing the most complete and accurate versions for those species to date, and the project is on the path to generating thousands of genome assemblies, demonstrating feasibility not only in quality standardization but also in scale.
"When I was asked to take on leadership of the G10K in 2015, I emphasized the need to work with technology partners and genome assembly experts on approaches that produce the highest quality data possible, as it was taking months per gene for my students and postdocs to correct gene structure and sequences for their experiments, which was causing errors in our biological studies", said Erich Jarvis, lead of the VGP sequencing hub at The Rockefeller University, Chair of the G10K and a Howard Hughes Medical Institute Investigator. "For me this was not only a practical mission, but a moral imperative."
Arang Rhie, first author of the flagship paper from the National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, USA, adds, "It truly was a challenge to design a pipeline applicable to highly diverged genomes. Our largest genome, 5 Gb in size, broke almost every tool commonly used in assembly processes. The extreme level of heterozygosity or repeat contents posed a big challenge. This is just the beginning; we are continuously improving our pipeline in response to new technology improvements."
The VGP's approach combines assembly pipelines with manual curation to fix misassemblies, major gaps, and other errors, which informs the iterative development of better algorithms. For example, the VGP helped reveal high levels of false gene duplications, losses or gains, due mostly to algorithms not properly separating maternal and paternal chromosomes. One solution includes a trio binning approach of using DNA from the parents to separate out the paternal and maternal sequences in the offspring. For cases where parental data is unavailable, another solution developed by the VGP and collaborators is an algorithm called FALCON-Phase that reduces the computational complexity of phasing maternal and paternal DNA sequences at chromosome scale.
Kerstin Howe, lead of the curation team at the Wellcome Sanger Institute in the UK, says, "Our new approach to produce structurally validated, chromosome-level genome assemblies at scale will be the foundation of ground-breaking insights in comparative and evolutionary genomics."
Adam Phillippy, chair of the VGP genome assembly and informatics working group of over 100 members and head of the Genome Informatics Section of the National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, USA, adds, "Completing the first vertebrate reference genome, human, took over 10 years and $3 billion dollars. Thanks to continued research and investment in DNA sequencing technology over the past 20 years, we can now repeat this amazing feat multiple times per day for just a few thousand dollars per genome."
The excellent quality of these genome assemblies enables unprecedented discoveries that have implications for characterizing biodiversity for all life, conservation, and human health and disease. The first high-quality reference genomes of six bat species, generated with the Bat 1K consortium, revealed selection and loss of immunity-related genes that may underlie bats' unique tolerance to viral infection. This finding provides novel avenues of research to increase survivability, particularly relevant for emerging infectious diseases, such as the current COVID-19 pandemic.
Specific to conservation and in collaboration with the Māori in New Zealand and officials in Mexico, genomic analyses of the kākāpō, a flightless parrot, and the vaquita, a small porpoise and the most endangered marine mammal, respectively, suggest evolutionary and demographic histories of purging harmful mutations in the wild. The implication of these long-term small population sizes at genetic equilibrium gives hope for these species' survival.
Richard Durbin, a Professor at the University of Cambridge and lead of the VGP sequencing hub at the Wellcome Sanger Institute in the UK, says, "These studies mark the start of a new era of genome sequencing that will accelerate over the next decade to enable genomic applications across the whole tree of life, changing our scientific interactions with the living world."
Gene Myers, lead of the VGP sequencing hub at the Max Planck Institute in Dresden, Germany, elaborates, "The VGP project is at the vanguard of the creation of a genomic catalog in analogy with Linnaeus' classification of life. I and my colleagues in Dresden are excited to be contributing superb genome reconstructions with the funding of the Max-Planck Society of Germany."
The VGP involves hundreds of international scientists from more than 50 institutions in 12 countries who have worked together since the VGP was initiated in 2016, and it is exemplary in its scientific cooperation, extensive infrastructure, and collaborative leadership. Additionally, as the first large-scale eukaryotic genomes project to produce reference genome assemblies meeting a specific minimum quality standard, the VGP has become a working model for other large consortia, including the Bat 1K, Pan Human Genome Project, Earth BioGenome Project, Darwin Tree of Life, and European Reference Genome Atlas, among others.
As a next step, the VGP will continue to work collaboratively across the globe and with other consortia to complete Phase 1 of the project, which covers approximately one representative species for each of 260 vertebrate orders, separated by a minimum of 50 million years from a common ancestor with other species in Phase 1. The VGP intends to create comparative genomic resources with these 260 species, including reference-free whole genome alignments, that will provide a means to understand the detailed evolutionary history of these species and create consistent gene annotations. Genome data are primarily generated at three sequencing hubs that have invested in the mission of the VGP, including The Rockefeller University's Vertebrate Genome Lab, New York, USA; Wellcome Sanger Institute, UK; and Max Planck Institute, Germany.
Phase 2 will focus on representative species from each vertebrate family and is currently in the process of sample identification and fundraising. The VGP has an open-door policy and welcomes others to join its efforts, ranging from fundraising and sample collection to generating genome assemblies or including their own genome assemblies that meet the VGP metrics as part of our overall mission.
The VGP collaborated with and tested many protocols from genome sequencing companies, some of whose scientists are also co-authors of the flagship study, including from Pacific Biosciences, Oxford Nanopore Technologies, Illumina, Arima Genomics, Phase Genomics, and Dovetail Genomics. The VGP also collaborated with DNAnexus and Amazon to generate a publicly available VGP assembly pipeline and host the genomic data in the Genome Ark database. The genomes, annotations and alignments are also available in international public genome browsing and analyses databases, including the National Center for Biotechnology Information Genome Data Viewer, EMBL-EBI Ensembl genome browser, and UC Santa Cruz Genomics Institute Genome Browser. All data are open source and publicly available under the G10K data use policies.
Other novel biological discoveries from the 16 genomes in the flagship paper, and 25 genomes total from over 20 papers in this first wave of publications include:
- Corrections of false gene or chromosome losses, where previous assemblies missed between 30% and 50% of GC-rich protein-coding gene regulatory regions, which were considered to belong to the 'dark matter' of the genome;
- Newly identified chromosomes in the zebra finch and platypus;
- Complete and error-free mitochondrial genomes for most species, some generated in single molecule sequences without the need for assembly;
- Unusual sex chromosome evolution in monotreme mammals and birds;
- Genetic variations between humans and marmosets that have implications for marmosets as an emerging non-human primate model system for biomedical research;
- Lineage-specific changes shaping the evolution of bird and mammal genomes: duck, emu, platypus, and echidna; and
- Proposal for a universal evolution-based revised nomenclature for the oxytocin and vasotocin ligand and receptor families.
Links to all of the reports related to this package can be found on Nature's website.
###
The Rockefeller University
The Rockefeller University is one of the world's leading biomedical research universities and is dedicated to conducting innovative, high-quality research to improve the understanding of life for the benefit of humanity. The university's 70 laboratories conduct research in neuroscience, immunology, biochemistry, genomics, and many other areas. A community of 2,000 faculty, students, postdocs, technicians, clinicians, and administrative personnel work on our 16-acre Manhattan campus. Our unique approach to science has led to some of the world's most revolutionary and transformative contributions to biology and medicine. During Rockefeller's 120-year history, our scientists have won 26 Nobel Prizes, 24 Albert Lasker Medical Research Awards, and 20 National Medals of Science.
The Vertebrate Genome Laboratory at The Rockefeller University
The Vertebrate Genome Laboratory (VGL) at The Rockefeller University specializes in long-read genomic technologies. The VGL is one of the three VGP sequencing hubs. It is equipped with cutting-edge genomic technologies including several Pacific Biosciences and Oxford Nanopore sequencers, a Bionano Genomics Saphyr optical mapper, and a 10x Genomics Chromium microfluidics instrument. Composed of a team of experts in long reads and ultra-high-molecular-weight DNA, the VGL strives to find a way to decipher life's blueprint from any sample, even the most challenging. Using state-of-the-art technologies and extensive international collaborations, we are devoted to filling the gap between field scientists and geneticists. We are particularly proud to play our small part in the effort of reversing species extinction by sequencing genomes of endangered species before it is too late. Learn more about us at http://vertebrategenomelab.org and follow us on Twitter at @genomewarriors.
The Wellcome Sanger Institute
The Wellcome Sanger Institute is a world-leading genomics research centre. We undertake large-scale research that forms the foundations of knowledge in biology and medicine. We are open and collaborative; our data, results, tools and technologies are shared across the globe to advance science. Our ambition is vast - we take on projects that are not possible anywhere else. We use the power of genome sequencing to understand and harness the information in DNA. Funded by Wellcome, we have the freedom and support to push the boundaries of genomics. Our findings are used to improve health and to understand life on Earth. Find out more at http://www.sanger.ac.uk or follow us on Twitter, Facebook, LinkedIn and on our Blog.
The Max Planck Institute of Molecular Cell Biology and Genetics
The Max Planck Institute of Molecular Cell Biology and Genetics (MPI-CBG) in Dresden, Germany is one of more than 80 institutes of the Max Planck Society, an independent, non-profit organization in Germany. 500 curiosity-driven scientists from over 50 countries ask: How do cells form tissues? The basic research programs of the MPI-CBG span multiple scales of magnitude, from molecular assemblies to organelles, cells, tissues, organs, and organisms.
National Human Genome Research Institute, National Institutes of Health
NHGRI is one of the 27 institutes and centers at the National Institutes of Health. At NHGRI, we are focused on advances in genomics research. Building on our leadership role in the initial sequencing of the human genome, we collaborate with the world's scientific and medical communities to enhance genomic technologies that accelerate breakthroughs and improve lives. By empowering and expanding the field of genomics, we can benefit all of humankind. Additional information about NHGRI can be found at https://www.genome.gov.
Howard Hughes Medical Institute
The Howard Hughes Medical Institute plays an important role in advancing scientific research and education in the United States. Its scientists, located across the country and around the world, have made important discoveries that advance both human health and our fundamental understanding of biology. The Institute also aims to transform science education into a creative, interdisciplinary endeavor that reflects the excitement of real research. HHMI's headquarters are located in Chevy Chase, Maryland, just outside Washington, DC.
San Diego Zoo Wildlife Alliance
San Diego Zoo Wildlife Alliance is a nonprofit international conservation leader, committed to inspiring a passion for nature and creating a world where all life thrives. The Alliance empowers people from around the globe to support their mission to conserve wildlife through innovation and partnerships. San Diego Zoo Wildlife Alliance supports cutting-edge conservation and brings the stories of their work back to the San Diego Zoo and San Diego Zoo Safari Park--giving millions of guests, in person and virtually, the opportunity to experience conservation in action. The work of San Diego Zoo Wildlife Alliance extends from San Diego to strategic and regional conservation "hubs" across the globe, where their strengths--via their "Conservation Toolbox," including the renowned Wildlife Biodiversity Bank--are able to effectively align with hundreds of regional partners to improve outcomes for wildlife in more coordinated efforts. By leveraging these tools in wildlife care and conservation science, and through collaboration with hundreds of partners, San Diego Zoo Wildlife Alliance has reintroduced more than 44 endangered species to native habitats. Each year, San Diego Zoo Wildlife Alliance's work reaches over 1 billion people in 150 countries via news media, social media, their websites, educational resources and the San Diego Zoo Kids channel, which is in children's hospitals in 13 countries. Success is made possible by the support of members, donors and guests to the San Diego Zoo and San Diego Zoo Safari Park, who are Wildlife Allies committed to ensuring All Life Thrives.
UC Santa Cruz Genomics Institute
Comprising diverse researchers from a variety of disciplines across academic divisions, the UC Santa Cruz Genomics Institute leads UC Santa Cruz's efforts to unlock the world's genomic data and accelerate breakthroughs in health and evolutionary biology. Our platforms, technologies, and scientists unite global communities to create and deploy data-driven, life-saving treatments and innovative environmental and conservation efforts. We are revealing life's code™. Learn more at genomics.ucsc.edu.