The Conversation
Scientists are on their way to sequencing 1 million human genomes and using big data to unlock genetic secrets
A full human genome, represented here in pairs of chromosomes, offers a wealth of information, but it is difficult to relate genetics to traits or diseases. HYanWong / Wikimedia Commons
The first draft of the human genome was published 20 years ago, in 2001. It took nearly three years and cost between $500 million and $1 billion. The Human Genome Project enabled scientists to read, almost end to end, the 3 billion pairs of DNA bases, or “letters,” that biologically define a person.
This project has enabled a new generation of researchers like me, currently a postdoctoral fellow at the National Cancer Institute, to identify new targets for cancer treatments, engineer mice with human immune systems, and even create a website where anyone can navigate the entire human genome with the same ease with which you use Google Maps.
The first full genome was generated from a handful of anonymous donors in an attempt to create a reference genome that represents more than just a single individual. However, this fell far short of capturing the great diversity of human populations around the world. No two people are alike, and no two genomes are alike. If researchers are to understand humanity in all its diversity, thousands or even millions of complete genomes will have to be sequenced. Such a project is now under way.
There is tremendous genetic variation among people around the world. Flashpop / DigitalVision via Getty Images
Understanding Genetic Diversity
The wealth of genetic variation between people makes each person unique. However, genetic changes also cause many disorders and make some groups of people more susceptible to certain diseases than others.
At the time of the Human Genome Project, researchers were also sequencing the entire genomes of organisms such as mice, fruit flies, yeasts and some plants. The tremendous effort to create these first genomes resulted in a revolution in the technology required to read genomes.
Thanks to these advances, it now takes a few days and about a thousand dollars, instead of years and hundreds of millions of dollars, to sequence an entire human genome. Genome sequencing is very different from genotyping services like 23andMe or Ancestry, which examine only a tiny fraction of the locations in a person’s genome.
Advances in technology have made it possible for scientists to sequence the entire genomes of thousands of individuals from around the world. Initiatives such as the Genome Aggregation Consortium are currently working to collect and organize this scattered data. So far, this group has collected nearly 150,000 genomes that represent an incredible amount of human genetic diversity. Within that set, researchers have found more than 241 million differences in people’s genomes, with an average of one variant per eight base pairs.
Most of these variations are very rare and have no effect on a person. However, among them are variants with important physiological and medical consequences. For example, certain variants of the BRCA1 gene predispose some groups of women, such as Ashkenazi Jews, to ovarian and breast cancer. Other variants of this gene lead to above-average mortality from breast cancer in some Nigerian women.
The best way for researchers to identify these types of variants at the population level is through genome-wide association studies, which compare the genomes of large groups of people with those of a control group. But diseases are complicated. A person’s lifestyle, symptoms, and timing of onset can vary widely, and the effects of genetics on many diseases are difficult to distinguish. The predictive power of current genome research is too weak to tease apart many of these effects, because not enough genome data is available.
Understanding the genetics of complex diseases, especially those related to genetic differences between ethnic groups, is essentially a big data problem. And researchers need more data.
1,000,000 genomes
The link between genetics and disease is nuanced. However, the more genomes you can examine, the easier it is to find these connections. brian0918 / Wikimedia Commons
To meet the need for more data, the National Institutes of Health has started a program called All of Us. The project aims to collect genetic information, medical records, and health habits from surveys and wearables from more than one million people in the United States over a period of 10 years. It also aims to collect more data from underrepresented minority groups in order to facilitate the study of health disparities. The All of Us project opened for public registration in 2018, and since then more than 270,000 people have contributed samples. The project continues to recruit participants from all 50 states. Many academic laboratories and private companies are participating in this effort.
These efforts could benefit scientists in a wide variety of fields. For example, a neuroscientist might look for genetic variations associated with depression while taking exercise levels into account. An oncologist might look for variants that correlate with reduced skin cancer risk while examining the influence of ethnic background.
A million genomes and the related health and lifestyle information will provide an extraordinary wealth of data that should enable researchers to discover the effects of genetic variation on disease, not only for individuals but also for different groups of people.
The Dark Matter of the Human Genome
Another benefit of this project is that scientists can learn about parts of the human genome that are currently very difficult to study. Most genetic research has focused on the parts of the genome that code for proteins. However, these make up only about 1.5% of the human genome.
My research focuses on RNA, a molecule that converts the messages encoded in a person’s DNA into proteins. However, the RNAs that originate from the 98.5% of the human genome that does not code for proteins have a multitude of functions of their own. Some of these non-coding RNAs are involved in processes such as cancer spread, embryonic development, or control of the X chromosome in women. In particular, I’m exploring how genetic variation can affect the intricate folding that allows non-coding RNAs to do their job. Since the All of Us project encompasses all coding and non-coding parts of the genome, it will be by far the largest dataset relevant to my work and will hopefully shed light on these mysterious RNAs.
The first human genome sparked 20 years of incredible scientific progress. I think it is almost certain that a huge dataset of genomic variation will provide clues about complex diseases. Thanks to large-scale population studies and big data projects like All of Us, researchers are paving the way to answering, over the next decade, how our individual genetics affect our health.
A photo in this story has been updated to better reflect our editorial guidelines.
This article is republished from The Conversation, a non-profit news site dedicated to sharing ideas from academic experts. It was written by Xavier Bofill De Ros, National Institutes of Health.
Xavier Bofill De Ros has received funding from the National Institutes of Health (NIH). He is a member of ECUSA, an association of Spanish scientists in the USA.