The Language of Biology: How the Heck Do Scientists Assemble a Genome?

Genome sequencing
Sifting through the 3.2 billion base pairs in the human genome is no easy feat. To assist the process, scientists are using computers. Jacopo Werther

The genome—the complete suite of an organism’s DNA and genes—is likened to a blueprint for life. On the surface, this comparison provides some understanding of a biological concept. But according to some scientists, it misses the mark.

“Blueprints tend to make a lot of sense whereas genomes don’t tend to make a lot of sense,” said Professor Ian Korf, Department of Molecular and Cellular Biology and the UC Davis Genome Center.

“Other people call it the ‘book of life.’ It’s the ‘book of life,’ but it’s written in a language that nobody really understands and written probably in a way that’s sort of idiotic at times.”      

According to current estimates, roughly 75 percent of the human genome appears to be repetitive “junk” with no currently known encoding function. That means only about 25 percent of the genome is responsible for encoding the proteins necessary for all human biological functions. Still, scientists have to sift through the roughly 3.2 billion base pairs of the human genome to find clues regarding human health and disease. And they’re doing so with the help of computers.

Decoding and analyzing genomes is the bread and butter of Korf and his lab. At the Genome Center, Korf designs new technologies to help his colleagues further understand the structure and function of the various genomes they’re studying.  

“When you’re thinking about disease and trying to map genotype to phenotype and these kinds of things, there’s so many combinations of genes; it’s not a simple problem,” said Korf.

“When I tell people what I do,” he added, “I tell them that I write software that decodes the book of life.”

Human chromosomes
The genome of an organism is broken down into chromosomes. Josef Reischig

Line by line through the genome

When scientists sequence a genome, they’re taking an organism’s DNA and determining the order of its base pairs, which are represented by the letters A, C, T and G. These lettered pairs make up the rungs of the DNA double helix.
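To make the "rungs" idea concrete, here is a minimal sketch of base pairing, using the standard rule that A pairs with T and C pairs with G. The function names are illustrative and do not come from any tool mentioned in this article.

```python
# Each base on one strand pairs with a fixed partner on the other strand:
# A with T, and C with G.
PAIRING = str.maketrans("ACGT", "TGCA")


def complement(strand: str) -> str:
    """Return the partner base for every letter in the strand."""
    return strand.upper().translate(PAIRING)


def reverse_complement(strand: str) -> str:
    """Return the opposite strand as it would be read in its own direction."""
    return complement(strand)[::-1]


print(complement("ATGC"))          # TACG
print(reverse_complement("ATGC"))  # GCAT
```

Because one strand determines the other, knowing the order of letters on a single strand is enough to describe every rung.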

“The genome is broken down into chromosomes,” explained Professor Luca Comai, Department of Plant Biology and UC Davis Genome Center. “Each chromosome is a continuous, linear DNA piece. So, it’s a chapter—a relatively long chapter—in your book.”

UC Davis scientists study plant genomes—like those of maize—to get a better understanding of the effects of hybridization—the natural or artificial combination of different plant species. David Slipher/UC Davis  

Comai studies plant genomes—like those of rice and tomato—to better understand the effects of hybridization, the combination of different plant species, on genomic evolution and performance.

In humans, each cell has 46 chromosomes (23 from each parent), for a total of two copies of the basic human genome. In plants, the number of independent genomes is often two (coffee and peanut) or three (bread wheat), depending on the plant’s genetic lineage, parents and hybridization history. This means that many plant species have multiple genomes living in the nuclei of their cells. So rather than having one book of life, they have a library.

Current sequencing methods allow scientists to “extract lines out of each book,” according to Comai. From there, the genome is reassembled line by line, as if taping together the individual letters of a book that’s been torn apart by a paper shredder.

There are multiple ways to do this, and the processes are still prone to errors, so scientists need to meticulously evaluate each sequencing read, comparing it with previous reads base pair by base pair. The reads are then compiled into an order that’s representative of the organism’s genome.
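The shredded-book picture can be sketched in a few lines of code. The toy example below assumes error-free reads and no repeated passages, which real sequencing data never offers, and it uses a simple "greedy" merge of the best-overlapping reads. It is an illustration of the general idea, not the method used by the labs quoted here, and all names in it are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of fragments with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing overlaps any more; leave the pieces separate
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Four short, made-up reads taken from the same 12-letter stretch of DNA.
reads = ["ATGCGT", "GCGTAC", "GTACGA", "ACGATT"]
print(greedy_assemble(reads))  # ['ATGCGTACGATT']
```

Even in this tiny case the program compares every fragment with every other fragment, which hints at why assembling billions of letters from error-prone reads is the hard part.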

“The trick is to understand how to assemble all these bits, that really is the meaning of sequencing a genome because generating a read is trivial,” said Comai. “It’s putting them together that is the key.” 

Just the beginning

“Every technological change brings about a new way to assemble a genome,” said Korf. “The problem with assembling a genome is it’s not like when you fly to the moon and you get there and you land on the surface and you’re like, ‘Oh, I’m here. I’ve made it.’”

With the genome, you don’t actually know much from that initial landing, according to Korf. It’s quite a paradoxical problem. Korf likened it to collecting stamps or coins from a country with no historical record. How would the collector know the collection is complete if there’s nothing to compare it to? That same idea applies to genome assembly. But scientists make do by comparing their collections to others.   

“One of the ways to assess a genome—whether you’ve completed a genome—is that there are certain types of genes that exist in every organism,” Korf said. “You can’t live without these genes.” 

If those genes aren’t present in the assembled genome, then scientists know something is missing and they need to reevaluate. The end product is a readable draft genome. Despite the difficulty of establishing this genomic foundation, Korf said it’s a small problem compared to actually analyzing the genome and assigning specific functions to its genes.
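The completeness check Korf describes amounts to comparing two lists. The sketch below assumes the genes found in a draft assembly and the genes expected in every organism are already available as sets of names. The gene names are placeholders, and `completeness` is a hypothetical helper rather than a real tool's interface.

```python
def completeness(assembly_genes: set[str], essential_genes: set[str]) -> tuple[float, set[str]]:
    """Return the fraction of essential genes found and the set still missing."""
    if not essential_genes:
        raise ValueError("need at least one essential gene to check against")
    missing = essential_genes - assembly_genes
    found = len(essential_genes) - len(missing)
    return found / len(essential_genes), missing


# Placeholder names standing in for genes that every organism is expected to carry.
essential = {"core_gene_1", "core_gene_2", "core_gene_3", "core_gene_4"}
# Placeholder names standing in for genes identified in a draft assembly.
draft = {"core_gene_1", "core_gene_2", "core_gene_4", "other_gene_9"}

score, missing = completeness(draft, essential)
print(f"{score:.0%} of essential genes found; missing: {sorted(missing)}")
```

A score below 100 percent does not say where the gap is, only that the draft needs another look.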

Genome sequencing and assembly is the foundation. While it might be a shaky one, it’s a jumping-off point for exploring the potential health applications of genomics.

Animal collage
The genome, the complete suite of an organism’s DNA and genes, is likened to a book of life. Wikimedia Commons

“Editing” common misconceptions about the genome

The cost to generate a draft genome is minimal compared to what it once was. According to the National Human Genome Research Institute, the initial draft of the human genome, produced by the Human Genome Project, cost roughly $300 million in 2000. Today, a coherent draft can be generated for as little as $15,000, according to Comai.

“There’s been a democratization of producing genomes,” said Comai, who noted that it’s a new era for the field. Keeping with the book of life metaphor, he added that new methods allow for greater clarity when it comes to reading and understanding the genome. “When you read the chapter, it starts here and ends here and you can read the story,” he said.

Despite the importance of genome sequencing and assembly, public misconceptions abound. One prevalent misconception, Comai noted, is the idea that the genome of an individual is applicable to an entire species.

“If you pick one variety of maize and you sequence that genome, you may fail to see thousands of genes which are absent in that variety but present in others,” he said. “To understand what genes a species has you really have to sequence multiple individuals.”

Comai noted that’s why there was interest in sequencing 10,000 humans. Such efforts ensure that the human genomes on record represent a diverse swath of the human population.
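Comai's point about maize can be illustrated with simple set arithmetic. The sketch below assumes each sequenced variety has been reduced to a set of gene names. The varieties, gene names and the `pan_and_core` helper are all made up for the example.

```python
def pan_and_core(gene_sets: dict[str, set[str]]) -> tuple[set[str], set[str]]:
    """Return (genes seen in any individual, genes shared by all individuals)."""
    sets = list(gene_sets.values())
    if not sets:
        raise ValueError("need at least one individual")
    pan = set().union(*sets)         # every gene seen anywhere in the species
    core = set.intersection(*sets)   # genes present in every individual
    return pan, core


# Hypothetical gene content of three varieties of the same species.
varieties = {
    "variety_A": {"g1", "g2", "g3"},
    "variety_B": {"g1", "g2", "g4"},
    "variety_C": {"g1", "g3", "g5"},
}

pan, core = pan_and_core(varieties)
missed_by_a = pan - varieties["variety_A"]
print(sorted(pan))          # ['g1', 'g2', 'g3', 'g4', 'g5']
print(sorted(core))         # ['g1']
print(sorted(missed_by_a))  # ['g4', 'g5']
```

Sequencing only the first variety would miss two of the five genes in this made-up species, which is the gap that multi-individual projects aim to close.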

The increasing accessibility of affordable genome sequencing and assembly technologies means advances could be made in areas like personalized medicine, which may allow for the development of treatments based on a patient’s specific genome. Such genomic techniques are already being used for diagnosis, and many UC Davis researchers are working to better understand how personalized therapies for diseases, like cancer, can be tailored with the help of genomics.

“There’s a whole bunch of new things that are going to happen because we have whole genomes, but these things take time,” Korf said.