Improving the Accuracy of Genomic Data

Posted 19 February 2015

Imagine that you want to analyze the 3.2 billion bases of the human genome. If you recruited every undergraduate student at ASU, all 70,000 of us, to type those data into a spreadsheet at about one base per second, it would still take about 13 hours. So you develop a computer program that analyzes the data for you. But then you find out that your huge data set amplified small errors in your algorithm and gave you the wrong answer. This is the problem facing evolutionary biologists, for whom genomic data are becoming the standard for constructing reliable phylogenies (see our previous posts about the new bird and insect phylogenies). Our lab, working under Dr. Reed Cartwright, has developed a novel method that quickly analyzes genomic data and produces an accurate phylogeny, improving upon previous techniques.
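For the curious, here is the back-of-the-envelope arithmetic behind that estimate, as a quick Python check; the one-base-per-second typing rate is our assumption, not a measured figure:

```python
# Back-of-the-envelope check of the typing estimate above.
bases = 3.2e9      # bases in the human genome
typists = 70_000   # ASU undergraduates
rate = 1.0         # bases typed per second per person (assumed)

hours = bases / (typists * rate * 3600)
print(f"{hours:.1f} hours")  # 12.7 -- about 13 hours
```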
The giant panda genome was assembled using de novo techniques in 2010, but better methods of phylogeny construction are in development. Image: Wikipedia
Historically, scientists have compensated for potential inaccuracies in genome-scale data in two ways: by using better statistical tools to analyze the data after they have been acquired, or by acquiring fewer, more informative data in the first place.

In the first approach, you start with a sequenced genome in the form of short fragments (about 100 base pairs each) and develop computational algorithms to compare those sequences to a reference genome for reassembly, as Liu et al. did in their 2003 analysis of primate genomes. The reference genome is one that we know with a high level of confidence; for example, the human genome is reliably known and often used as a reference. If, however, a reference is unavailable or unreliable, you could use a computer program to assemble the sequences with a process known as de novo assembly, which Li et al. used to construct the giant panda genome in 2010. These programs, called assemblers, use graph techniques (for example, de Bruijn graphs; see the toy sketch below) to correct sequencing errors and to resolve repeated regions, which are harder to identify in short sequences than in long ones. Algorithms like this can greatly improve the accuracy of conclusions drawn from genomic data, but de novo assembly without a reference genome requires high-quality annotation of the sequences and, once the genome is reconstructed, time-consuming alignments of similar sequences to produce a phylogenetic tree.

Alternatively, you could acquire fewer data in the first place. You would need to determine which markers in a genome are informative and necessary for your question, and then obtain only those data. By reducing the size of the data set and eliminating unnecessary information, you improve accuracy without having to implement sophisticated analytical techniques. McCormack et al. used this principle in 2012 to determine the tree of placental mammals from certain markers. However, the major drawback of this method is that markers appropriate for a particular project or species most likely cannot be reused for other projects, and the ability to recycle genomic data is what reduces the cost and time of phylogenomic studies.

Our lab is working on a program that constructs phylogenetic trees more quickly and easily than either of these methods. The program, called SISRS, combines genome assembly with the identification of homologous genes to rapidly reconstruct phylogenies without the need for a reference genome or annotation. In the next post, we'll go into detail about how SISRS works and what makes it a better way to analyze genomic data.

This series is supported by NSF Grant #DBI-1356548 to RA Cartwright.
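To make the de Bruijn graph idea concrete, here is a minimal Python sketch of how an assembler links overlapping fragments. This is our own toy illustration, not code from SISRS or any production assembler; the reads, k-mer size, and function names are invented for the example, and real assemblers add error correction, repeat resolution, and far more bookkeeping.

```python
# Toy de Bruijn assembly: split reads into overlapping k-mers, link each
# k-mer's prefix to its suffix, then walk the graph to recover a contig.
from collections import defaultdict

def build_debruijn(reads, k):
    """Map each (k-1)-mer prefix to the (k-1)-mer suffixes that follow it."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

def walk(graph, start):
    """Greedily follow edges from `start`; enough for this error-free toy
    (a real assembler would search for Eulerian paths instead)."""
    contig, node = start, start
    while graph[node]:
        node = graph[node].pop()
        contig += node[-1]
    return contig

reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]  # toy error-free reads
graph = build_debruijn(reads, k=4)
# Start from a node that never appears as a suffix (the sequence start).
suffixes = {s for dests in graph.values() for s in dests}
start = next(n for n in graph if n not in suffixes)
print(walk(graph, start))  # ATGGCGTGCAAT
```

Even in this tiny example, the graph merges the three overlapping reads into one sequence; the hard part at genome scale is doing the same with millions of error-prone reads and long repeats.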

5 Comments

John Harshman · 19 February 2015

If this is anything like in silico PCR, then I'm all for it.

DS · 19 February 2015

Thanks for the McCormack reference. It's the perfect example of appropriate data versus complete data. Of course the kind of data they used is not going to be useful for phylogenetic inference at every level of divergence; no data is. But that doesn't mean that this is not the right approach to take when addressing specific phylogenetic issues.

https://www.google.com/accounts/o8/id?id=AItOawl30lI0y4I-EF3HUTblY5lhVnxhcj1stbE · 19 February 2015

For those who want to look at / use SISRS go to https://github.com/rachelss/SISRS/tree/develop

John Harshman · 19 February 2015

Actually, the McCormack et al. sort of data might in fact be useful at every level, since the sequence conservation is near-complete within the ultraconserved elements themselves (hence the name) but falls off gradually on either side to, eventually, neutral levels. So there's some spot within that continuum for everybody's needs.

DS · 21 February 2015

John Harshman said: Actually, the McCormack et al. sort of data might in fact be useful at every level, since the sequence conservation is near-complete within the ultraconserved elements themselves (hence the name) but falls off gradually on either side to, eventually, neutral levels. So there's some spot within that continuum for everybody's needs.
Absolutely. That's the point. You don't just throw in all the data and hope for the best; the signal-to-noise ratio is not going to be good enough for that approach to work. You have to subdivide the data and use the data appropriate for any given level of divergence. That way you get a good signal-to-noise ratio for every level of divergence. This is why other types of molecular data, such as SINE insertions and mitochondrial gene order, have become so popular: we know a priori what level of divergence such data is likely to be useful for. The power of genomics is not that you can use whole genomes for phylogenetic analysis, but that you can get markers useful at every level of divergence quickly and cheaply. The important thing is to use the appropriate data for the question being addressed.