Sequencing techniques have always been limited to directly determining relatively short pieces of the genome; the traditional Sanger method would produce a read of average length 800, while the genome of a eukaryote can be several billion bases long. Recently the Sanger method has been supplemented by high-throughput techniques. These methods, which include the 454 sequencer and Illumina sequencer, produce even shorter reads (30- 200 bases), but produce a large number simultanously and cheaply. The impact on the assembly problem is substantial, as the number of reads is necessarily higher than before, while the information content per read is less.
We construct an assembly by piecing together these imperfect reads using overlaps between them. This requires substantial oversampling of the DNA of the organism (traditionally the read dataset would have a total length 3 to 10 times that of the genome). The assembly problem is quite hard because DNA is non-random; there exist many repeated elements longer than the read length interspersed in the genome. To help with assembly, reads could be paired, with each clone pair separated by some fuzzy distance constraint.
Practical methods for solving the assembly problem have used greedy heuristics. They start with a good pairwise alignment as a seed, and then extend the partially assembled region by adding one read at a time. The greedy method has proved insufficient for short sequence assembly. Graph based approaches had been proposed and been shown to produce a better solution, but the memory requirement limited their use to small problem sizes [15], [19], [20], [22], [25], [29]. The biggest problem solved using these approaches was the assembly of a 39 million base genome using a workstation with 64 gigabytes of RAM [10], but the genomes of eukaryotes can be several billion bases long.
In our work [3], [4], [5], [6] we address this shortcoming by developing parallel methods for the construction and manipulation of graphs for sequence assembly, leveraging the large distributed memory available on high performance computers. The main results of our research follow: