Parallel Short Sequence Assembly

Sequencing techniques have always been limited to directly determining relatively short pieces of the genome; the traditional Sanger method would produce a read of average length 800, while the genome of a eukaryote can be several billion bases long. Recently the Sanger method has been supplemented by high-throughput techniques. These methods, which include the 454 sequencer and Illumina sequencer, produce even shorter reads (30- 200 bases), but produce a large number simultanously and cheaply. The impact on the assembly problem is substantial, as the number of reads is necessarily higher than before, while the information content per read is less.

We construct an assembly by piecing together these imperfect reads using overlaps between them. This requires substantial oversampling of the DNA of the organism (traditionally the read dataset would have a total length 3 to 10 times that of the genome). The assembly problem is quite hard because DNA is non-random; there exist many repeated elements longer than the read length interspersed in the genome. To help with assembly, reads could be paired, with each clone pair separated by some fuzzy distance constraint.

Practical methods for solving the assembly problem have used greedy heuristics. They start with a good pairwise alignment as a seed, and then extend the partially assembled region by adding one read at a time. The greedy method has proved insufficient for short sequence assembly. Graph based approaches had been proposed and been shown to produce a better solution, but the memory requirement limited their use to small problem sizes [15], [19], [20], [22], [25], [29]. The biggest problem solved using these approaches was the assembly of a 39 million base genome using a workstation with 64 gigabytes of RAM [10], but the genomes of eukaryotes can be several billion bases long.

In our work [3], [4], [5], [6] we address this shortcoming by developing parallel methods for the construction and manipulation of graphs for sequence assembly, leveraging the large distributed memory available on high performance computers. The main results of our research follow:

  1. We have developed a parallel method for creating a bidirected string graph from a set of input reads totaling hundreds of billions of characters. A bidirected graph is a graph in which each edge has two directions, one at each endpoint [3].
  2. We have developed a parallel method for compacting edges in this graph (constructing the smallest order graph that is homeomorphic with the original graph), by converting the problem to the undirected list ranking problem.
  3. We have adapted the sparse ruling set algorithm [24] for list ranking to work with undirected lists.
  4. The assembly software is organized around a library inspired my MapReduce. This allows us to change the target architecture by reimplementing a total of four functions.
  5. For transcriptome assembly:
    • We have developed a parallel method for transforming the compacted graph into what is known as a conflict graph, by iteratively collapsing independent sets of edges in the graph. A conflict graph is a graph in which, for each node u:
      • u has exactly one adjacent edge, or
      • multiple adjacent edges point into u and multiple adjacent edges point out of u
    • We have developed a parallel method that utilizes coverage information to further refine the assembly of transcriptomes (the part of DNA that is transcribed by the cell because it encodes genes). We assembled a simulated data set of 200x coverage of the Zea Mays transcriptome in a few minutes on a 1024 node Blue Gene/L [4],[6].
  6. For genome assembly:
    • We have developed a parallel method that extracts features that we call partial (k+1)-pair clusters from the paired reads in the seqeunce data, and then uses these features to create high quality ne novo assemblies from short reads.[5]
    • Our method:
      • Is the first method that effectively utilizes distributed memory machines
      • Can scale to gigabase genomes
      • Can assemble a high coverage data state from a 100+ megabase genome in ~4 hours