My thesis work is on the discovery of genomes (or nuclear DNA) of organisms from experimental data. When I say that we are discovering a genome, I mean that we are attempting to determine the genomic sequence. This is because unlike other cellular molecules (for example RNA and proteins), DNA does not have a well understood secondary structure, and modeling the DNA as a linear molecule has proven useful for biologists.
In diploids, the nuclear DNA of an organism consists of a small number of seperate DNA molecules – every two molecules paired and called a chromosome. Humans have 23 such chromosomes. A genetic map is an approximate placement of genetic markers on the chromosomes of an organism. These markers can be thought of as signposts along the genome, and a genetic map can be thought of as a simplified representation of the much longer genomic sequence. Historically, genetic maps have been used to discover which regions of chromosomes encode genes that affect particular diseases and traits. In our work, we define the problem of finding a consensus map given multiple maps for the same organism, and we show that our formulation of the problem is NP-hard, although it can often be solved in practice.
The genomic sequence can be directly represented by a set of strings over the alphabet {A,G,C, T}, one string for each molecule of nuclear DNA. For the genome assembly problem, we try to discover these strings by piecing together many small substrings directly determined through experiment. Often we cannot completely reconstruct such strings from the data, instead ending up with smaller assembled substrings. We can use a genetic map to discover the positions of each of these pieces along the chromosome. In our work, we develop the first fully parallel method for performing such an assembly, using simple functional building blocks similar to MapReduce. Because of this, our assembler can be adapted to many architectures, including cloud architectures, heterogeneous architectures, serial, and disk-based architectures by replacing a few library elements.
Importantly, our work is in direct response to changes in the quantity and types of biological data available for determining these representations of the genome. Our consensus mapping work relies on access to multiple genetic maps for an organism, which have only recently become available. Our assembly work is only necessary because of the recent development of high throughput methods for reading short sequences from the genome. We have described novel computational methods that effectively discover genomes from these data.
A consistent theme in our work is the use of graph models to achieve our goal. We model a genetic map as a partial order and represent this using a directed graph, and our genome assembly work makes use of a bidirected string graph model for sequence assembly. We describe ways in which we represent and manipulate this graph in distributed memory.