The human genome is estimated to contain 3 billion base pairs of DNA. Approximately 50,000 to 100,000 gene coding sequences are believed to be dispersed within the genome. These gene sequences are thought to represent about 3%, or approximately 90 million base pairs, of the human genome.
It is generally recognized that elucidation of the structure of all human genes and their organization within the genome will be beneficial to the advancement of medicine and biology. Databases such as the Genome Sequence Data Bank and GenBank serve as repositories of the nucleotide sequence data generated by ongoing research efforts. Despite the efforts to date, GenBank lists the sequences of only a few thousand human genes.
Recent advances in automated, large-scale sequencing techniques have led to the initiation of two broad approaches to obtaining the sequence of the human genome: chromosome mapping and sequencing, and cDNA sequencing. While scientific debate continues as to the best approach, projects of both kinds have begun in earnest.
The Human Genome Initiative, a multinational effort having government backing in the United States and other countries, is attempting to characterize the genomes of humans and of model organisms using a chromosome-based approach. In the private sector, large-scale sequencing of cDNA reverse transcribed from mRNA expressed in various human tissues, cell types and developmental stages is being pursued by a number of entities.
After publication of the Maxam-Gilbert and Sanger et al. nucleotide sequencing techniques, manual gene sequence assembly methods were practical for single gene or viral genome sequencing projects. As sequencing projects became more ambitious, manual techniques were supplemented by computer-assisted sequence assembly, in which overlaps between fragments are identified by software rather than by eye. However, the large scale of DNA sequencing projects and the rapidity with which automated sequencers generate sequence data have made data analysis a rate-limiting step in the assembly of gene sequence data. The volume of data generated by large-scale sequencing projects requires automated analysis if assembled sequence data are to be provided in a timely manner.
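The overlap identification mentioned above can be illustrated with a minimal sketch. It is not the method of any system cited in this document. The function names `overlap_length` and `merge_pair` and the `min_overlap` parameter are hypothetical, and the sketch assumes error-free fragments on the same strand, so it uses exact suffix-to-prefix matching where a practical assembler would need to tolerate sequencing errors.

```python
from typing import Optional


def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Return the length of the longest suffix of `left` that exactly
    matches a prefix of `right`, or 0 if it is shorter than `min_overlap`."""
    max_len = min(len(left), len(right))
    # Try the longest candidate overlap first and shrink until one matches.
    for length in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def merge_pair(left: str, right: str, min_overlap: int = 3) -> Optional[str]:
    """Join two fragments on their overlap; return None if they do not overlap."""
    n = overlap_length(left, right, min_overlap)
    if n == 0:
        return None
    # Keep `left` whole and append only the non-overlapping tail of `right`.
    return left + right[n:]


if __name__ == "__main__":
    # Two made-up fragments sharing the four bases "GTAC".
    print(overlap_length("ATGCGTAC", "GTACCTTA"))
    print(merge_pair("ATGCGTAC", "GTACCTTA"))
```

Comparing every pair of fragments in this way requires a number of comparisons that grows with the square of the fragment count, which suggests why analysis, rather than sequencing itself, can become the bottleneck as projects grow.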
Towards this end, efforts have been made to improve computer-assisted assembly of nucleotide sequence data. For example, in "Automated DNA Sequencing and Analysis", Adams et al. eds., Academic Press (1995), E. W. Myers presents a discussion of software systems for fragment assembly in Chapter 32, while S. Honda et al. describe in Chapter 33 the Genome Reconstruction Manager, a long-term software engineering project to develop a system to support large-scale sequencing efforts.
Despite these efforts, a need exists for improved computer-assisted nucleotide sequence assembly methods capable of assembling large amounts of sequence data more accurately and more efficiently than existing methods.