Genomic abnormalities are often associated with various genetic disorders, degenerative diseases, and cancer. For example, the deletion or multiplication of copies of genes and the deletion or amplification of genomic fragments or specific regions are common occurrences in cancer. In particular, alterations in proto-oncogenes and tumor-suppressor genes are frequently characteristic of tumorigenesis. The identification and cloning of specific genomic regions associated with cancer and various genetic disorders is therefore of interest both to the study of tumorigenesis and to the development of better means of diagnosis and prognosis.
Identification of polynucleotides that correspond to copy number alterations in cancerous, pre-cancerous, or low metastatic potential cells relative to normal cells of the same tissue type provides the basis for diagnostic tools, facilitates drug discovery by providing targets for candidate agents, and further serves to identify therapeutic targets for cancer therapies that are more closely tailored to the type of cancer to be treated.
In diagnostic genome sequencing, the computational complexity involved in sequence analysis of the three billion base pairs in the human genome is further compounded by the accuracy requirements of clinical diagnostics, such that 60 billion or more sequence data points must be analyzed to provide one accurate genome sequence. Early sequencing methods dealt with this complexity by generating sequence data from thousands of isolated, very long fragments of DNA, thereby preserving the contextual integrity of the sequence information and reducing the redundant testing required for accurate data. However, this approach, which was used to generate the first complete human genome, cost hundreds of millions of dollars per genome because of the up-front complexity of preparing the genome fragments and the relatively high cost of the many individual biochemical tests.
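The figures above imply roughly a twenty-fold redundancy per base (60 billion data points over three billion base pairs). The short sketch below only makes that arithmetic explicit; the function name `implied_redundancy` is hypothetical and is not taken from any sequencing toolkit.

```python
def implied_redundancy(data_points: int, genome_size_bp: int) -> float:
    """Return the average number of sequence data points per base pair."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return data_points / genome_size_bp


# Figures quoted in the text: 60 billion data points, 3 billion base pairs.
fold = implied_redundancy(60_000_000_000, 3_000_000_000)
print(f"Implied redundancy: {fold:.0f}-fold")
```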
In addition, the analysis of contextual information in the genome is complicated by the presence of two distinct copies of the genome in each human cell, such that accurate clinical analysis and diagnosis require the ability to distinguish DNA sequence as a function of genome copy. Thus, a major challenge is to distinguish sequence differences between the two unique copies of the three billion DNA bases, which are interspersed with millions of inherited single nucleotide polymorphisms (SNPs), hundreds of thousands of short insertions and deletions, and hundreds of spontaneous mutations.
Some approaches have been developed that aid in the identification of copy number variants (“CNV”) within a complete DNA sequence and that increase confidence in the identification by comparing the sequence with reference sequences or with multiple different copies of the sequence. In these approaches, the identification of copy number and its validation are based on different sets of samples, and the data used are relatively error-prone and known to harbor certain artifactual biases.
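As a rough illustration of comparing a sequence against a reference to infer copy number, the sketch below estimates per-region copy number from read depth. It is a minimal example under stated assumptions, not the method of any particular approach referred to above: the read-depth technique, the median normalization, the diploid reference, and the names `copy_number_estimates`, `sample_depth`, and `reference_depth` are all assumptions introduced here for illustration.

```python
from statistics import median


def copy_number_estimates(sample_depth, reference_depth, reference_copies=2):
    """Estimate copy number per region from read depth relative to a reference.

    Assumes both depth lists cover the same regions in the same order and that
    the reference carries `reference_copies` copies of every region.
    """
    if len(sample_depth) != len(reference_depth):
        raise ValueError("depth lists must have the same length")

    # Normalize each profile by its median depth so that differences in
    # overall sequencing yield between the two samples cancel out.
    sample_median = median(sample_depth)
    reference_median = median(reference_depth)
    if sample_median == 0 or reference_median == 0:
        raise ValueError("median depth must be non-zero")

    estimates = []
    for s, r in zip(sample_depth, reference_depth):
        if r == 0:
            # No reference coverage: the ratio is undefined for this region.
            estimates.append(None)
            continue
        ratio = (s / sample_median) / (r / reference_median)
        estimates.append(ratio * reference_copies)
    return estimates


# Hypothetical depths for five regions; the third looks amplified and the
# fifth looks like a single-copy loss relative to the reference.
print(copy_number_estimates([100, 100, 200, 100, 50], [100, 100, 100, 100, 100]))
```

A sketch like this inherits the weaknesses noted above: the sample and the reference come from different sets of samples, and depth measurements carry noise and systematic biases that a simple ratio does not correct.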