This invention relates to methods and means for rapid screening of target nucleic acid molecules for the presence of sequence signatures. In preferred embodiments, hybridization data is processed by a programmable digital computer.
Polynucleotide arrays, such as the GeneChip® array (Affymetrix, Inc., Santa Clara, Calif., USA), can contain many thousands of differently sequenced polynucleotide probes at feature densities greater than five hundred thousand per cm². Such arrays enable one to obtain nucleotide sequence information from target nucleic acid molecules. The information is obtained by performing a hybridization reaction between the target nucleic acid molecule and the polynucleotide probes on the polynucleotide array. The location and identity of the probes to which the target has hybridized, and the extent of hybridization, are determined. Because hybridization between nucleic acids is a function of their sequences, analysis of the sequence of the probes to which the target has hybridized, as well as the extent of hybridization, provides information about the sequence of the target molecule.
Because polynucleotide arrays can have many thousands of probes, hybridization reactions create large amounts of raw data for analysis. Several ways of processing such data have already been developed. In one application, one examines hybridization between a target molecule and a set of probes that are based upon a reference nucleotide sequence. Probes in the set to which the target does not hybridize, or hybridizes only weakly, indicate sequences at which the target differs from the reference sequence. Nucleic acid arrays have been used to interrogate single nucleotide differences between reference and target nucleic acid sequences. Examples include the identification of genetic variants of infectious agents, such as HIV, or of genes associated with human genetic diseases, such as cystic fibrosis.
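By way of illustration only, the reference-based analysis described above can be sketched in Python. The function names, the fixed probe length, and the single signal threshold are assumptions made for this sketch and are not taken from any particular array product; real signal processing would also need normalization and mismatch controls.

```python
def tile_probes(reference, probe_len):
    """Return overlapping probes covering the reference, one per start offset."""
    return [reference[i:i + probe_len]
            for i in range(len(reference) - probe_len + 1)]


def weak_probes(probes, signals, threshold):
    """Return (offset, probe) pairs whose hybridization signal is below threshold.

    `signals` holds one (hypothetical, already normalized) intensity per probe.
    Weak probes mark stretches where the target may differ from the reference.
    """
    if len(probes) != len(signals):
        raise ValueError("need exactly one signal per probe")
    return [(i, p) for i, (p, s) in enumerate(zip(probes, signals))
            if s < threshold]


# Toy data: three 4-mer probes tiled across a 6-base reference.
probes = tile_probes("ACGTAC", 4)
flagged = weak_probes(probes, [0.9, 0.2, 0.8], threshold=0.5)
```

Because adjacent probes overlap, a single base difference in the target would be expected to weaken every probe that spans it, so a run of flagged offsets is more informative than one flagged probe.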
Other ways of obtaining useful information from hybridization data would be of benefit to the scientific and medical communities.
The present invention involves a hierarchical method of array-based analysis in which single nucleotide base determination may or may not be one step. The present invention has several embodiments, many of which involve the determination of a sequence signature. Useful sequence signatures include polynucleotide or polypeptide sequence signatures, such as those defining protein domains, gene families, different genes in a genome, repeat sequences, or polymorphic forms of a gene. The methods involve performing hybridization assays between the target nucleic acid molecules to be screened and polynucleotide arrays designed to identify targets that contain the sequence signatures. The arrays contain probe sets. The probes in a set, taken together, represent the sequence of the sequence signature, or variations upon that sequence. Thereby, the probes define the reference sequence signature and sequences related to the sequence signature. A hybridization assay between the target molecule and the probes in the array generates data about which probes the target has hybridized to. The extent of hybridization may likewise be determined. Computer programs are then used to process the data. By determining whether the target has hybridized to probes defining one or more reference sequences, or to probes defining sequences that deviate from the reference sequences, one can determine whether the target has the same sequence or a sequence similar to one or more of the reference sequences. By selecting appropriate reference sequences to put on the array as probes, one can determine whether a target encodes a particular closely related polypeptide sequence signature, is a member of a gene family, or has the sequence of a particular or closely related gene in the genome. One can also look at patterns of differences between target and reference sequences to identify novel gene families, new members of gene families, and the like. 
By identifying the similarities and/or differences between the reference and target sequences, one can also determine the position on the chromosome of a target nucleic acid molecule.
To determine whether a target nucleic acid molecule contains a sequence signature, the following steps can be employed: providing a polynucleotide array comprising a set of polynucleotide probes that define the sequence signature; generating hybridization data by performing a hybridization reaction between the target nucleic acid molecule and the probes in the set and detecting hybridization between the target nucleic acid molecule and each of the probes in the set; and processing the hybridization data to determine whether the target nucleic acid molecule has the sequence signature. In certain embodiments, the sequence signature is a polypeptide sequence signature; the sequence signature contains variable positions; and the step of processing is performed by a programmable digital computer. In another embodiment, if the sequence signature is an amino acid sequence signature, the array comprises sets of probes that define the degenerate set of nucleotide sequence signatures encoding the polypeptide sequence signature. In addition, or as an alternative to degenerate probe sets, useful probe sets can contain inosine, other generic bases, or mixtures of A, C, T, and G at the third position of a codon site. Probe sets can also contain sequences that query the presence of polymorphic variants of a sequence signature.
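The degenerate set of nucleotide sequences encoding a short polypeptide sequence signature can be enumerated as in the following Python sketch. It assumes the standard genetic code and DNA (T rather than U) codons; the helper name `degenerate_set` is hypothetical. The number of sequences grows multiplicatively with signature length, which is one reason generic bases at the third codon position may be preferred for longer signatures.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each amino acid (one-letter code, '*' = stop) to its list of codons.
CODONS_FOR = {}
for codon, aa in zip(("".join(c) for c in product(BASES, repeat=3)), AMINO):
    CODONS_FOR.setdefault(aa, []).append(codon)


def degenerate_set(peptide):
    """Return every DNA sequence that encodes the given peptide signature."""
    choices = [CODONS_FOR[aa] for aa in peptide]
    return ["".join(combo) for combo in product(*choices)]


# Methionine and tryptophan each have one codon; phenylalanine has two.
single = degenerate_set("MW")
pair = degenerate_set("MF")
```

Each returned sequence could serve as the basis for one probe, or for one tiled probe set, in the array.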
One aspect of the invention provides a method of analyzing a nucleic acid sample, comprising selecting a hierarchy of assay techniques comprising at least a first and second assay. The first assay is selected to provide a determination of the presence or absence or variant of a first sequence signature and the second assay is selected to provide a determination of the presence or absence or variant of a second sequence signature. At least one of the assays employs a high-density nucleic acid array. One analyzes the nucleic acid sample using the first assay. One may then opt to analyze the nucleic acid sample in a second assay depending upon the results of the first assay.
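The control flow of such a hierarchy can be sketched as follows. This is a minimal Python illustration under stated assumptions: the assays are modeled as plain functions, the toy assay simply searches a sequence string for a signature, and the names `run_hierarchy`, `contains_signature`, and `proceed_if` are hypothetical. An actual assay would return hybridization data rather than a boolean.

```python
def contains_signature(signature):
    """Build a toy assay that reports whether a sequence contains `signature`."""
    def assay(sample):
        return signature in sample
    return assay


def run_hierarchy(sample, first_assay, second_assay, proceed_if):
    """Run the first assay, then the second only if `proceed_if` approves.

    Returns (first_result, second_result); second_result is None when the
    second assay was skipped.
    """
    first = first_assay(sample)
    second = second_assay(sample) if proceed_if(first) else None
    return first, second


# Toy hierarchy: look for a conserved signature first, then a variable one.
conserved = contains_signature("GATTACA")
variable = contains_signature("CCGG")
result_a = run_hierarchy("TTGATTACATTCCGGA", conserved, variable, lambda hit: hit)
result_b = run_hierarchy("TTTTCCGG", conserved, variable, lambda hit: hit)
```

Passing the decision rule as `proceed_if` keeps the sketch neutral as to whether the second assay follows a positive or a negative first result.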
In a further embodiment, the first or second sequence signature is a conserved region of a gene family. In certain embodiments, the first or second sequence signature is a nonconserved region of a gene family. The method can additionally comprise determining the full length sequence of said nucleic acid target.
The present invention also provides a method of selecting clones for analysis. This aspect of the invention provides a support having a variety of clones associated with it. The support is exposed to one or more polynucleotides under low, medium, or high stringency conditions to permit at least some hybridization between the clones and the polynucleotides. One identifies the clones that hybridize with the polynucleotides. Clones selected for analysis are those not identified as hybridizing to the polynucleotides. In one embodiment of this method, the support is a high-density nucleic acid array.
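The selection rule in this aspect can be illustrated with a short Python sketch. The clone identifiers, the normalized signal values, and the single detection threshold are assumptions for illustration; how hybridization is scored in practice would depend on the stringency conditions and detection method used.

```python
def select_clones(signals, threshold):
    """Return the clones NOT identified as hybridizing, sorted by identifier.

    `signals` maps a clone identifier to a (hypothetical, normalized)
    hybridization signal; a clone counts as hybridizing at or above threshold.
    """
    hybridized = {clone for clone, s in signals.items() if s >= threshold}
    return sorted(set(signals) - hybridized)


# Toy data: only clone1 hybridizes to the known polynucleotides.
selected = select_clones({"clone1": 0.9, "clone2": 0.1, "clone3": 0.4}, 0.5)
```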
Also provided is a method of screening a nucleic acid sample for analysis. The steps are: providing a sample containing nucleic acids; analyzing whether the sample contains a sequence signature using a high-density nucleic acid array; and further analyzing the nucleic acid sample only if that sequence signature is not present.
This invention also provides a method for determining whether a target molecule has a sequence from a gene family member. The method involves providing a polynucleotide array comprising, for each of at least two different gene family members, a set of polynucleotide probes that define a reference nucleotide sequence from the region of the gene family member; generating hybridization data by performing a hybridization reaction between the target nucleic acid molecule and the probes in the sets and detecting hybridization between the target nucleic acid molecule and each of the probes in the sets; and processing the hybridization data to determine whether the target nucleic acid has the reference sequence from one of the gene family members.
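One possible way to process such data is sketched below in Python. The scoring rule (the fraction of a member's probes whose signal meets a threshold) and both cutoff values are assumptions chosen for illustration, as are the function and feature names; the method described above does not prescribe a particular scoring rule.

```python
def call_family_member(probe_sets, intensities, threshold=0.5, min_fraction=0.8):
    """Return the best-supported gene family member, or None if none qualifies.

    probe_sets:  member name -> list of feature ids defining its reference
    intensities: feature id -> hybridization signal (missing features count as 0)
    """
    best_name, best_fraction = None, 0.0
    for name, features in probe_sets.items():
        if not features:
            continue
        hits = sum(1 for f in features if intensities.get(f, 0.0) >= threshold)
        fraction = hits / len(features)
        if fraction > best_fraction:
            best_name, best_fraction = name, fraction
    return best_name if best_fraction >= min_fraction else None


# Toy data: the target lights up every probe for memberA, one of three for memberB.
probe_sets = {"memberA": ["f1", "f2", "f3"], "memberB": ["f4", "f5", "f6"]}
intensities = {"f1": 0.9, "f2": 0.8, "f3": 0.7, "f4": 0.1, "f5": 0.6, "f6": 0.2}
call = call_family_member(probe_sets, intensities)
```

A result of `None` would correspond to a target that matches none of the reference sequences well, which in some embodiments is the case selected for further sequencing.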
In one embodiment, the step of selecting the target nucleic acid molecule is performed by determining whether the target hybridizes to a nucleic acid probe that hybridizes to a gene encoding the gene family members. In another embodiment, the step of processing is performed by a programmable digital computer. In another embodiment, the polynucleotide array further comprises, for each of the gene family members, a probe set defining a highly conserved region of the gene and a probe set defining a highly variable region of the gene. In a further embodiment, the polynucleotide array further comprises, for each of the gene family members, probe sets defining at least two highly conserved regions of the gene and probe sets defining at least two highly variable regions of the gene. In another embodiment, the reference nucleotide sequence codes for an amino acid sequence and the array further comprises probe sets capable of defining the different nucleotide sequences encoding the amino acid sequence. In one embodiment, the method further comprises the step of determining the nucleotide sequence of the target nucleic acid molecule if the target does not have the chosen signature sequence of the gene family member.
In another aspect, the invention provides a computer program product for analyzing hybridization data comprising: code that receives as input the sequence of a polynucleotide probe in each feature of a polynucleotide array; code that receives as input reference nucleotide sequences from a plurality of members of a gene family; code that identifies a set of features in the polynucleotide array having probes that define the nucleotide sequences; code that receives as input hybridization data from a hybridization reaction between a target nucleic acid molecule and polynucleotide probes in the polynucleotide array; code that processes the hybridization data to determine whether the target nucleic acid molecule has a sequence from any of the reference sequences; and a computer readable medium that stores the codes.
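The code elements recited above might be organized along the following lines. This Python sketch is illustrative only: it assumes that a probe "defines" a reference when the probe sequence occurs in the reference or in its reverse complement, and that a reference is called present only when every such feature gives a signal at or above a threshold. Both assumptions, and all names, are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))


def features_defining(feature_probes, reference):
    """Return sorted ids of features whose probe occurs in the reference
    or in its reverse complement."""
    rc = reverse_complement(reference)
    return sorted(fid for fid, probe in feature_probes.items()
                  if probe in reference or probe in rc)


def call_references(feature_probes, references, hybridization, threshold=0.5):
    """Map each reference name to True if all of its defining features
    hybridized at or above threshold (False if it has no defining features)."""
    calls = {}
    for name, ref in references.items():
        feats = features_defining(feature_probes, ref)
        calls[name] = bool(feats) and all(
            hybridization.get(f, 0.0) >= threshold for f in feats)
    return calls


# Toy inputs: probe sequence per feature, two references, measured signals.
feature_probes = {"f1": "ACGT", "f2": "CGTA", "f3": "TTTT", "f4": "GGGC"}
references = {"ref1": "ACGTA", "ref2": "GGGCC"}
calls = call_references(feature_probes, references,
                        {"f1": 0.9, "f2": 0.8, "f4": 0.1})
```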
In another aspect, this invention provides a method that involves determining whether a target nucleic acid molecule comprises a sequence from one of a set of genes. The method comprises: providing a target nucleic acid molecule comprising nucleotide sequences from genomic DNA; providing a polynucleotide array comprising, for each gene in the set, polynucleotide probes that define at least one sequence signature from a unique region of the gene; generating hybridization data by performing a hybridization reaction between the target nucleic acid molecule and the probes in the sets and detecting hybridization between the target nucleic acid molecule and each of the probes in the sets; and processing the hybridization data to determine whether the target nucleic acid comprises a sequence from the unique region of one of the genes. In one embodiment, the step of processing is performed by a programmable digital computer. In another embodiment, the unique region of the gene codes for an amino acid sequence. In a further embodiment, the polynucleotide array further comprises, for each of the unique regions, a set of polynucleotide probes whose sequences define the degenerate set of nucleotide sequences that encode the amino acid sequence. The probes in such embodiments can, in addition or as an alternative, comprise sequences that contain generic bases such as inosine, particularly at the third codon position. As a further additional or alternative option, polynucleotide probes can have a mixture of A, C, T, and G in the third codon position within a single feature of a polynucleotide array.
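The use of a generic base or base mixture at the third codon position can be illustrated with the Python sketch below. It assumes an in-frame coding sequence and treats the generic symbol ("N" for a mixture of A, C, T, and G, or "I" for inosine) as matching any base, which is a simplification of actual base-pairing behavior. The function names are hypothetical.

```python
def wobble_probe(coding_seq, generic="N"):
    """Replace every third base of an in-frame coding sequence with `generic`."""
    if len(coding_seq) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return "".join(generic if i % 3 == 2 else base
                   for i, base in enumerate(coding_seq))


def probe_matches(probe, target, generic="N"):
    """Return True if target agrees with probe at every non-generic position."""
    return len(probe) == len(target) and all(
        p == generic or p == t for p, t in zip(probe, target))


# ATG GCT (Met-Ala) becomes a probe that tolerates any third-position base.
probe = wobble_probe("ATGGCT")
same_peptide = probe_matches(probe, "ATGGCA")   # GCA also encodes Ala
first_base_change = probe_matches(probe, "ATGTCA")
```

A single such probe stands in for many members of the degenerate set, at the cost of also matching some sequences that encode a different amino acid (for example, a probe derived from ATG would match ATA, which encodes isoleucine).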
In another aspect, this invention provides a computer program product for analyzing hybridization data comprising: code that receives as input the sequence of a polynucleotide probe in each feature of a polynucleotide array; code that receives as input sequence signatures from a unique region of a plurality of genes; code that identifies a set of features in the polynucleotide array having probes that define the sequence; code that receives as input hybridization data from a hybridization reaction between a target nucleic acid molecule and polynucleotide probes in the polynucleotide array; code that processes the hybridization data to determine whether the target nucleic acid molecule comprises a sequence from any of the sequence signatures; and a computer readable medium that stores the codes.