This application concerns the field of identification of genetic variances in mammalian genomes. For gene sequence variances within the human genome in human populations, the identification of such sequence variation has utility in determining response to drug therapy.
Many drugs or other treatments are known to have highly variable safety and efficacy in different individuals. A consequence of such variability is that a given drug or other treatment may be highly effective in one individual, and ineffective or not well tolerated in another individual. Thus, administration of such a drug to an individual in whom the drug would be ineffective would result in wasted cost and time during which the patient's condition may significantly worsen. Also, administration of a drug to an individual in whom the drug would not be tolerated could result in a direct worsening of the patient's condition and could even result in the patient's death.
For some drugs, up to 99% of the measurable variation in selected pharmacokinetic parameters has been shown to be inherited, or associated with genetic factors. For a limited number of drugs, discrete gene sequence variances have been identified in specific genes that are involved in drug action, and these variances have been shown to account for the variable efficacy or safety of the drug in different individuals.
The exponentially growing number of publicly available expressed sequence tags (ESTs) has led to the development of high-throughput approaches for the rapid detection of single nucleotide polymorphisms (SNPs). These methods focus on the manipulation of available sequence data to identify the most common form of DNA sequence variation, SNPs. Sequence fragments from many different individuals can be assembled into overlapping sequence assemblies and genetic differences at the single nucleotide base level can be identified. A variety of different techniques have been developed for rapidly identifying potential SNPs from these assemblies and for scoring these polymorphisms in such a way as to distinguish them from sequencing error or other experimental artifacts. In such a way, sequence scanning can identify potentially informative genetic markers which can then be correlated to physiologic function, pathophysiologic disease, or drug or therapeutic intervention response.
Manual methods to determine the polymorphisms within a gene or gene family are best suited for a gene or gene family having small numbers of base pairs. These methods are user or investigator dependent and require significant effort in the analysis and interpretation.
Automated methods that make use of sequencing chromatograms to assist in the determination of quality and locations of polymorphisms allow analysis of additional sequences. Unfortunately, automated polymorphism detection is complemented with visual inspection of the chromatogram traces and such methods are best suited for polymorphism detection in gene families or in genomic loci where the base pair number is in the tens of thousands. A further limitation to this method is that not all chromatograms are available for each of the ESTs. Thus, this method may frequently require further substantial sequence data and corresponding information.
In establishing a link between drug response and genetic polymorphism, one must use all available ESTs, and thus a computational method for the high-throughput analysis of this sequence data, for the identification of these potentially critical genetic polymorphisms. Not all of the identified differences in the EST data are SNPs; therefore, it is critical to establish reasonable and statistically stringent strategies to ensure that this analysis results in SNP detection within legitimate confidence limits.
Recently there have been several papers describing computational methods for the detection of genetic polymorphism. One (Picoult-Newberg et al.) describes a staged-filter model. The method employs the sequence alignment capabilities of the PHRAP computer program (Phil Green, University of Washington) for assembly, and does not attempt to optimize this step. The method also utilizes certain calculations in order to remove patches of low-quality sequence having particular characteristics. The method does not include use of any statistical scoring techniques, relying instead on confirmation by laboratory methods.
A second method (Buetow et al.) uses, as the basis for assembly, sequence data for which quality scores and chromatograms are available, again with the use of the PHRAP program. Certain calculations are performed to remove particular types of low-quality sequence. The method makes use of the quality and chromatogram data within these processes, presumably to improve error rates, at the cost of not being able to use sequence data directly from the database, especially sequence data for which such additional information may not be available. The method follows the filtration process with a statistical scoring method based on Bayesian statistics.
The inventors have determined that the identification of gene sequence variances within genes that may be involved in drug action is important for determining whether genetic variances account for variable drug efficacy and safety and for determining whether a given drug or other therapy may be safe and effective in an individual patient. Provided in this invention is a method for the identification of such gene sequence variances which can be useful in connection with predicting differences in response to treatment and selection of appropriate treatment of a disease or condition.
In the present invention, we have identified a computational method for the rapid determination of genetic sequence variation for the purposes of determining the correlation of drug response with genetic variation for a population. In addition, the method can also be used in other applications for which detection of sequence variances is desired. This invention has utility, for example, in programs including drug development, medical management programs, and retrospective analysis of the response of a human population to drug therapy.
In a first aspect, this invention provides a method for identifying at least one variance in at least one gene. The method involves obtaining at least three independent electronic nucleic acid sequences with sequence overlap regions for each gene; comparing the sequence overlap regions for each gene to identify sequence differences; and analyzing the sequences or sequence differences or both to discriminate sequencing errors from sequence variances for each said gene.
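By way of illustration only, the comparing step can be sketched as a column-by-column scan of sequences that have already been aligned. The function name `find_differences`, the use of equal-length strings, and the `.` character marking positions that a sequence does not cover are assumptions made for this sketch and are not taken from the specification; insertions and deletions are not handled.

```python
from collections import Counter


def find_differences(aligned):
    """Return {column: Counter of bases} for columns where sequences disagree.

    `aligned` is a list of equal-length strings (hypothetical input format).
    A '.' marks a position that a given sequence does not cover.
    """
    if len(aligned) < 3:
        raise ValueError("at least three independent sequences are required")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")

    differences = {}
    for col in range(length):
        # Count only the bases from sequences that overlap this column.
        bases = Counter(seq[col] for seq in aligned if seq[col] != ".")
        if len(bases) > 1:
            differences[col] = bases
    return differences


if __name__ == "__main__":
    reads = ["ACGTACGT", "ACGAACGT", "..GAACGT"]
    for col, counts in find_differences(reads).items():
        print(col, dict(counts))
```

Each reported column is only a detected difference at this stage; whether it reflects a true variance or a sequencing error is left to the later analysis.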
Preferably the analyzing includes comparing the at least 3 electronic nucleic acid sequences to identify sequence differences between said sequences, and then applying at least one of the following filters that are helpful for distinguishing true variances from artifacts or sequence errors. One filter involves identifying and removing or discounting sequence differences in portions of the sequences in which the number of sequence differences in an analysis window exceeds a predetermined limit. A second filter involves identifying and removing or discounting of consecutive mismatches. A third filter involves assigning sequence differences a probability of representing a true variance based on sequence context. A fourth filter involves performing a calculation utilizing the detection of particular sequence differences at the same sites in multiple sequences as an indication that each such sequence difference represents a true variance. Preferably the analysis result is a score that is derived from the probability that a detected sequence difference represents a true variance. The above filters can be used singly or in any combination, and can also be used with other filters or sequence quality information.
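As a minimal sketch of the first two filters, assume that the mismatches of one sequence against the rest of the alignment are available as a list of integer positions. The function names and the default window of 10 bases with a limit of 2 mismatches are hypothetical values chosen for illustration; the specification does not state particular values here. This sketch removes the offending mismatches outright, although discounting them with a lower weight would be an equally valid reading.

```python
def window_filter(mismatches, window=10, limit=2):
    """Remove mismatches in any span of `window` bases holding more than `limit` mismatches."""
    positions = sorted(set(mismatches))
    rejected = set()
    start = 0
    for end, pos in enumerate(positions):
        # Shrink the window from the left until it spans fewer than `window` bases.
        while pos - positions[start] >= window:
            start += 1
        if end - start + 1 > limit:
            rejected.update(positions[start:end + 1])
    return [p for p in positions if p not in rejected]


def consecutive_filter(mismatches):
    """Remove mismatches that are directly adjacent to another mismatch."""
    present = set(mismatches)
    return [p for p in sorted(present)
            if p - 1 not in present and p + 1 not in present]


if __name__ == "__main__":
    observed = [5, 50, 52, 55, 80, 81, 120]
    kept = consecutive_filter(window_filter(observed))
    print(kept)
```

In this example positions 50, 52 and 55 fall within one 10-base window and exceed the limit, and positions 80 and 81 are adjacent, so only positions 5 and 120 survive both filters.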
Thus, in preferred embodiments, the variance scanning method of this invention generally follows four steps. First, the cDNA fragment sequences (ESTs) which are all derived from the same gene are clustered together. Similarly, any set of sequences from a gene can be assembled, e.g., genomic sequences or cDNA sequences. Second, those sequences are aligned together, either by multiple alignment methods (e.g. PHRAP) or by iterative pairwise alignment. Third, areas of poor sequence quality are filtered out by a variety of processes, removing cases where observed sequence polymorphism is an artifact due to error. Fourth, the remaining sites of polymorphism are scored on the basis of the likelihood that they represent true polymorphism rather than artifact. This fourth step generally employs information from multiple sequences in the alignment at the point of variation, and has the capacity to employ statistical models for purposes of validation. The sensitivity and selectivity of variance detection by this method are typically greater for genes in which the sequence data, e.g., EST data, is complete along the whole sequence, rather than concentrated in the 3′ end, as is the case with most publicly available EST data.
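One possible form of the fourth, scoring step is sketched below. It is an illustrative binomial model and not the statistical model of the invention: it assumes that sequencing errors are independent between sequences, that each base is miscalled at a uniform rate `error_rate` (a hypothetical default of 0.01), and that an error is equally likely to produce any of the three other bases. The score is the negative base-10 logarithm of the probability that at least `n_alt` of `n_reads` sequences would show the same alternative base through error alone.

```python
import math


def support_score(n_reads, n_alt, error_rate=0.01):
    """Score a site by how unlikely its alternative-base support is under error alone.

    Illustrative model only: independent errors, uniform error rate, and an
    error producing each of the three other bases with equal probability.
    """
    if not 0 <= n_alt <= n_reads:
        raise ValueError("n_alt must be between 0 and n_reads")
    p = error_rate / 3.0  # chance that one read shows this specific wrong base
    tail = sum(
        math.comb(n_reads, k) * p ** k * (1.0 - p) ** (n_reads - k)
        for k in range(n_alt, n_reads + 1)
    )
    return -math.log10(tail)


if __name__ == "__main__":
    for n_alt in (1, 2, 3):
        print(n_alt, round(support_score(10, n_alt), 2))
```

Under these assumptions a difference seen in several independent sequences at the same site receives a markedly higher score than one seen in a single sequence, which is the behavior the fourth step relies on.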
In the context of this invention, “true polymorphisms” or “true variances” are polymorphisms or variances which actually occur in the nucleic acid of individuals as compared to other individuals. This is distinguished from apparent or detected polymorphisms or variances which appear as differences in representations (generally electronic representations) of nucleic acid sequences, and which may represent true variances or may represent artifacts due to sequencing errors or other errors and do not represent actual nucleic acid sequence differences in the individuals from whom the sequence was determined.
Variances occur in the human genome at approximately one in every 500-1,000 bases when two alleles are compared. When multiple alleles from unrelated individuals are compared the frequency of observation of variant sites increases. At most variant sites there are only two alternative nucleotides involving the substitution of one base for another or the insertion/deletion of one or more nucleotides. Within a gene there may be several variant sites. Variant forms of the gene or alternative alleles can be distinguished by the presence of alternative variances at a single variant site, or a combination of several different variances at different sites.
Determining the presence of a particular variance or plurality of variances in a particular gene in a population can be performed in a variety of ways. The term “computational method” refers to a set of algorithms performed in a prescribed order.
The term “filter” as used herein is an algorithm intended to exclude base pair mismatches observed between sequences to enhance the likelihood that base pair mismatches which pass the filter represent true variances rather than artifacts, e.g., sequencing artifacts.
The term “discount sequence differences” or phrases of like import which refer to discounting observed variances or sequence differences between two or more sequences is intended to reflect the treatment of identified sequence differences that are not considered further, or are assigned a lower probability or weighting than would otherwise be utilized, in computational filters, algorithms, or in the results obtained.
The process of “identifying” or discovering new variances involves analyzing the sequence of a specific gene in at least two alleles, more preferably at least 3, 5, 7, 8, or 10 alleles, still more preferably at least 12, 15, 20, 30, or 40 alleles, and most preferably at least 50 alleles, or from at least that number of individual cell sources. The analysis of large numbers of individuals to discover variances in the gene sequence between individuals in a population will result in detection of a greater fraction of all the variances in the population. Typically, independent sequences reported in sequence databases represent sequencing from independent alleles. Thus, in the various aspects of this invention, in preferred embodiments, the numbers of independent sequences corresponding to a gene that are utilized are as just indicated for sequence variance in multiple alleles.
In the various aspects of this invention, preferably sequence information from at least 100 genes is analyzed, more preferably at least 500, 1000, 2000, 3000, 5000, 7000, 10000, or even more.
The sequence information is preferably for cDNA or genomic DNA. The organism can be any organism for which multiple sequences are available, but is preferably from a mammal, more preferably from human.
Preferably the process of identifying reveals whether there is a variance within the gene; more preferably identifying reveals the location of the variance within the gene; more preferably identifying provides knowledge of the nucleic acid sequence of the variance, and most preferably identifying provides knowledge of the combination of different variances that comprise specific variant forms of the gene or alleles. In identifying new variances it is often useful to screen different population groups based on race, ethnicity, gender, and/or geographic origin because particular variances may differ in frequency between such groups. It may also be useful to screen DNA from individuals with a particular disease or condition of interest because they may have a higher frequency of certain variances than the general population.
The term “genotype” in the context of this invention refers to the particular allelic form of a gene, which can be defined by the particular nucleotide(s) present in a nucleic acid sequence at a particular site(s).
The terms “variant form of a gene”, “form of a gene”, or “allele” refer to one specific form of a gene in a population, the specific form differing from other forms of the same gene in the sequence of at least one, and frequently more than one, variant sites within the sequence of the gene. The sequences at these variant sites that differ between different alleles of the gene are termed “gene sequence variances” or “variances” or “variants”. The term “alternative form” refers to an allele that can be distinguished from other alleles by having distinct variances at at least one, and frequently more than one, variant sites within the gene sequence.
Other terms known in the art to be equivalent to “variances” include mutation and single nucleotide polymorphism (SNP). In this invention, the variances are selected from the group as identified through the use of a computational method. Reference to the presence of a variance or variances means particular variances, i.e., particular nucleotides at particular polymorphic sites, rather than just the presence of any variance in the gene.
The terms “variance scanning” or “scanning” refer to the method of rapidly determining whether there are nucleotide sequence differences between one or more of cDNA or genomic samples from one or more individuals or population samples. Preferably the method utilizes a computationally-based approach, e.g., as described herein.
In preferred embodiments in which a plurality of variances is determined, the plurality of variances can constitute a haplotype or haplotypes.
In the context of this invention, the term “haplotype” refers to a cis arrangement of two or more polymorphic nucleotides, i.e., variances, on a particular chromosome, e.g., in a particular gene. The haplotype preserves the information of the phase of the polymorphic nucleotides, that is, which set of variances was inherited from one parent, and which from the other.
In the context of this invention, the term “analyzing a sequence” refers to determining at least some sequence information about the sequence, e.g., determining the nucleotides present at particular sites in the sequence or determining the base sequence of all or a portion of the particular sequence.
The term “drug” as used herein refers to a chemical entity or biological product, or combination of chemical entities or biological products, administered to a person to treat or prevent or control a disease or condition. The chemical entity or biological product is preferably, but not necessarily, a low molecular weight compound, but may also be a larger compound, for example, an oligomer of nucleic acids, amino acids, or carbohydrates including without limitation proteins, oligonucleotides, ribozymes, DNAzymes, glycoproteins, lipoproteins, and modifications and combinations thereof. A biological product is preferably a monoclonal or polyclonal antibody or fragment thereof such as a variable chain fragment; cells; or an agent or product arising from recombinant technology, such as, without limitation, a recombinant protein, recombinant vaccine, or DNA construct developed for therapeutic, e.g., human therapeutic, use. The term “drug” may include, without limitation, compounds that are approved for sale as pharmaceutical products by government regulatory agencies (e.g., U.S. Food and Drug Administration (USFDA or FDA), European Medicines Evaluation Agency (EMEA), and a world regulatory body governing the International Conference of Harmonization (ICH) rules and guidelines), compounds that do not require approval by government regulatory agencies, food additives or supplements including compounds commonly characterized as vitamins, natural products, and completely or incompletely characterized mixtures of chemical entities including natural compounds or purified or partially purified natural products. The term “drug” as used herein is synonymous with the terms “medicine”, “therapeutic intervention”, “pharmaceutical product”, or “product”. Most preferably the drug is approved by a government agency for treatment of a specific disease or condition.
The terms “disease” or “condition” are commonly recognized in the art and designate the presence of signs and/or symptoms in an individual or patient that are generally recognized as abnormal. Diseases or conditions may be diagnosed and categorized based on pathological changes. Signs may include any objective evidence of a disease such as changes that are evident by physical examination of a patient or the results of diagnostic tests which may include, among others, laboratory tests to determine the presence of variances or variant forms of certain genes in a patient. Symptoms are signs or indications in a patient of a disease, disorder, or condition that differs from normal function, sensation, or appearance, which may include, without limitation, physical disabilities, morbidity, pain, and other changes from the normal condition experienced by an individual. Various diseases or conditions include, but are not limited to, those categorized in standard textbooks of medicine including, without limitation, textbooks of nutrition, allopathic, homeopathic, and osteopathic medicine. In certain aspects of this invention, the disease or condition is selected from the group consisting of the types of diseases listed in standard texts such as Harrison's Principles of Internal Medicine (14th Ed) by Anthony S. Fauci, Eugene Braunwald, Kurt J. Isselbacher, et al. (Editors), McGraw Hill, 1997, or Robbins Pathologic Basis of Disease (6th edition) by Ramzi S. Cotran, Vinay Kumar, Tucker Collins and Stanley L. Robbins, W B Saunders Co., 1998, or other texts described below.
The term “therapy” refers to a process which is intended to produce a beneficial change in the condition of a mammal, e.g., a human, often referred to as a patient. A beneficial change can, for example, include one or more of: restoration of function, reduction of symptoms, limitation or retardation of progression of a disease, disorder, or condition or prevention, limitation or retardation of deterioration of a patient's condition, disease or disorder. Such therapy can involve, for example, nutritional modifications, administration of radiation, administration of a drug, and combinations of these, among others. Another term that is synonymous with “therapy” is “candidate therapeutic intervention” and is used herein.
In a related aspect, the invention provides a set of computer instructions, which may be a program or set of programs, but is preferably a single program or a set of linked programs that encodes the functions for the method of the preceding aspect.
Preferably the set of computer instructions is embedded in a computer-readable medium, which may be, for example, one or more of read-only memory (ROM), random access memory (RAM), magnetic recording media such as magnetic tape, hard disks, floppy disks, and other magnetic disk formats, as well as in other formats such as optical and magneto-optical disks (e.g., compact disks (CDs) and disks for write-once-read-many (WORM) drives). Preferably, the medium is installed in or is part of a computer system. Such computer systems may be, for example, dedicated purpose computers, general purpose computers, and/or part of a computer network.
Preferably the instruction set is installed in or is part of a general purpose computer which can be part of a network, and also can be connected to a broader network such as the Internet, e.g., for data retrieval. In other embodiments, the instruction set is installed on a computer system in a manner such that the instruction set can be accessed over a network, e.g., over the Internet. In some embodiments, the set of instructions or a necessary part thereof is or can be downloaded from a remote computer over the network, or alternatively can be used for analysis with all or most of the functionality remaining on a storage computer or server. In the latter mode, analysis results can be transmitted to the remote computer.
Thus, the set of instructions for computer-based identification of sequence variances in nucleotide sequences preferably provides sequence comparisons of sets of at least 3 independent sequences of at least portions of a gene; and at least one from a set of filters to distinguish true variances from sequence errors, where execution of the set of instructions on a set of at least 3 independent sequences provides a result indicative of the probability that a sequence difference detected between the sequences in the set represents a true variance.
The set of filters preferably includes at least one of: a filter to identify low quality sequence regions, a filter to identify adjacent base changes; a filter to characterize the probability of sequence error or probability of true variance based on sequence context, and a filter utilizing the detection of particular sequence differences at the same sites in multiple sequences as an indication that each such sequence difference represents a true variance. As described in the first aspect, the filters can be used singly or in any combination. The set of filters can also optionally include other filters.
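A filter based on sequence context could, for example, discount differences that fall inside a long run of identical bases, on the assumption that such runs are more prone to sequencing error. The sketch below is hypothetical: the choice of homopolymer runs as the context feature, the function names, the threshold `max_run` of 4, and the weighting formula are all illustrative assumptions and are not taken from the specification.

```python
def homopolymer_run(seq, pos):
    """Return the length of the run of identical bases that contains seq[pos]."""
    base = seq[pos]
    left = pos
    while left > 0 and seq[left - 1] == base:
        left -= 1
    right = pos
    while right < len(seq) - 1 and seq[right + 1] == base:
        right += 1
    return right - left + 1


def context_weight(seq, pos, max_run=4):
    """Return a weight in (0, 1]; sites inside long runs are discounted.

    Hypothetical rule: runs shorter than `max_run` keep full weight, and
    longer runs are down-weighted in proportion to their length.
    """
    run = homopolymer_run(seq, pos)
    if run < max_run:
        return 1.0
    return max_run / (2.0 * run)


if __name__ == "__main__":
    print(context_weight("ACGTACGT", 3))  # ordinary context
    print(context_weight("ACGGGGGT", 4))  # inside a run of five G bases
```

A weight produced in this way could be multiplied into a per-site score so that the context filter discounts, rather than removes, a detected difference.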
Thus, in related aspects, the invention provides computer-based or computer-related systems or devices useful for identifying gene sequence variances. Preferably the system is designed to allow access to and utilization of sequence information stored in remote databases.
In one aspect the invention provides a computer readable device that has at least 3 independent nucleotide sequences of at least portions of at least one gene recorded in the device, along with a computer program or programs which analyze differences between the independent sequences to distinguish true variances from sequence errors.
The device preferably includes a medium selected from floppy disk, computer hard drive, optical disk, computer random access memory, and magnetic tape where the nucleotide sequences or the program or both are recorded on that medium.
As in the above aspects, the program or programs preferably provide functions which include comparing the at least 3 electronic nucleic acid sequences to identify sequence differences between said sequences and at least one of the described filters, which include filters for identifying and removing or discounting sequence differences in portions of the sequences where the number of sequence differences in an analysis window exceeds a predetermined limit; identifying and removing or discounting consecutive mismatches; assigning sequence differences a probability of representing a true variance based on sequence context; and utilizing the detection of particular sequence differences at the same sites in multiple sequences as an indication that each such sequence difference represents a true variance. The program or programs can optionally include other functions or filters as well. The result from execution of the program or programs is preferably a score derived from the probability that a detected sequence difference represents a true variance.
In a related aspect, the invention provides a computer-based system for identifying nucleic acid sequence variances. The system includes a data storage medium on which are recorded at least 3 independent nucleotide sequences corresponding to at least portions of at least one gene; a set of instructions allowing analysis of the sequences to identify sequence differences between the independent sequences and to distinguish true variances from sequence errors; and an output device. Preferably the output device is or includes a printer, a video display, and/or a recording medium.
Preferably the set of instructions provides the functions or filters as described in aspects above.
Similarly, in another related aspect, the invention provides a method for identifying nucleic acid sequence variances. The method includes providing a computer-based system for analyzing nucleic acid sequence data, where the system includes a data storage medium in which are recorded at least 3 independent nucleotide sequences corresponding to at least portions of at least one gene, a set of instructions allowing analysis of said sequences to identify sequence differences between said at least 3 independent sequences and to distinguish true variances from sequence errors, and an output device; analyzing the independent sequences; and outputting results of the analysis to the output device.
Preferably the analysis includes the functions or filters as described in aspects above.
By “comprising” is meant including, but not limited to, whatever follows the word “comprising”. Thus, use of the term “comprising” indicates that the listed elements are required or mandatory, but that other elements are optional and may or may not be present. By “consisting of” is meant including, and limited to, whatever follows the phrase “consisting of”. Thus, the phrase “consisting of” indicates that the listed elements are required or mandatory, and that no other elements may be present. By “consisting essentially of” is meant including any elements listed after the phrase, and limited to other elements that do not interfere with or contribute to the activity or action specified in the disclosure for the listed elements. Thus, the phrase “consisting essentially of” indicates that the listed elements are required or mandatory, but that other elements are optional and may or may not be present depending upon whether or not they affect the activity or action of the listed elements.
Other features and embodiments of the invention will be apparent from the following description and from the claims.