The field of the invention is computational genetic analysis as applied to identifying and detecting polymorphic repeats in human genes.
The exponential rate of accumulation of genomic sequence data in public databases is making it possible to eventually understand, treat and eliminate potentially thousands of genetic diseases, predispositions and adverse drug/treatment reactions. Causes and contributing factors for hundreds of afflictions have been mapped to thousands of specific mutations, some being simple sequence repeat polymorphisms and some single nucleotide polymorphisms (SNPs). Although the now-routine process for identifying these mutations is long and expensive, the medical and commercial value of these discoveries has prompted numerous public and private institutions to apply massive resources to the identification of new potentially causative and correlative genetic variations. These large-scale SNP-centric projects currently underway generally involve the random and therefore low-efficiency re-sequencing of DNAs from several individuals (sometimes panels of patients with a particular affliction of interest) followed by attempts to relate a particular genotype and phenotype. In addition, the inherent sequencing error rate is generally higher than the incidence of most sequence variations, resulting in a very high false-positive rate. This effect is compounded by the well-established population genetic principle that the higher the impact of an allele, the more deleterious it is, the more rare it will be, and hence the less likely it is to be found by these methods. Despite these shortcomings, the value of knowing which variations are present in the population (whatever their frequency) and are linked to phenotypes is such that many companies and institutions are racing in an effort to discover these valuable alleles.
The positional cloning of many disease loci has been facilitated by high-resolution genetic maps. The precise localization of the DNA sequence responsible for a disease usually requires the development of very high density physical and genetic maps. The availability of multiple polymorphic genetic markers is crucial to this effort. Current widely used methods for the identification of new simple sequence repeat polymorphisms involve PCR-based and subcloning strategies, but the large quantities of human genomic sequence being generated by the human genome project are rapidly making these approaches obsolete. Knowledge of the frequency and level of polymorphism of the various types and sizes of microsatellites and variable number tandem repeats (VNTRs) allows one to predict, a priori, which tandem repeats are likely to be highly polymorphic from a single genomic sequence. For microsatellites, the level of heterozygosity has been observed to be directly proportional to the number of repeated units and inversely proportional to the size of the repeated unit.
While there currently exist several software applications for locating some microsatellites or larger tandem repeats, a comprehensive tool for the identification of, and generation of primer sequences for, those repeats correlated with a high probability of polymorphism has been lacking. Because of this need, we wrote software that takes human genomic sequence data as input and outputs a list of oligonucleotide sequences that may be used as primers for PCR amplification of those tandem repeat sequences that are predicted to be highly polymorphic based on observations from the literature. A computational system for the prediction of polymorphic loci directly and efficiently from human genomic sequence was developed and verified. A suite of programs, collectively called POMPOUS (POlymorphic Marker Prediction Of Ubiquitous Simple sequences), detects tandem repeats ranging from dinucleotides up to 250-mers, scores them according to predicted level of polymorphism, and designs appropriate flanking primers for PCR amplification1.
Scheme 1. POMPOUS is a series of programs whose execution is directed by a Perl script. Together, TandMin and TandMax identify di-nucleotide to 250-mer repeats, which are then consolidated by the Perl script, and then oligonucleotide primers are designed using our code PRIMO for those repeats whose characteristics are above thresholds set to select only for those repeats that are highly probable to vary.
An association between certain repeating microsatellite elements and polymorphism has been reported17, caused by the expansion and contraction of the core repetitive unit by what is believed to be the mechanism of either slipped-strand mispairing25, uneven recombination or some combination of both15. In fact, several inherited neurological disorders have been linked to changes in the copy numbers of certain tri-nucleotide repeats. The repetitive element in some diseases lies directly in the coding sequence, such as Machado-Joseph8 (CAG repeat), Haw River Syndrome7 (CAG), Huntington's Disease3,19 (CAG) and Fragile-X Syndrome4 (CGG). However, the location of other polymorphic repetitive elements varies with respect to the coding sequence, as in Friedreich's Ataxia5 (GAA, intron), Myotonic Dystrophy6 (CAG, 3′ UTR), and a gene suspected to be linked to Hyperandrogenaemia for which the repeat occurs in the 5′ UTR12. Triplet repeat expansion diseases (TREDs) also include spinal and bulbar muscular atrophy (SBMA) and fragile X syndrome (FRAXA). Short cytosine-adenine-guanine (CAG) expansions are characteristic for spinal and bulbar muscular atrophy (SBMA), dentatorubral-pallidoluysian atrophy (DRPLA) and spinocerebellar ataxia (SCA) type 1, 2, 3, 6 and 7 (Nilssen O., Tidsskrift for Den Norske Laegeforening. 119(20):3021-7, Aug. 30, 1999). Besides studies of tri-nucleotide repeats associated with neurological disorders3-8,16,24,36, work on repeats has focused on specific genes of interest containing predominant repeat units having a known association with polymorphism such as CAG or CCG13.
The invention provides methods and compositions for identifying polymorphic repeats in genes. In one embodiment, the invention provides methods for identifying a candidate polymorphic repeat within a coding sequence. This embodiment involves scanning a coding sequence for multiple, different candidate polymorphic repeats and generally involves (a) detecting tandem repeats in a target coding sequence; (b) scoring the repeats for polymorphic probability; and (c) generating a dataset correlating the repeats with polymorphic probability. The coding sequence is a transcribed sequence, includes a CDS region and 3′ and 5′ untranslated regions (UTRs) and may be derived from a single coding sequence, a concatamerized coding sequence, a coding sequence library (e.g. cDNA library), etc. The coding sequence may be derived physically from transcribed polynucleotide or in silico from genomic sequence or other sequence comprising coding sequence. The detecting, scoring and generating steps are preferably implemented by a computer program, preferably wherein the scoring step comprises determining at least the type, number and purity of the repeats. In a particular embodiment the program is the Rep-X algorithm.
In another embodiment, the invention provides methods for identifying an actual polymorphic repeat by validating a computationally identified candidate polymorphic repeat in populations of natural genes. This embodiment generally involves (a) computationally identifying a candidate polymorphic repeat; (b) detecting the candidate polymorphic repeat in each of a population of different coding sequences, each from different individuals; and (c) determining whether the candidate polymorphic repeat is polymorphic in the population.
In yet another embodiment, the invention correlates validated, computationally derived polymorphic repeats with phenotypic variations, which may be manifested or prospective, i.e. present as a predisposition. These correlates are used to detect the presence or absence of such phenotypic variation in test genes.
The invention also provides isolated polynucleotides comprising variants of wild-type, unconventional repeats disclosed herein, polypeptides encoded by such variants, as well as binding agents, such as antibodies, specific thereto. Table 1 provides a dataset of human polymorphic repeats as mined from the UniGene database. The polymorphisms are distributed among known genes in 5′ UTR, 3′ UTR, border and CDS regions and in uncharacterized genes. The dataset encompasses both otherwise ascertainable conventional repeats and unconventional repeats not previously characterized or suggested. Conventional repeats are readily delineated by computational exclusion.
The invention provides methods for detecting variance at a polymorphic repeat, comprising the step of detecting in a test gene or coding region the presence or absence of variance at an unconventional polymorphic repeat of Table 1, particularly a repeat in an uncharacterized entry. The detecting step may be effected by any convenient means, including specific polynucleotide hybridization and inferentially by determining the amino acid sequence of a polypeptide translate of the coding region.
The following descriptions of particular embodiments and examples are offered by way of illustration and not by way of limitation. The invention provides tools for identifying genotypic variation by measuring the polymorphism rate for repeats in genes, correlating this data with other genetic and genomic data, selective testing of this class of genes against genetic panels (e.g. affected/non-affected, families), and/or through association with homologous genes in model organisms which are directly manipulated to reveal phenotype.
In one aspect, the invention provides methods for identifying a candidate polymorphic repeat within a coding sequence, which comprise the steps of (a) detecting tandem repeats in a target coding sequence; (b) scoring the repeats for polymorphic probability; and (c) generating a dataset correlating the repeats with polymorphic probability to identify a candidate polymorphic repeat. The repeats may vary from one to hundreds of nucleotides in length and may be present in from two to dozens of copies. The repeats may be of varying degrees of identity (purity or homogeneity), preferably substantially pure, more preferably pure. The methods are applied to any coding sequence, including a single coding sequence (e.g. a single cDNA sequence), a concatamerized coding sequence and a coding sequence library. The coding sequence may be determined physically, e.g. from actual transcribed polynucleotides, or conceptually, e.g. inferred in silico from genomic sequence. The detecting, scoring and/or generating steps may be implemented manually and/or by computer. In a particular embodiment, the steps are effected by a computer program, such as the Rep-X program described below. The scoring step generally comprises determining at least the type and number of the repeats, though measures of repeat purity may also be considered.
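The purity (homogeneity) measure mentioned above can be illustrated with a minimal sketch. It assumes purity is taken as the fraction of bases in a repeat region that agree with a perfect tandem array of the repeat unit; the function name `repeat_purity` and this particular definition are illustrative assumptions, not the scoring rule used by Rep-X.

```python
def repeat_purity(region, unit):
    """Fraction of bases in `region` matching a perfect tandem array of `unit`.

    Hypothetical helper: the definition of purity here is an assumption
    made for illustration only.
    """
    region = region.upper()
    unit = unit.upper()
    if not region or not unit:
        raise ValueError("region and unit must be non-empty")
    # Build a perfect array at least as long as the region, then trim it.
    copies_needed = len(region) // len(unit) + 1
    perfect = (unit * copies_needed)[:len(region)]
    matches = sum(a == b for a, b in zip(region, perfect))
    return matches / len(region)


if __name__ == "__main__":
    print(repeat_purity("CAGCAGCAGCAG", "CAG"))  # pure repeat
    print(repeat_purity("CAGCAGCAACAG", "CAG"))  # one interrupting base
```

Under this definition a pure repeat scores 1.0, and a single interrupting base in a 12-base region lowers the score to 11/12.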
Rep-X is an algorithm that works by iteratively comparing each base pair, x, in a given nucleic acid sequence with the ones following it. Upon recognition of two matching base pairs y pairs ahead in the sequence, a one-dimensional array m(s), containing the values of s, a summed amount, increments the value corresponding to that nucleic acid position in the sequence by one plus the number of similar matches that occurred immediately and consecutively before it. Another one-dimensional matrix stores the nucleic acid position at which the match occurred in the sequence. The summed amount represents the number of consecutive base pair matches for x that have occurred with all prior base pairs, x−1, at that point in the sequence. Upon any mismatch, the summed value is set equal to zero. Once the algorithm has finished running on the sequence, the one-dimensional matrix contains the longest sequence match (besides itself) with respect to that position in the sequence. Values equal to or above 10 in this one-dimensional summary matrix are considered to be significant length matches, and the corresponding base pair coordinates of the highest summary value are compared to the x coordinate at which the significant length match occurred, giving a distance value d between the coordinates. If the first d nucleotides of the significant length subunit are equal to the second d nucleotides (positions d+1 through 2d), the sequence is considered repetitive, with the repeating unit length equal to d and the number of repeating units equal to the length of the repeated region divided by d. Note that a repeating unit by this calculation can occur in fractions. For example, the sequence ATGATGATGATA could be considered to have 3 and ⅔ repeats of the sequence ATG. This part of the algorithm comprises the determination of repetitive sequences, after which rules (see Table 2) are used to determine which of these sequences have significant potential for polymorphism.
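The run-counting idea described above can be sketched as follows. This is a simplified illustration and not the Rep-X implementation: it scans one offset (lag) d at a time, counts consecutive positions where a base equals the base d positions earlier, and resets the count on a mismatch. The function name `find_tandem_repeats`, the parameter names and the `max_unit` bound are assumptions. The default threshold of 10 follows the significance cutoff stated in the text.

```python
def find_tandem_repeats(seq, max_unit=6, min_match=10):
    """Return (start, unit, copies) tuples for tandem repeats in `seq`.

    Simplified sketch: for each lag d, a run of s consecutive matches
    between seq[i] and seq[i - d] spans s + d bases, i.e. (s + d) / d
    copies of a d-base unit. Overlapping hits from lags that are
    multiples of one another are not consolidated here.
    """
    seq = seq.upper()
    hits = []
    for d in range(1, max_unit + 1):
        run = 0  # consecutive matches at this lag; reset on mismatch
        for i in range(d, len(seq) + 1):
            if i < len(seq) and seq[i] == seq[i - d]:
                run += 1
                continue
            # Mismatch or end of sequence: report the run if long enough.
            if run >= min_match:
                start = i - run - d
                hits.append((start, seq[start:start + d], (run + d) / d))
            run = 0
    return hits


if __name__ == "__main__":
    # The example from the text, with a lowered threshold for a short input.
    print(find_tandem_repeats("ATGATGATGATA", min_match=6))
```

With the threshold lowered to 6 for this short input, the example sequence from the text yields a single hit: unit ATG starting at position 0 with 11/3 (3 and ⅔) copies.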
Comparisons are also done to the backwards reverse strand sequence such that comparisons are made with x+1 rather than x−1 and would be equivalent to regions of the sequence able to form hairpin or palindromic structures. Also, the one-dimensional matrix for both the repetitive calculations and the base-pair forming calculations are summed, giving a collective measure of self-similarity or self-complementarity for the sequence when divided by its length. Such measures of self-similarity can be used to determine the likelihood any given sequence has to undergo an uneven recombination. Similarly, the self-complementarity measure can be used to determine the amount of RNA secondary structure formation from transient complementary base pairing.
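As a rough illustration of the self-complementarity idea, the sketch below finds the longest stretch of a sequence that can base-pair with a downstream stretch read in the opposite direction (a hairpin stem). It is a simplified stand-in: it reports only the longest stem, not the summed, length-normalized matrix described above. The function name `longest_hairpin_stem`, the `min_loop` parameter and the restriction to Watson-Crick DNA pairs are assumptions.

```python
WATSON_CRICK = {("A", "T"), ("T", "A"), ("C", "G"), ("G", "C")}


def longest_hairpin_stem(seq, min_loop=3):
    """Length of the longest stem that could form a hairpin in `seq`.

    Hypothetical helper: position i pairs with position j (j > i), and the
    stem is extended outward (i - k with j + k). At least `min_loop`
    unpaired bases must separate the innermost pair.
    """
    seq = seq.upper()
    n = len(seq)
    best = 0
    for i in range(n):
        for j in range(i + min_loop + 1, n):
            k = 0
            while i - k >= 0 and j + k < n and (seq[i - k], seq[j + k]) in WATSON_CRICK:
                k += 1
            best = max(best, k)
    return best


if __name__ == "__main__":
    seq = "GGGCAAAAGCCC"
    stem = longest_hairpin_stem(seq)
    # Stem length relative to sequence length as a crude normalized measure.
    print(stem, stem / len(seq))
```

The brute-force search is cubic in the sequence length in the worst case, which is acceptable for a short transcript but would need a smarter approach for large inputs.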
In another aspect, the invention provides methods for identifying a polymorphic repeat within a coding sequence, comprising the steps of: (a) identifying a candidate polymorphic repeat as described herein; (b) detecting the candidate polymorphic repeat in each of a population of different coding sequences, each from different individuals; and (c) determining whether the candidate polymorphic repeat is polymorphic in the population, wherein polymorphism of the candidate polymorphic repeat within the population indicates a polymorphic repeat.
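Step (c) can be illustrated with a minimal sketch that assumes the repeat copy number at the candidate locus has already been determined for each sampled sequence. The function name `summarize_repeat_alleles` is hypothetical. Expected heterozygosity is computed with the standard formula, one minus the sum of squared allele frequencies.

```python
from collections import Counter


def summarize_repeat_alleles(copy_numbers):
    """Summarize repeat-number alleles observed across a population sample.

    `copy_numbers` holds one observed repeat count per sampled sequence.
    """
    counts = Counter(copy_numbers)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one observation is required")
    # Expected heterozygosity: 1 - sum of squared allele frequencies.
    heterozygosity = 1.0 - sum((c / total) ** 2 for c in counts.values())
    return {
        "alleles": dict(counts),
        "is_polymorphic": len(counts) > 1,
        "expected_heterozygosity": heterozygosity,
    }


if __name__ == "__main__":
    print(summarize_repeat_alleles([8, 8, 9, 10]))
    print(summarize_repeat_alleles([8, 8, 8]))
```

A locus with a single observed allele is reported as non-polymorphic with heterozygosity 0; in practice a sample-size or minor-allele-frequency threshold would likely be applied as well.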
In another aspect, the invention provides methods for correlating variance at a polymorphic repeat with phenotypic variation, comprising the steps of: (a) identifying a polymorphic repeat as described herein; and (b) correlating variance at the polymorphic repeat with phenotypic variation.
In another aspect, the invention provides methods for detecting variance at a polymorphic repeat by (a) identifying a polymorphic repeat as described herein; and (b) detecting in a test coding region the presence or absence of variance at a polymorphic repeat. While the general method detects a wide variety of variation from the disclosed wild-type polymorphisms, including variation in repeat number and sequence (i.e. purity), in a particular embodiment, the variants differ only in repeat number. Variants are readily localized by flanking sequences, e.g. with flanking sequence-specific primers.
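Localization by flanking sequences can be sketched in silico as follows, for the particular embodiment in which variants differ only in repeat number. The function name `count_copies_between_flanks` and the flank sequences in the usage example are made up for illustration; the sketch assumes each flank occurs in the test sequence and that the region between them is a pure array of the repeat unit.

```python
def count_copies_between_flanks(seq, left_flank, right_flank, unit):
    """Count pure copies of `unit` lying between two flanking sequences.

    Returns None if either flank is missing or if the region between the
    flanks is not a pure tandem array of `unit`.
    """
    seq = seq.upper()
    left_flank, right_flank, unit = left_flank.upper(), right_flank.upper(), unit.upper()
    left_pos = seq.find(left_flank)
    if left_pos < 0:
        return None
    start = left_pos + len(left_flank)
    end = seq.find(right_flank, start)
    if end < 0:
        return None
    region = seq[start:end]
    copies, remainder = divmod(len(region), len(unit))
    if remainder or region != unit * copies:
        return None
    return copies


if __name__ == "__main__":
    # Hypothetical flanks around a (TG)n repeat.
    reference = "GGATCC" + "TG" * 8 + "AAGCTT"
    variant = "GGATCC" + "TG" * 10 + "AAGCTT"
    print(count_copies_between_flanks(reference, "GGATCC", "AAGCTT", "TG"))
    print(count_copies_between_flanks(variant, "GGATCC", "AAGCTT", "TG"))
```

Comparing the count from a test sequence against the count from the disclosed wild-type sequence indicates whether the test sequence carries a repeat-number variant.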
The disclosed methods were used to perform a directed, systematic and exhaustive characterization of simple sequence repeat polymorphism in transcribed (coding) regions of the UniGene database. UniGene, published by NCBI, is a system for automatically partitioning GenBank sequences into a non-redundant set of gene-oriented clusters. Each UniGene cluster contains sequences that represent (are part of) a unique gene, as well as related information such as the tissue types in which the gene has been expressed and map location. Table 1 (appended) discloses polymorphisms found in human genes of the UniGene database. The dataset encompasses both conventional and unconventional repeats. Conventional repeats are known or ascertainable from conventional polymorphic patterns including glutamine-encoding repeats (e.g. CAG) expressed in neurological and brain cells, CA repeats in the untranslated regions, and from genes known to be polymorphic (e.g. apolipoprotein(a), androgen receptor, fibrinogen, PDGFA (platelet derived growth factor A), FLT-1 (kidney disease), DBP, adenosine deaminase (ADA), dopamine D4, POU genes, collagen α2, HCAD, phosphoglycerate kinase, NOS2a, CYP2G1, serotonin transporter gene, plasminogen activator inhibitor-1 (PAI-1), elastin, etc.). Conventional repeats are indexed in Medline and are readily delineated by computational exclusion. Unconventional repeats of Table 1 are not previously characterized or suggested, i.e. have no associated indicia of polymorphism in Medline using a Dec. 31, 1999 publication date cutoff. In addition, unconventional repeats are restricted to those found in the disclosed human sequences of Table 1 and do not encompass vector sequences, which are readily ascertainable and computationally excluded by those skilled in the art of computational genomic analysis (e.g. by BLAST analysis against a vector sequence database such as published on Dec. 31, 1999 by NCBI).
While compiled for convenience in a single table, utilization of the unconventional polymorphisms of Table 1 does not require en masse usage. In fact, the polymorphisms may be used in any convenient compilation or as individual probes/targets. Hence, the polymorphic repeats of Table 1 are to be considered as individually disclosed and subject to discrete claims. Useful compilation criteria include any of those of Table 1, including regions, repeat type, length, UniGene identifiers, etc. Particular embodiments include compilations of unconventional repeats having a repeat number of at least N, wherein N is selected from an integer from 2 to 30; unconventional repeats having a unit length of at least M, wherein M is selected from an integer from 2 to 200; and unconventional repeats having a Table 1-referenced GenBank entry having an unannotated start and stop site or a “/cds=UNKNOWN” indication or the absence of a “/cds=” field.
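Compilation by repeat number (N) and unit length (M) amounts to a simple filter, sketched below. The `RepeatRecord` layout and the function name `compile_repeats` are assumptions and do not reflect the actual column structure of Table 1. The three example records use the accessions, positions and repeats named in this description for the Sperm Acrosomal Protein-1, Herpes Viral Entry Protein C and MEKK 1 repeats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepeatRecord:
    """Hypothetical record layout for one repeat entry."""
    accession: str   # GenBank accession
    position: int    # position of the repeat within the entry
    unit: str        # repeat unit sequence
    copies: int      # repeat number


def compile_repeats(records, min_copies=2, min_unit_len=2):
    """Select records with repeat number >= N and unit length >= M."""
    return [
        r for r in records
        if r.copies >= min_copies and len(r.unit) >= min_unit_len
    ]


if __name__ == "__main__":
    records = [
        RepeatRecord("AF047437", 875, "TG", 8),
        RepeatRecord("AF060231", 1464, "GAG", 8),
        RepeatRecord("AF042838", 2290, "AAC", 8),
    ]
    for r in compile_repeats(records, min_copies=8, min_unit_len=3):
        print(r.accession, r.unit, r.copies)
```

Additional criteria such as region or the presence of a “/cds=” annotation could be added as further fields and predicates in the same way.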
In another aspect, the invention provides methods for detecting variance at a polymorphic repeat by detecting in a test coding region the presence or absence of variance at an unconventional polymorphic repeat of Table 1, particularly an unconventional repeat in an uncharacterized entry.
The detecting may be effected by any convenient means, including directly by detecting or determining polynucleotide sequence (e.g. by sequencing or by probing arrays of immobilized polymorphism-specific polynucleotides) and inferentially by determining or detecting the amino acid sequence of a polypeptide translate of the coding region (e.g. by sequencing or using sequence-specific binding reagents, such as specific antibodies). Accordingly, the invention also provides isolated polynucleotides comprising and polypeptides encoded by a variant of an unconventional polymorphic repeat of Table 1, particularly a repeat in an uncharacterized entry, as well as binding agents, such as antibodies, specific thereto. Variants are readily and unambiguously identified by flanking sequence of the disclosed wild-type repeat and deviation from the disclosed repeat in terms of repeat number or purity, wherein preferred variants differ only in terms of repeat number. The term antibody is used inclusively to encompass polypeptides comprising antibody complementarity determining regions sufficient for specific binding, including specific antibody fragments, hybrids, etc.
Classes of diagnostic repeats of particular interest include those of Table 1 of each of 3′ UTR, 5′ UTR, border, CDS and unknown regions; frame-shifting mutations, i.e. those not a multiple of three in length (see Table 6, below); and those relatively common in genomic DNA but rare in coding regions (see Table 7, below), such as cyclic permutations of AAT, TTA, TTG and TTC.
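Two of the class criteria above are easy to state in code. The sketch below checks whether a repeat unit has a length that is not a multiple of three, so that a change in copy number would shift the reading frame, and enumerates the cyclic permutations of a unit so that, for example, AAT, ATA and TAA are treated as the same repeat. The function names are hypothetical.

```python
def is_frameshifting(unit):
    """True if the unit length is not a multiple of three."""
    return len(unit) % 3 != 0


def cyclic_permutations(unit):
    """All rotations of a repeat unit, e.g. AAT -> {AAT, ATA, TAA}."""
    unit = unit.upper()
    return {unit[i:] + unit[:i] for i in range(len(unit))}


def canonical_unit(unit):
    """Alphabetically smallest rotation, usable as a grouping key."""
    return min(cyclic_permutations(unit))


if __name__ == "__main__":
    print(is_frameshifting("TG"), is_frameshifting("CAG"))
    print(sorted(cyclic_permutations("AAT")))
    print(canonical_unit("TAA"))
```

Grouping by `canonical_unit` does not merge a unit with its reverse complement; whether the two strands should be pooled is a separate design choice.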
Individual diagnostic repeats of particular interest include those of the human Sperm Acrosomal Protein-1 repeat, consisting of eight TG repeats at position 875 of gb=AF047437; the human Herpes Viral Entry Protein C repeat, consisting of eight GAG repeats at position 1464 of gb=AF060231; and the human MEK kinase 1 (MEKK 1) repeat, consisting of eight AAC repeats at position 2290 of gb=AF042838.
The subject polynucleotides and fragments thereof may be joined to other components such as labels or other polynucleotide/polypeptide sequences (i.e. they may be part of larger sequences) and are of synthetic/non-natural sequences and/or are isolated, i.e. unaccompanied by at least some of the material with which it is associated in its natural state, preferably constituting at least about 0.5%, preferably at least about 5% by weight of total nucleic acid present in a given fraction, and usually recombinant, meaning they comprise a non-natural sequence or a natural sequence joined to nucleotide(s) other than that which it is joined to on a natural chromosome. Recombinant polynucleotides comprising the subject sequences or fragments thereof, contain such sequence or fragment at a terminus, immediately flanked by (i.e. contiguous with) a sequence other than that which it is joined to on a natural chromosome, or flanked by a native flanking region fewer than 10 kb, preferably fewer than 2 kb, more preferably fewer than 500 bases, most preferably fewer than 100 bases, which is at a terminus or is immediately flanked by a sequence other than that which it is joined to on a natural chromosome. While the nucleic acids are usually RNA or DNA, it is often advantageous to use nucleic acids comprising other bases or nucleotide analogs to provide modified stability, etc.
The subject polypeptides and fragments thereof are isolated or pure: an “isolated” polypeptide is unaccompanied by at least some of the material with which it is associated in its natural state, preferably constituting at least about 0.5%, and more preferably at least about 5% by weight of the total polypeptide in a given sample and a pure polypeptide constitutes at least about 90%, and preferably at least about 99% by weight of the total polypeptide in a given sample. The polypeptides may be synthesized, produced by recombinant technology, or purified from cells. A wide variety of molecular and biochemical methods are available for biochemical synthesis, molecular expression and purification of the subject compositions, see e.g. Molecular Cloning, A Laboratory Manual (Sambrook, et al. Cold Spring Harbor Laboratory), Current Protocols in Molecular Biology (Eds. Ausubel, et al., Greene Publ. Assoc., Wiley-Interscience, NY) or that are otherwise known in the art.
In a particular embodiment, the subject polypeptides provide specific antigens and/or immunogens for making specific antibodies, especially when coupled to carrier proteins. For example, polypeptides are covalently coupled to keyhole limpet hemocyanin (KLH) and the conjugate is emulsified in Freund's complete adjuvant. Laboratory rabbits are immunized according to conventional protocol and bled. The presence of specific antibodies is assayed by solid phase immunosorbent assays using immobilized corresponding polypeptide.