1. Field of the Invention
This invention relates generally to a method and system to analyze, interpret, and distribute the genetic information of any biological organism, either natural or artificially modified or created. It is noteworthy that over the five years since the “Provisional Application” the nomenclature of genomics has evolved significantly. Throughout this “CIP Application,” the word “genetic” is used in its extended present-day sense of “genomic.”
2. Description of the Prior Art
Watson and Crick in 1953 discovered that DNA contains A, C, T, and G base-pairs in a double-helical arrangement. For example, see “A Structure for Deoxyribose Nucleic Acid,” Nature, by Watson, J. D. and Crick, F. H. C., Apr. 2, 1953, v. 171, p. 737 (1953), which is hereby incorporated by reference. Their discovery immediately shed light on the utility of the double helix in heredity (how the double helix would split and each strand of half base-pairs would be reconstituted). It also became evident that the sequence of base-pairs is the code of life, but the discovery of the code is not the same as deciphering the code.
Paul Berg in 1980 disclosed a method by which DNA base-pair sequences composed of A, C, T, and G can be mapped out, such that properties of DNA base-pair sequences could be gradually revealed. For example, see “Biochemical method for inserting new Genetic Code into DNA of simian virus 40: Circular SV40 DNA molecules containing lambda phage genes and the galactose operon of Escherichia coli,” by Jackson, D. A., Symons, R. H. & Berg, P., Proc Nat Acad Sci USA 69, 2904-2909, October (1972), which is hereby incorporated by reference. In the 1990s, the Human Genome Project resulted in a broad genome map, which was deposited into data banks. Sequencing of DNA unveiled, in constantly updated data banks, the raw information about the DNA sequence of several species.
However, no one claimed (and nobody claims even at this time of submission of the “CIP”) that the mere mapping of the human genome actually deciphered the code; that is, that it gave an explanation of how the genic and non-genic parts of the DNA together, as well as the RNA, non-coding RNA, and microRNA systems, govern the growth (physiological and pathological) of organelles (such as a single cell), organs (such as the lung) and organisms (such as Mycoplasma genitalium, the bacterium with the smallest DNA of the free-living organisms), let alone a human being. Initially, the human DNA was mapped out in broad detail in 2001 (e.g., see Human Genome Sequencing Consortium: “Initial Sequencing and Analysis of the Human Genome”, Nature, 409 (2001): 860-921; and Venter, J. C. (and 274 co-authors) “The Sequence of the Human Genome”, Science 291 (2001) 1304-1351, which are hereby incorporated by reference). Nonetheless, while the discovery of the double helix and its mapping in 2001 have been impressive achievements, neither provided a means to interpret (decipher) the genetic code itself.
For example, until June of 2007, 98.7% of the human DNA was considered by many scientists to be “junk” DNA or “introns” and other “non-coding sequences”; in this application, “intron” is used in its general sense of all “non-protein coding DNA.” Only 1.3% of the human DNA consists of “exons” (i.e., protein-coding DNA, referred to in this application in broad terms as “genes”). However, some scientists suspected even at the time of the “Provisional Application” (2002) that most of the non-coding DNA “introns” actually assist gene expression in some way, as their removal is lethal. The ENCODE pilot results, available now in 2007, eliminate the need for further “undue experimentation” to establish that the repetitive nature of the “non-coding DNA” and its apparent functionality require inventions such as those submitted in the Provisional, Original, and now this CIP Application.
Repetitive Nature of Genomic Nucleotide Sequences
Genomic nucleotide sequences are widely known to contain repetitive segments. There are, however, three different types, whose definitions this “CIP Application” refines for clarification: “Identical Dispersed Repeats” (hereinafter IDR, or “Identical Repeats”), “Closely Similar Repeats” (hereinafter CSR, or “Close Repeats”), and “Identical Continued Repeats” (hereinafter ICR, or “Runs”).
Identical Dispersed Repeats (IDR, or “Identical Repeats”)
Nucleotide sequences with identical nucleotides at all positions in each repeated occurrence of said sequences, where the said sequences can be dispersed in any part of the genome. An example of IDR or “Identical Repeats” is line 3 of FIG. 1, which also occurs in identical form in line 4.
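By way of illustration only, the IDR definition above can be sketched in Python. The function name `find_identical_repeats`, the fixed window length `k`, and the toy sequence are assumptions made for this sketch and are not taken from FIG. 1 or from the Sequence Listing. The sketch indexes every length-`k` window and reports those that occur at two or more positions.

```python
from collections import defaultdict


def find_identical_repeats(genome: str, k: int) -> dict[str, list[int]]:
    """Map each length-k subsequence that occurs 2+ times to its 0-based start positions."""
    if k <= 0:
        raise ValueError("k must be positive")
    positions = defaultdict(list)
    # Slide a window of length k over the sequence and record where each window starts.
    for start in range(len(genome) - k + 1):
        positions[genome[start:start + k]].append(start)
    # Keep only subsequences seen more than once (identical at every position).
    return {seq: starts for seq, starts in positions.items() if len(starts) > 1}


# Hypothetical toy sequence, used only to exercise the function.
toy_sequence = "ACGTTGCAACGTTTGCA"
repeats = find_identical_repeats(toy_sequence, 4)
print(repeats)
```

Note that this sketch also reports overlapping occurrences; whether overlapping copies should count as "dispersed" is left open here.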
Closely Similar Repeats (CSR, or “Close Repeats”)
Nucleotide sequences whose nucleotides are not identical at all positions in each repeated occurrence of said sequences, but in which up to and including ⅓ of the nucleotides may vary from the “reference sequence”, where the said sequences can be dispersed in any part of the genome. An example of CSR or “Close Repeats” is all lines except 3 and 4 in FIG. 1, where the “reference sequence” is contained in line 3 or 4, and it is clearly visible that the blacked-out nucleotides that vary from the reference are fewer than ⅓ of the total number of nucleotides in the reference sequence.
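The ⅓ threshold in the CSR definition can be illustrated with a short Python sketch. It assumes, as a simplification not stated in the definition, that the candidate and the reference have equal length and are compared position by position without insertions or deletions. The function name `classify_repeat` and the example sequences are hypothetical.

```python
def classify_repeat(reference: str, candidate: str) -> str:
    """Classify a candidate against a reference as 'identical', 'close', or 'unrelated'."""
    if not reference or len(reference) != len(candidate):
        raise ValueError("sequences must be non-empty and of equal length")
    # Count the positions at which the candidate varies from the reference.
    mismatches = sum(ref != cand for ref, cand in zip(reference, candidate))
    if mismatches == 0:
        return "identical"  # satisfies the IDR definition
    # "Up to and including 1/3" of the nucleotides may vary; integer arithmetic
    # avoids floating-point rounding at the boundary.
    if 3 * mismatches <= len(reference):
        return "close"  # satisfies the CSR definition
    return "unrelated"


print(classify_repeat("ACGTAC", "ACGTAC"))
print(classify_repeat("ACGTAC", "ACGTTG"))
print(classify_repeat("ACGTAC", "TCGTTG"))
```

Two mismatches out of six nucleotides sits exactly on the ⅓ boundary and is therefore still treated as a Close Repeat in this sketch.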
Identical Continued Repeats (ICR, or “Runs”)
Very short (2-6 nucleotide) sequences with identical nucleotides at all positions, where the repeats follow one another in a continued fashion, two or more times, on occasion up to hundreds of times. (An example of ICR or “Run” is “CA” re-occurring immediately one after the other, or the “GAA” triplet-run that is known to cause Friedreich's Spinocerebellar Ataxia if the number of triplets is beyond an “acceptable range” (to be defined later)). The importance of the distinction between Identical Repeats, Close Repeats and Runs is that this CIP Application considers “Identical Repeats” and “Close Repeats” as capable of exhibiting fractal organization, while Runs in several instances are known to be the cause of hereditary diseases, and this CIP Application considers them “defects” when inserted into a fractal structure. (The skilled artisan will note that IDR, ICR, and CSR are not necessarily mutually exclusive, just as in number theory some “odd” numbers are also “prime”, but not all odd numbers are prime).
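As a minimal sketch of the ICR definition, a regular expression with a back-reference can locate a 2-6 nucleotide unit that is immediately followed by one or more copies of itself. The names `RUN_PATTERN` and `find_runs` and the toy sequences are assumptions for illustration. No disease-related threshold is encoded, since the “acceptable range” is not defined at this point in the text.

```python
import re

# A unit of 2-6 nucleotides (shortest unit preferred) followed by 1+ exact copies of itself.
RUN_PATTERN = re.compile(r"([ACGT]{2,6}?)\1+")


def find_runs(sequence: str, min_copies: int = 2) -> list[tuple[int, str, int]]:
    """Return (0-based start, repeat unit, number of consecutive copies) for each run."""
    runs = []
    for match in RUN_PATTERN.finditer(sequence):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        if copies >= min_copies:
            runs.append((match.start(), unit, copies))
    return runs


# Hypothetical toy sequences containing a "GAA" triplet-run and a "CA" run.
print(find_runs("TTGAAGAAGAAGAACC"))
print(find_runs("GGCACACAGG"))
```

The lazy quantifier makes the pattern prefer the shortest repeating unit, so “CACACA” is reported as three copies of “CA” rather than as a longer unit.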
FIG. 1 of this “CIP Application” provides a hard-copy reference of IDR and CSR sequences, copied from the article “Kangaroo, a Mobile Element From Volvox carteri, Is a Member of a Newly Recognized Third Class of Retrotransposons,” by Leonard Duncan, Kristine Bouckaert, and David L. Kirk, in Genetics, Vol. 162, 1617-1630, December 2002, which is hereby incorporated by reference. The sequence was duly attached electronically in the format required by USPTO in the Parent Application. The Sequence Listing is submitted in a file named CRF_Sequence_ASCII.txt on a compact disc. The material on this compact disc is hereby incorporated by reference. The above-quoted article provides good examples of the highly repetitious nature of DNA in several species. Such repetitions are also present in the 98.7% of the human DNA referred to as “junk” DNA. Even from earlier examples it became increasingly evident at the time of inception of the “Provisional Application” that the DNA base-pair sequences are highly repetitious. FIGS. 3, 4, and 5 show, only to illustrate how the concept of “fractal DNA governing the growth of fractals of organisms” originated, CSR sequences that can be displayed in four lines to show “close similarity” (FIGS. 3 and 4), and it is further illustrated that the first two and second two lines can be joined such that the two halves of the sequence show “close similarity” (FIG. 5).
Some “fractal features” (see the inventor's definitions in Section “Specification”) of DNA were recognized in articles such as “Hints of a language in junk DNA,” by F. Flam, Science 266:1320 (1994), and “Linguistic features of non-coding DNA sequences,” by Mantegna R N. et al., Physical Review Letters 73: 3169-3172, 1994, which are hereby incorporated by reference. These articles were based on the use of fractal geometry, as disclosed in The Fractal Geometry of Nature, by Benoit B. Mandelbrot, W. H. Freeman and Company, New York (1977). Chapter 39, pp. 349-390: “Mathematical Backup and Agenda,” is hereby incorporated by reference.
Mandelbrot disclosed mathematics describing elements of nature, coastlines, landscapes, plant arbors, etc. with non-integer (fractal) dimension. For instance, coastlines are lines, yet their dimensionality is not one, as is the case of a straight line, nor does it fill the entire two dimensions of the flat water-surface. (Their dimensionality is a non-integer number somewhere between 1 and 2, depending on how completely the line fills the plane).
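One common way to arrive at a non-integer dimension, offered here only as an illustrative sketch, is the similarity dimension of an exactly self-similar shape: if the shape consists of N copies of itself, each shrunk by a factor s, then D = log N / log s. Natural coastlines are not exactly self-similar and their dimension is estimated empirically, so the Koch curve is used below as a stand-in. The function name `similarity_dimension` is an assumption of this sketch.

```python
import math


def similarity_dimension(copies: int, shrink_factor: float) -> float:
    """Dimension D = log(N) / log(s) for a shape made of N copies, each shrunk by factor s."""
    if copies < 1 or shrink_factor <= 1:
        raise ValueError("need copies >= 1 and shrink_factor > 1")
    return math.log(copies) / math.log(shrink_factor)


# A straight segment splits into 3 copies at 1/3 scale: dimension 1.
print(similarity_dimension(3, 3))
# The Koch curve replaces each segment with 4 copies at 1/3 scale: between 1 and 2.
print(similarity_dimension(4, 3))
# A filled square splits into 9 copies at 1/3 scale: dimension 2 (up to rounding).
print(similarity_dimension(9, 3))
```

The Koch value of roughly 1.26 lies between the dimension of a line and that of a plane, which is the situation the paragraph above describes for coastlines.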
Mandelbrot investigated the relationship between fractals and nature using the discoveries made by Gaston Julia, Pierre Fatou and Felix Hausdorff, (see The Fractal Geometry of Nature, Freeman, revised edition (1983)), which is hereby incorporated by reference. Mandelbrot disclosed that many fractal patterns existed in nature, and that fractal analysis could accurately model some natural phenomena. He also introduced new types of fractals to model more complex structures like trees or mountains. By furthering the idea of a fractional dimension, Mandelbrot made fractals a very rich field of analysis.
In addition, information that distinguishes one fractal set from another is contained in a “residue” left over after iterated function systems have been employed to compress the data in the sets. For computation, this “residue” is often the most important element. Compression makes fractals ideal for genetic code encapsulation of the intricate features of complex organisms that would otherwise require more information than the DNA is capable of coding if separate (uncompressed) information were used. One example of such intricacy would be the specification of each branchlet of the neurons in the brain (numbering approximately 10 to the 12th power).
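The compression property of an iterated function system can be illustrated with a minimal sketch, which is not the method of this application. It assumes a textbook example, the Sierpinski triangle, in which three maps of the form "halve the coordinates, then add an offset" are applied repeatedly. The names `SIERPINSKI_OFFSETS` and `iterate_ifs` are hypothetical. The sketch does not model the “residue” discussed above; it only shows that three short rules generate a point set that grows as 3 to the power of the iteration depth.

```python
Point = tuple[float, float]

# Each map sends (x, y) to (x / 2 + dx, y / 2 + dy); three offsets define the whole figure.
SIERPINSKI_OFFSETS: list[Point] = [(0.0, 0.0), (0.5, 0.0), (0.25, 0.5)]


def iterate_ifs(seed: Point, offsets: list[Point], depth: int) -> list[Point]:
    """Apply every map to every point, `depth` times, starting from a single seed point."""
    points = [seed]
    for _ in range(depth):
        points = [(x / 2 + dx, y / 2 + dy) for (x, y) in points for (dx, dy) in offsets]
    return points


points = iterate_ifs((0.0, 0.0), SIERPINSKI_OFFSETS, depth=6)
# Three rules and a depth are enough to describe all of these points.
print(len(SIERPINSKI_OFFSETS), "maps ->", len(points), "points")
```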
Efforts to sequence the genome have relied on a map-based approach, because over 50 percent of the genome in higher mammals (up to 98.7% in the human genome) is repetitive. So many parts of the genome look similar to other parts (of the genome) that if you only work with small pieces of genetic code, it is tempting to try to stick similar pieces from different parts together. “The physical map allows us to work with large pieces and to know where the little ones are supposed to go,” according to John D. McPherson, co-director of the Washington University Genome Sequencing Center.
For several years it has been suspected that junk DNA may not be junk after all. (Quoted from Gene exchange #2, 1996). Although 98.7% of human DNA does not obviously code for proteins, and appears to consist of “meaningless” repetitive sequences, the possibility that this seemingly useless DNA has some unknown function has fascinated scientists.
It is well established that intron and other non-coding DNA sequences regulate gene expression in positive and negative ways, provide post-transcriptional regulatory options, and provide structure, among other functions. However, the problem of how introns, together with the exons, could be used to understand the meaning of the coded DNA message has previously been unresolved.
More than 95 percent of DNA is called “junk DNA” by molecular biologists, because they are unable to ascribe any function to it. However, it has been found that the sequence of the syllables is not random at all and has a striking resemblance to the structure of human language. Therefore, scientists now generally believe that this DNA must contain some kind of coded information. But the basic concept of the code and its function is still largely unknown. It has been speculated that this region of DNA may contribute to cellular processes, such as the regulation of transcription. Therefore, deciphering the information coded in the regulatory regions may be critical to the understanding of transcription on a genomic scale. Yet the development of computational tools for identifying regulatory elements has lagged behind that of tools for sequence comparison and gene discovery. Former approaches to deciphering regulatory regions use co-regulated genes and then find a pattern common to most of the upstream regions.
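The former approach mentioned above, finding a pattern common to most upstream regions of co-regulated genes, can be sketched in a highly simplified form as exact k-mer counting. Published motif-finding tools generally allow mismatches and use statistical models, which this sketch does not attempt. The function name `common_kmers`, the `min_fraction` parameter, and the upstream sequences are all hypothetical toy values.

```python
from collections import Counter


def common_kmers(upstream_regions: list[str], k: int, min_fraction: float = 0.8) -> list[str]:
    """Return the k-mers present in at least `min_fraction` of the upstream regions."""
    if not upstream_regions:
        raise ValueError("at least one upstream region is required")
    region_counts = Counter()
    for region in upstream_regions:
        # Use a set so that each k-mer is counted at most once per region.
        region_counts.update({region[i:i + k] for i in range(len(region) - k + 1)})
    threshold = min_fraction * len(upstream_regions)
    return sorted(kmer for kmer, count in region_counts.items() if count >= threshold)


# Hypothetical upstream regions of four "co-regulated" genes (toy data only).
regions = ["GGTATAAAGC", "CCTATAAATT", "ATATAAAGGG", "CGCGCGCGCG"]
print(common_kmers(regions, k=6, min_fraction=0.75))
```

With these toy inputs, only one 6-mer appears in three of the four regions and so meets the 75 percent threshold.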