The sequencing of the human genome has lead to the development of scientific fields such as pharmacogenomics, and personalized medicine. The genetic profile plays a vital role in these fields, which involve a significant amount of processing on the sequence data itself. The complete human genome is thought to be approximately 4 billions bases in length. Thus, storing information for a large population, and allowing efficient access to these sequences, is desirable.
Further, in some cases, treatment provided to a patient for a specific disease depends upon the patient's genetic profile. The genes that are expressed (or not expressed) depend on the genetic profile of the patient. The expression (or non-expression) of some genes leads to the observed disease (phenotypes). The levels of expression and also the kind of expression (which defines the structure of the protein) determine the type of treatment, and the drugs prescribed.
The genetic profile plays a vital role in the drug-discovery process, especially in the initial stages of screening of targets. Companies are expected to develop effective (both in cost and efficacy) drugs, which is possible only by having an effective drug discovery process. The identification and screening of targets and the development/identification of leads takes up a large proportion of the investment in a drug discovery. Every false positive adds a significant cost until identified as ineffective.
Various association studies using genetic profiles and expressed phenotypes allow scientists to prune the target search space effectively. This allows the time taken for discovery to be reduced, and also allows them to choose the target population on whom the drug would be effective and also results in reduced patient targeting time and higher efficacy of the drug on the target population.
Currently, portions of the genetic profile are stored and processing is performed using these short sequences. With new discoveries and ever improving understanding of the genetic sequence, the requirement to store entire sequences becomes inevitable.
The current high-level structures used to annotate sequence data are in the form of markers, exons etc. The bio-dictionary is one such effort, in which markers with sufficient support have been identified and annotated. Similarly, other dictionaries can be developed that contain patterns that identify specific markers/structures among the sequences that are most relevant to the study.
Accordingly, a need exists for an improved manner of data representation for genetic information.