1. Field of the Invention
The present invention relates generally to database processing, and more particularly to a system and method for efficiently searching and extracting relevant data, and for performing contextual data searches on databases comprising named annotated text strings, such as biological sequence databases.
2. Related Art
For nearly thirty years, scientists have been collecting biological sequence data on different types of organisms, ranging from bacteria to human beings. Much of the data collected is stored in one or more databases shared by scientists around the world. For example, a genetic sequence database referred to as the European Molecular Biology Lab (EMBL) gene bank is maintained in Germany. Another example of a genetic sequence database is Genbank, which is maintained by the United States Government.
Specifically, Genbank is a public nucleic acid sequence database operated by the National Center for Biotechnology Information (NCBI), a part of the National Library of Medicine (NLM), which is itself a part of the National Institutes of Health (NIH). Currently, the Genbank database may be queried using NCBI's Website (www.ncbi.nlm.nih.gov) or accessed through one of several specialized NCBI e-mail servers. Additionally, the Genbank database may be downloaded either in its entirety or in part from NCBI's anonymous FTP server.
Genbank is compiled from international sources and currently comprises sequence data in the following 13 categories: "primate," "mammal," "rodent," "vertebrate," "invertebrate," "organelle," "RNA," "bacteria," "plant," "virus," "bacteriophage," "synthetic," and "other". Genbank is logically organized as 17 sub-databases sharing a common naming convention and schema. These sub-databases correspond roughly to the major research organisms listed above, derived sequences such as plasmids and patented sequences, and sequences that are produced by the various complete genome projects.
The potential benefits gained by studying genetic sequences and understanding genetic coding are boundless. For example, such understanding can lead to the discovery of genes that affect the incidence and severity of diseases. Understanding genetic sequences can lead to the diagnosis, treatment and prevention of genetic diseases and to the design of drugs that can specifically target critical protein sites. In addition, studying genetic sequences facilitates our understanding of evolutionary biology.
The Human Genome Project (HGP) is an international research program carried out in the United States by the National Human Genome Research Institute and the US Department of Energy. The ultimate task of sequencing all 3 billion base pairs in the human genome will provide scientists with a virtual instruction book for a human being. From there, researchers can begin to unravel biology's most complicated processes.
The problem is that such enormous undertakings necessarily generate huge and ever-increasing amounts of data. Databases such as Genbank facilitate the process of organizing and disseminating such data to scientists around the world. However, it has proven to be extremely challenging not only to manage and disseminate the data, but more importantly, to perform meaningful analysis on such voluminous databases. The data analysis problem is due, in part, to the format of the data provided by databases such as Genbank.
The Genbank database and other similar databases comprise a set of named annotated text strings (NAT). The so-called "text string" portion of the Genbank and other biological databases is the actual recorded sequence data. The annotations comprise documented information about the sequence data or portions thereof. Each element or entry has a unique name. Such databases are inherently difficult to process using conventional database query languages, such as SQL and the like.
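The three parts of a named annotated text string described above (a unique name, the text string itself, and annotations about the string or portions thereof) can be pictured with a minimal sketch. The class names, field names, offset convention, and sample entry below are hypothetical and chosen only for illustration; they are not Genbank's actual record layout.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Annotation:
    """Documented information about a portion of the text string (hypothetical layout)."""
    start: int      # 0-based offset where the annotated portion begins
    end: int        # offset one past the end of the annotated portion
    label: str      # short description of the portion
    note: str = ""  # free-text comment entered by a researcher


@dataclass
class NamedAnnotatedString:
    """One entry: a unique name, the recorded sequence, and its annotations."""
    name: str
    sequence: str
    annotations: List[Annotation] = field(default_factory=list)

    def annotated_text(self, ann: Annotation) -> str:
        # Return the portion of the sequence that the annotation describes.
        return self.sequence[ann.start:ann.end]


# Illustrative entry; the name and sequence are made up.
entry = NamedAnnotatedString(
    name="EXAMPLE0001",
    sequence="ATGGCTAAGTGA",
    annotations=[
        Annotation(0, 12, "coding region", "illustrative note"),
        Annotation(0, 3, "start codon"),
    ],
)

start_codon = entry.annotated_text(entry.annotations[1])
```

Because each annotation refers to a span of the string rather than to a fixed column, such entries do not map naturally onto the flat rows that a conventional query language expects.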
Currently, the version of the Genbank database available through NCBI's FTP server consists of a set of individual files. Each file contains sequences from a single sub-database, which may itself comprise multiple files. The partitioning of Genbank in this fashion allows investigators to load (and search) only as much or as little of the database as they require. This has proven to be quite an advantage, as the current Genbank release (release 111.0, April 1999) contains over 3.5 million entries ("loci") and requires about 7.5 GB of (uncompressed) disk space.
However, performing meaningful data analysis on the voluminous Genbank database and other similar databases has proven to be extremely problematic. This is due to many factors, including the complexity, the data format, and the sheer size of the data itself. Such data is very difficult to analyze using conventional means. In addition, because these databases have been in place for so many years, and are shared by scientists throughout the world, it is difficult to incorporate changes, even if such changes are advantageous to researchers.
Thus, at least for the foreseeable future, researchers must continue to deal with such data in much the same format as is currently implemented. The difficulty of working with the data is unavoidable, not only because of the factors listed above, but also because our understanding of the sequences is incomplete and often incorrect.
Further, there is no standard vocabulary by which the entries are described. For example, comments and notes are typically entered by researchers in plain text, which is generally unrestricted as to its format. Suppose a researcher conducts a search for bacteria sequences that are resistant to antibiotics. This search would be trivial if all researchers were restricted to a particular keyword description for this particular characteristic, such as "antibiotic resist" or the like. However, because no restrictions are enforced, researchers describe this phenomenon with different terms such as "antibiotic resist," "penicillin resistance," "beta-lactamase" and the like.
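The vocabulary problem can be made concrete with a small sketch. The entry names and note texts below are invented for illustration, and the two helper functions are hypothetical; the sketch shows only why a single-keyword search misses relevant entries, not how any particular system addresses the problem.

```python
# Hypothetical free-text notes keyed by entry name (all values are made up).
notes = {
    "entry_a": "plasmid conferring antibiotic resist",
    "entry_b": "penicillin resistance gene",
    "entry_c": "encodes beta-lactamase",
    "entry_d": "ribosomal protein",
}


def keyword_search(notes, keyword):
    """Return names of entries whose note contains the single keyword."""
    keyword = keyword.lower()
    return sorted(name for name, text in notes.items() if keyword in text.lower())


def multi_term_search(notes, terms):
    """Return names of entries whose note contains any of the given terms."""
    terms = [t.lower() for t in terms]
    return sorted(
        name for name, text in notes.items()
        if any(t in text.lower() for t in terms)
    )


# A single keyword finds only the entry that happens to use that exact wording.
single = keyword_search(notes, "antibiotic resist")

# The searcher must already know every alternative wording to find the rest.
broad = multi_term_search(
    notes, ["antibiotic resist", "penicillin resistance", "beta-lactamase"]
)
```

Here the single-keyword search returns one entry, while the search over all three terms returns three. The second search works only because the alternative wordings were known in advance, which is generally not the case with unrestricted plain-text annotations.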
In addition, it would be desirable and very valuable to conduct searches for certain sequences that are in the context of other sequences. This is a very difficult problem that has thus far remained unresolved using current systems.
Therefore, what is needed is a system and method that can operate on named annotated text string databases, such as biological sequence databases, in an efficient and meaningful manner. Further, what is needed is a system and method that can perform in-context database searches on named annotated text string databases.