The advent of new experimental technologies that support molecular biology research has resulted in an explosion of data and a rapidly increasing diversity of biological measurement data types. Examples of such measurement types include gene expression from DNA microarray or TaqMan experiments, protein identification from mass spectrometry or gel electrophoresis, cell localization information from flow cytometry, phenotype information from clinical data or knockout experiments, and genotype information from association studies and DNA microarray experiments. These data are changing rapidly, and new technologies frequently generate new types of data.
High-throughput techniques are generating huge amounts of biological data that are readily available but must still be interpreted. Experiments that simultaneously measure thousands of genes and proteins under different conditions (microarray, imminent protein-array technologies, etc.) are becoming the norm in both academia and pharmaceutical/biotech companies. A large number of these experiments are conducted in an attempt to solve one piece of a larger puzzle: understanding biological processes. Biologists need tools that help them establish relationships among these heterogeneous data and extract, build, and verify interpretations and hypotheses about them.
In addition to data from their own experiments, biologists also utilize a rich body of information available from internet-based sources, e.g., genomic and proteomic databases, and from the scientific literature. The structure and content of these sources are also rapidly evolving. The software tools used by molecular biologists need to gracefully accommodate new and rapidly changing data types.
Scientific text (publications, reports, interpretations, patents, etc.) and biological models (pathway diagrams, protein-protein interaction maps, etc.) are rich repositories of information about the current understanding of how biological processes function. Given the volume of high-throughput experiments and results that scientists must deal with, there is a need to identify information about entities of interest (genes, proteins, molecules, diseases, drugs, etc.) in the vast literature and in existing biological models, and to be able to verify or validate that information using proprietary experimental results.
A number of databases, both public domain and proprietary, have been developed that allow users to query and download scientific articles and biological models of interest. These include literature databases (e.g., PubMed, Google-Citeseer, OMIM, USPTO Patent database) and biological model databases (e.g., KEGG, TRANSFAC, TRANSPATH, SPAD, BIND). However, these databases cannot effectively capture the context of a user query.
For example, search results returned by a literature database are based on the occurrence of keywords that the user provides as search terms, and these search terms are limited in their ability to capture the exact information the user is seeking. As a result, the results returned by these search engines can be very broad, and users are left to sift through all of the returned text to extract the information of interest. In other words, the actual content of the text is not understood, and the onus is on the user to manually read each abstract and judge the relevance of the search results. It is clearly not possible to capture the context of any arbitrary user query on the server side, i.e., at the central database server.
Since only the user knows the context of the query, it would be useful to provide the user with tools to query these databases and extract information that is contextually relevant. Presently, no tools exist that enable the user to do so. In fact, most databases are centralized services that base relevance judgments solely on the data provided by the user (the search terms, interaction with the database, etc.).