The completion of the mapping of the human genome in 2000 has led to an increased focus on functional genomics, i.e., extracting functional knowledge regarding various biological processes. Various experimental methods and tools are being invented to shed light on the functioning of processes within various organisms, with the final goal being to understand these processes in humans. A common way to represent known functional biological knowledge is via pathway diagrams, cellular networks, and diagrams of biological and chemical models.
These representations are used to display information such as signal transduction pathways, regulatory pathways, metabolic pathways, protein-protein interactions, etc. The diagrams represent biological relationships (such as bind, cleave, inhibit, promote, catalyze, etc.) between entities (genes, proteins, mRNA, other molecules of interest), along with their localization within the cell, tissue, or organism. These visual representations are graphical in nature and are static images, i.e., they cannot be revised, supplemented, or otherwise edited. Hence, they present the results for human visualization rather than in a machine-interpretable format.
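By way of illustration only, the contrast between a static image and a machine-interpretable representation can be sketched as follows. The class names, field names, and the placeholder entities `protein_A` and `gene_B` are assumptions introduced for this sketch; they are not drawn from any of the databases or tools discussed herein.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    name: str      # hypothetical identifier, e.g. "protein_A"
    kind: str      # e.g. "gene", "protein", "mRNA"
    location: str  # localization, e.g. "nucleus", "cytoplasm"


@dataclass(frozen=True)
class Relationship:
    source: Entity
    verb: str      # e.g. "bind", "cleave", "inhibit", "promote", "catalyze"
    target: Entity


def relationships_involving(model, name):
    """Return every relationship in which the named entity takes part."""
    return [r for r in model if name in (r.source.name, r.target.name)]


# Placeholder entities and relationships; not real biological findings.
protein_a = Entity("protein_A", "protein", "cytoplasm")
gene_b = Entity("gene_B", "gene", "nucleus")
protein_c = Entity("protein_C", "protein", "cytoplasm")

model = [
    Relationship(protein_a, "inhibit", gene_b),
    Relationship(protein_c, "bind", protein_a),
]

# Unlike a static image, the model can be edited and queried.
model.append(Relationship(protein_c, "promote", gene_b))
hits = relationships_involving(model, "gene_B")
```

In such a form, adding, removing, or querying a relationship is an ordinary data operation, which is not possible when the same information exists only as pixels in an image.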
Biologists are in need of tools that facilitate their use of biological models beyond the ability to simply visually compare such models with other information and data. There is a need for tools that would enable a researcher not only to view various biological models, but also to supplement, edit, or otherwise modify these models in accordance with the researcher's understanding gained from the researcher's own work (e.g., comparison with local data) as well as from comparisons with other published data.
A number of biological model databases (e.g., KEGG, Transfac, Transpath, SPAD, Bind, etc.) have been developed, both public domain and proprietary, that allow users to query and download biological models of interest. However, as noted, the user can only view these biological models after downloading them, and cannot add meaningful data or edits to a model, given its static nature. Thus, it is cumbersome to import these diagrams, manually extract contents from them, and link the extracted information to other types of data (such as experimental data, scientific text, information about entities of interest, etc.).
Although a great deal of research and development exists with regard to optical character recognition (OCR) and image processing, the present inventors are not aware of any tools that currently map standardized graphics in the biology domain to a machine-readable/interpretable format, and which use information from the standardized graphics to develop editable and modifiable views of the underlying biological models. The bulk of the research in the OCR and image processing fields has concentrated on scanners and other systems, with attempts being made to convert information present in paper form to a concise digital form. For example, one goal of these systems is to reduce the amount of space required to store the scanned document: an image of a page of text would require orders of magnitude more storage space than the ASCII representation of the text therein. Thus, these systems are faced with a very general problem: the scanned document may have any font, may contain any combination of images and graphics, may not be properly oriented, etc. In other words, the input to such systems is not restricted.
U.S. Pat. No. 5,522,022 to Rao et al. discloses an image analysis technique which is generally applicable for analyzing images that show node-link structures, such as directed graphs, undirected graphs, trees, flow charts, circuit diagrams, and state-transition diagrams. Because of its general applicability, the technique is designed to resolve ambiguity that exists across various types of graphical representations, e.g., labels resembling nodes or links, other characters or lines which may be confused with links or nodes, and the like. Because this technique does not begin with any predefined parameters with regard to an image type, it is geared toward resolving these ambiguities and identifying nodes and links that exist in a diagram. To do this, the technique first identifies “likely node-link data” indicating parts of an input image that are likely to contain a link and/or node. Likely node-link data are data from those parts of the image which satisfy a constraint on nodes and those parts which satisfy a constraint on links. The likely node-link data are then used to define constrained node-link data indicating subsets of the likely nodes and links that satisfy a constraint on node-link structures. The constrained node-link data are obtained by iteratively applying a link nearness criterion to the likely nodes and a node nearness criterion to the likely links until stability is reached. This approximation approach is necessitated by the lack of a standardized format of the images being processed, as well as the need to make the technique generally applicable to many types of graphical images.
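The general idea of iterating two nearness criteria until stability is reached can be illustrated with a minimal sketch. This is not the implementation disclosed by Rao et al.; the point-based node representation, the endpoint-based link representation, the specific criteria, the distance tolerance `tol`, and all function names are assumptions made solely for illustration.

```python
from math import hypot


def near(p, q, tol):
    """True if points p and q lie within tol of each other."""
    return hypot(p[0] - q[0], p[1] - q[1]) <= tol


def constrain(nodes, links, tol=5.0):
    """Prune likely nodes and links until neither set changes.

    nodes: dict mapping a name to an (x, y) position.
    links: dict mapping a name to a pair of (x, y) endpoints.
    Assumed criteria: a link survives if each endpoint is near some
    surviving node; a node survives if it is near an endpoint of some
    surviving link.
    """
    nodes, links = dict(nodes), dict(links)
    while True:
        # Node nearness criterion applied to the likely links.
        kept_links = {
            k: (a, b)
            for k, (a, b) in links.items()
            if any(near(a, n, tol) for n in nodes.values())
            and any(near(b, n, tol) for n in nodes.values())
        }
        # Link nearness criterion applied to the likely nodes.
        kept_nodes = {
            k: n
            for k, n in nodes.items()
            if any(near(n, a, tol) or near(n, b, tol)
                   for a, b in kept_links.values())
        }
        # Stability: a full pass removed nothing.
        if kept_nodes == nodes and kept_links == links:
            return kept_nodes, kept_links
        nodes, links = kept_nodes, kept_links


# Hypothetical input: two connected nodes, one stray mark mistaken
# for a node, and one stray stroke mistaken for a link.
likely_nodes = {"A": (0, 0), "B": (10, 0), "stray_mark": (100, 100)}
likely_links = {
    "ab": ((0, 0), (10, 0)),
    "stray_stroke": ((10, 0), (50, 50)),
}
final_nodes, final_links = constrain(likely_nodes, likely_links)
print(sorted(final_nodes), sorted(final_links))
```

The loop terminates because each pass can only remove candidates, and it stops on the first pass that removes none. A technique of this kind must search for structure in this way precisely because it cannot assume a standardized drawing convention for the input image.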
There remains a need for tools and methods that efficiently facilitate the use of biological models beyond the ability to simply visually compare such models with other information and data, and which would enable a user to not only view various biological models, but to also supplement, edit or otherwise modify these models.