Over the past several decades, there has been an explosion in the volume and complexity of information available to consumers of information. As a result, a large amount of disparate information is now available in the public domain. Some of this information is buried in, for example, magazines, journals, papers, newspapers, books, textbooks, and notebooks. Other information is stored in many different digital formats and in many different types of information stores, such as databases and digital libraries.
One field that has seen a tremendous explosion of information over the past several decades is the life sciences field. The primary impediment for a researcher is no longer a lack of information but, rather, the large quantity of information and the unstructured formats used to store it. For example, a chemical researcher who wishes to search the above-mentioned sources of information for chemical structures, chemical substructures, and/or chemical reactions of interest faces a daunting task.
A number of computerized technologies exist to aid a chemical researcher in completing this task; however, they are generally of limited value. For example, some technologies are able to search through specific types of documents, such as Microsoft Word documents, for specific types of images of molecules, such as Object Linking and Embedding (OLE) images. Typically, however, these technologies are unable to systematically process multiple heterogeneous types of documents. In addition, because these technologies search for only specific image types, their approach is crude and generally results in many molecule images being missed.
As such, a need exists for improved systems, methods, and apparatus for processing documents to identify structures, such as chemical structures, contained therein.