As modern society evolves, more and more textual information is being stored as electronic data. The textual information can be derived from electronic copies of scientific articles, patents, web pages, and electronic books, among other sources. Textual information can also be included in graphics or metadata associated with images and videos.
Accordingly, the amount of textual electronic data continues to grow. Some say the growth is exponential. For example, MEDLINE, which is a database of medical publications from more than 4600 journals, currently has a total of more than 16 million documents. Literature is an important source of knowledge and, in certain fields (such as medicine), it is the only comprehensive source. In the field of molecular biology, for example, protein interactions play a significant role. The evidence for a majority of protein interactions is present only in the literature. On the one hand, literature is an important part of research for biomedical researchers, and on the other hand, the researchers are overwhelmed with the number of documents being published. This rapid growth of data, while being overwhelmingly encouraging for researchers, creates a number of new problems. For example, large quantities of data can be hard for any group of people to sort and classify, let alone for one person to read and understand. In other words, it is becoming ever more difficult for any one person or even groups of people to keep abreast of all the developments in just a single field or with respect to a particular subject matter or event.
There are currently many systems and methods available that facilitate the capture and creation of this data. The data is often then stored in a database structure or a spreadsheet, which are collectively referred to herein as databases or information sources regardless of the type, system, or structure. The data stored in databases can be formatted in a variety of ways, including the XML format.
In addition, people commonly use other systems specifically designed to search for and retrieve data from the databases. These searching systems are often referred to as search engines. Some well-known examples of search engines include Google® of Google Corporation of Mountain View, Calif., Baidu® of Baidu.com Corporation of Beijing, China, Yahoo!® and AltaVista® of Yahoo! Corporation of Sunnyvale, Calif., and MSN® and Windows Live® of Microsoft Corporation of Redmond, Wash.
More recently, researchers have begun focusing their attention on other types of systems, called information extraction systems. Such systems and methods involve the extraction of information from natural language text. Information, as referenced in these systems, comprises entities and relationships between the entities. For the information extraction systems to operate, the systems must have access to information in a compatible electronic format. Web pages and other electronic documents are examples of where electronic versions of such information can be found today. For example, the website www.Pubmed.gov can be used to access a database of electronic medical information. As another example, commonly assigned International Application No. PCT/US08/60984, filed Apr. 21, 2008, discusses how to infer information and profile knowledge based on the associative relevancy of information to other information. Accordingly, International Application No. PCT/US08/60984 is incorporated by reference herein in its entirety, especially the portions regarding inferring information and knowledge profiling architectures that permit the capture and use of associations between known reference sets.
In many instances, information extraction can be generally described as ways of creating a structured database from natural language text documents, so that traditional knowledge discovery and data mining techniques can be efficiently applied. However, information extraction is complicated by the fact that natural languages are flexible, and words can take on different meanings depending upon their context (i.e., how the words are used in a sentence, paragraph and/or document). This problem is further compounded by the fact that the same thing can be said in different ways.
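By way of illustration only, the idea of turning natural language text into structured records of entities and relationships can be sketched in Python as follows. The sketch is not drawn from any particular information extraction system. The surface patterns, the relation name "interacts_with", the table layout, and the sample sentences are all hypothetical assumptions chosen to show how several phrasings may be mapped to one relationship.

```python
import re
import sqlite3

# Hypothetical surface patterns. Each one maps a different phrasing to the
# same illustrative relationship, "interacts_with".
PATTERNS = [
    re.compile(r"(?P<a>\w+) interacts with (?P<b>\w+)"),
    re.compile(r"(?P<a>\w+) binds to (?P<b>\w+)"),
    re.compile(r"interaction between (?P<a>\w+) and (?P<b>\w+)"),
]


def extract(text):
    """Return (subject, predicate, object) rows found in the text."""
    rows = []
    for sentence in text.split("."):
        for pattern in PATTERNS:
            for match in pattern.finditer(sentence):
                rows.append((match.group("a"), "interacts_with", match.group("b")))
    return rows


# Illustrative input: the same kind of fact stated in two different ways.
text = "RAD51 interacts with BRCA2. We observed an interaction between TP53 and MDM2."

# Store the extracted rows in a structured (relational) form.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE relation (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany("INSERT INTO relation VALUES (?, ?, ?)", extract(text))

rows = conn.execute(
    "SELECT subject, object FROM relation ORDER BY subject"
).fetchall()
print(rows)
```

Fixed patterns of this kind cover only the phrasings that were anticipated and do not resolve words whose meaning depends on context, which is the difficulty described above.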
Some information extraction systems require a user to provide an initiating concept (concept “A”). Such a system is sometimes referred to as an open discovery system. An open discovery system retrieves the document set previously categorized by a user (i.e., manually) as being related to concept A, by querying a database (such as the one accessed via www.Pubmed.gov). The open discovery system then tries to compute the set of frequently occurring concepts (concepts B) from the document set, and then retrieves the document set for each of the frequent B terms by querying the database to compute the set of C terms. One limitation of the open discovery system is a lack of scalability if a user is interested in hypotheses related to a broad category, like “Disease or Syndrome,” or if a large group of users exists, because such instances require sending several hundred queries for each initiating concept. Another, more apparent limitation is that the set of hypotheses generated by the system is limited to only the initiating concepts A supplied by a group of users. In other words, an open system may omit some interesting hypotheses, because it does not have one of the initiating concepts.
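The A, B, C procedure described above can be sketched in Python as a minimal illustration under stated assumptions. The in-memory corpus, the concept names, the function names, and the frequency threshold `min_count` are all hypothetical. The `query` function stands in for a call to a real database, and the choice to exclude A and the B terms from the C terms is an assumption of this sketch, not something stated above.

```python
from collections import Counter

# Hypothetical toy corpus: each document is reduced to the set of concepts
# it mentions. A real system would obtain these from a literature database.
CORPUS = [
    {"fish oil", "blood viscosity"},
    {"fish oil", "blood viscosity", "platelet aggregation"},
    {"fish oil", "platelet aggregation"},
    {"blood viscosity", "raynaud disease"},
    {"blood viscosity", "raynaud disease"},
    {"platelet aggregation", "raynaud disease"},
    {"platelet aggregation", "raynaud disease"},
]


def query(corpus, concept):
    """Stand-in for one database query: documents that mention the concept."""
    return [doc for doc in corpus if concept in doc]


def frequent_concepts(docs, exclude, min_count):
    """Concepts appearing in at least min_count documents, minus exclusions."""
    counts = Counter(c for doc in docs for c in doc if c not in exclude)
    return {c for c, n in counts.items() if n >= min_count}


def open_discovery(corpus, a, min_count=2):
    """Return ({C term: linking B terms}, number of queries issued)."""
    queries = 1  # one query for the initiating concept A
    b_terms = frequent_concepts(query(corpus, a), {a}, min_count)

    hypotheses = {}
    for b in sorted(b_terms):
        queries += 1  # one additional query per frequent B term
        b_docs = query(corpus, b)
        # Assumption: A and the B terms themselves are not reported as C terms.
        for c in frequent_concepts(b_docs, {a} | b_terms, min_count):
            hypotheses.setdefault(c, set()).add(b)
    return hypotheses, queries


hypotheses, queries = open_discovery(CORPUS, "fish oil")
print(hypotheses, queries)
```

In this sketch the number of queries is one plus the number of frequent B terms for each initiating concept, so a broad category that expands into many initiating concepts, or a large group of users, multiplies the query load accordingly. The sketch also only produces hypotheses that start from the concept A it is given.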