A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the xerographic reproduction by anyone of the patent document or the patent disclosure in exactly the form it appears in the U.S. Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
The present invention relates to the field of information extraction and storage and more specifically to techniques for managing a distributed information acquisition and information storage process.
There has been and will continue to be an explosion in the volume and complexity of information available to information consumers. However, due to the magnitude of disparate information available in the public domain, information consumers are typically able to access, comprehend, and meaningfully use only a very small percentage of the available information. This is primarily because the information is typically buried in articles which may be contained in magazines, journals, papers, newspapers, books, notebooks, etc. or is stored in digital format in information stores such as databases, digital libraries, etc. Unless otherwise stated, the term "article" as used in this application should be construed to include any transcribed or printed information, or information available in digital format, or combinations or portions thereof. The information in an article may include text, graphics, charts, audio information, video information, multimedia information, and other types of information in various formats. An article may be published or unpublished. Since these articles could number in the hundreds and thousands, they cannot all be accessed, read, and understood by an information consumer in a practical timeframe. While several data warehousing techniques have been used to integrate information from various articles, these techniques are not flexible enough to keep up with the proliferation of available information. They also rarely help with the information overload problem. In fact, by aggregating data, these data warehousing techniques often make the information overload problem worse.
One field that has seen a tremendous explosion of information in the past decade is the life sciences field, which has benefited from the exponential growth in the identification and functional characterization of genes in the biological sciences. A decade ago a laboratory notebook was often sufficient for "data warehousing." A researcher could rely on his or her deep understanding of a handful of genes to make informed decisions regarding his or her research. Today, the influx of information and the blurring of traditional biological research boundaries have outstripped the ability of a researcher to fully assimilate, synthesize, and evaluate research data. The primary impediment for a researcher is not the lack of information; rather, it is the large quantity of information and the unstructured format in which it is stored. To evaluate results of large-scale experiments, researchers rely heavily on published research literature to identify the key information that is critical for them to make informed decisions. The vast number of articles, the unstructured format of the information, and the inability of the researchers to query on specific experimental results dictate that the review of the literature may take several days, weeks, or even more of a researcher's time. In addition to being very time intensive, the accumulation of knowledge by the researcher is not easily transferable to other researchers because it is not in an easily accessible format.
Based on the above, there is a need for techniques which can extract information from the various sources and store it in a format which can be easily accessed or queried by an information consumer. It is also desirable that the techniques be flexible enough to keep pace with the proliferation of information. Further, it is also desirable that the techniques be adaptable to extract and store information related to various domains and fields.
The present invention provides techniques for extracting information from a plurality of articles and for storing the extracted information in an information store. According to an embodiment, the present invention identifies a plurality of articles from which information is to be extracted. The present invention also identifies a plurality of information extractors for extracting information from the plurality of articles. A database is provided for storing information related to the plurality of articles and the plurality of information extractors. According to this embodiment, the present invention assigns the plurality of articles to the plurality of information extractors for information extraction. The present invention receives information extracted by an information extractor from an article assigned to the information extractor. The extracted information is then stored in the information store.
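By way of illustration only, the assignment and storage flow summarized above may be sketched as follows. The round-robin assignment policy, the status values, and the names Assignment, assign_articles, and receive_extraction are assumptions introduced for this example and are not taken from the disclosure; the tracking database and the information store are modeled as plain dictionaries.

```python
from dataclasses import dataclass, field
from itertools import cycle


@dataclass
class Assignment:
    """One row of the (hypothetical) tracking database."""
    article_id: str
    extractor_id: str
    status: str = "assigned"  # assumed states: "assigned", "received"
    facts: list = field(default_factory=list)


def assign_articles(article_ids, extractor_ids):
    """Pair each article with an extractor, round-robin (an assumed policy)."""
    if not extractor_ids:
        raise ValueError("at least one information extractor is required")
    pool = cycle(extractor_ids)
    return {a: Assignment(a, next(pool)) for a in article_ids}


def receive_extraction(tracking_db, information_store,
                       article_id, extractor_id, facts):
    """Accept facts only from the extractor the article was assigned to,
    then store them in the information store."""
    assignment = tracking_db[article_id]
    if assignment.extractor_id != extractor_id:
        raise PermissionError("article is not assigned to this extractor")
    assignment.facts = list(facts)
    assignment.status = "received"
    information_store.setdefault(article_id, []).extend(assignment.facts)
```

In this sketch, an extraction submitted by an extractor other than the assigned one is rejected, which is one possible way to keep the tracking database and the information store consistent.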
According to an embodiment of the present invention, the information store is a knowledge base which is configured to store the extracted information according to an ontology. In this embodiment, information may be extracted from articles using a fact-based model.
According to another embodiment, the present invention enables quality control processing to be performed on the information extracted by the information extractor before the extracted information is stored in the information store. According to this embodiment, the present invention enables a content reviewer to review the extracted information received from the information extractor. The present invention may receive information from the content reviewer identifying errors associated with the extracted information.
According to an embodiment, the present invention determines, from the information received from the content reviewer, an error count indicating the number of errors in the extracted information received from the information extractor. If the error count is above a threshold error count level, the article may be reassigned to the information extractor for information extraction. If the error count is equal to or below the threshold error count level, the present invention may provide services enabling the content reviewer to change the extracted information received from the information extractor to correct the errors.
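The threshold decision described above may be sketched as follows. The function name route_after_review and the returned labels are hypothetical, and the error count is assumed to be the number of error entries reported by the content reviewer.

```python
def route_after_review(reported_errors, threshold):
    """Return (error_count, next_step) for a reviewed extraction.

    reported_errors: list of error entries supplied by the content reviewer.
    threshold: the threshold error count level.
    """
    error_count = len(reported_errors)
    if error_count > threshold:
        # Above the threshold: send the article back to the extractor.
        return error_count, "reassign_to_extractor"
    # At or below the threshold: the reviewer may correct the errors directly.
    return error_count, "reviewer_may_correct"
```

Note that an error count exactly equal to the threshold is handled by the content reviewer rather than by reassignment, consistent with the paragraph above.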
According to another embodiment, the present invention calculates the compensation due to information extractors for extracting information from the articles. The compensation amount for an information extractor may be calculated based on several criteria such as the number of errors in the information extracted by the information extractor, a quality score assigned to the article, and other metrics information captured during quality control processing.
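The disclosure does not give a formula at this point, so the following is only one hypothetical way such criteria could be combined. The base rate, the per-error penalty, the range of the quality score, and the name compensation are all assumptions made for illustration.

```python
def compensation(base_rate, error_count, quality_score, penalty_per_error=1.0):
    """Hypothetical compensation rule (not taken from the disclosure).

    base_rate: assumed payment for a flawless extraction of one article.
    error_count: number of errors found during quality control.
    quality_score: assumed to lie in [0.0, 1.0], assigned to the article.
    penalty_per_error: assumed deduction for each error.
    """
    if not 0.0 <= quality_score <= 1.0:
        raise ValueError("quality_score is assumed to lie in [0.0, 1.0]")
    if error_count < 0:
        raise ValueError("error_count cannot be negative")
    amount = base_rate * quality_score - penalty_per_error * error_count
    # Never charge the extractor: floor the result at zero.
    return max(amount, 0.0)
```

Under these assumptions, a base rate of 20.0, a quality score of 0.5, and two errors at a penalty of 1.0 each would yield 20.0 x 0.5 - 2 x 1.0 = 8.0.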
According to yet another embodiment, the information store is configured to store the extracted information according to an information model. In this embodiment, the present invention allows reviewers to review the extracted information and make changes, if any, to the information model to accommodate the extracted information. In this embodiment, the present invention may allow a reviewer to review the extracted information and new concepts introduced by the extracted information and to provide information identifying changes, if any, to be made to the information model. According to a specific embodiment, the information provided by the reviewer may then be reviewed by a second reviewer. After the second reviewer has approved of the changes, the information model may be changed. In a specific embodiment, the information store is a knowledge base which is configured to store the extracted information according to an ontology. The present invention provides services enabling ontologists to review new concepts and to make changes to the ontology to accommodate the new concepts. Other information models may also be used in conjunction with the present invention.
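The two-stage review of information model changes described above may be sketched as follows. The information model is represented here, purely as an assumption, as a dictionary mapping a parent concept to its set of child concepts; the names ModelChange, propose_change, and approve_and_apply are hypothetical, as is the rule that the second reviewer must differ from the first.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelChange:
    """A proposed addition of a new concept under an existing parent."""
    concept: str
    parent: str
    proposed_by: str
    approved_by: Optional[str] = None


def propose_change(concept, parent, reviewer):
    """First reviewer records a change needed to accommodate a new concept."""
    return ModelChange(concept=concept, parent=parent, proposed_by=reviewer)


def approve_and_apply(model, change, second_reviewer):
    """Second reviewer approves the change; only then is the model altered."""
    if second_reviewer == change.proposed_by:
        # Assumed rule: approval must come from a different reviewer.
        raise ValueError("the second reviewer must differ from the proposer")
    change.approved_by = second_reviewer
    model.setdefault(change.parent, set()).add(change.concept)
    return model
```

In this sketch the information model is left unchanged until the second reviewer approves, which mirrors the ordering stated in the specific embodiment.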