Much of the world's information is in the form of text, and the majority of that text is unstructured (without metadata). Much of this textual data is available in digital form, either originally created in that form or converted to it, for example by means of optical character recognition (OCR). This text is stored and made available via the Internet or other networks. These advances have made it possible to investigate, retrieve, extract and categorize information contained in vast repositories of documents, files, or other text “containers.” The proliferation of documents available in electronic form, and of techniques for processing them, has resulted in a need for tools that facilitate wading through the ever-increasing expanse of documents. One such tool is information extraction software, which typically applies a text analysis process to electronic documents written in a natural language and populates a database with information extracted from those documents. Applied to a given textual document, the process of information extraction (IE) identifies and lists entities of predefined types appearing within the text. IE may also be applied to extract other words, terms, strings of words, or phrases.
“Term” refers to single words or strings of highly related or linked words or noun phrases, e.g., “New York Stock Exchange,” “free market,” “President Bush” or “health program.” “Term extraction,” also called term recognition or term mining, is a type of IE process used to identify and extract relevant terms from a given document, i.e., terms that appear in, and therefore have some relevance to, the content of the document. Techniques employed in term extraction may include linguistic or grammar-based techniques, natural language or pattern recognition, tagging or structuring, data visualization and predictive formulas. For example, all names of companies mentioned in the text of a document can be identified, extracted and listed. Similarly, names of people, products, countries, organizations, geographic locations, etc., are additional examples of “entity” type terms that are identified and included on a list. This IE process may be referred to as “named entity extraction” or “named entity recognition.” There are a variety of methods available for automatic named entity extraction, including linguistic or semantic processors that identify likely noun phrases based on known terms or applied syntax. Filtering may be applied to discern true entities from unlikely entities. The output of the IE process is a list of the entities of each type and may include pointers to all occurrences or locations of each entity in the text. The IE process does not rank the entities. Thus, suppose a document on the merger of AOL and Time Warner happens to also mention Sony. Applying IE to this document, all three companies (AOL, Time Warner and Sony) would be listed identically, even though Sony is clearly less “central” or “relevant” to this text. Often the terms are then compared against a collection of documents or “corpus” to determine the relevancy of each term to the document.
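The unranked output described above can be illustrated with a minimal sketch. It assumes a small hand-built list of known company names (a hypothetical gazetteer named KNOWN_ENTITIES) and a made-up sample sentence; it is not the method of any particular IE product, and real systems typically rely on linguistic or statistical processing rather than a fixed list. The function name extract_entities is likewise an assumption introduced for illustration.

```python
import re

# Hypothetical gazetteer of known entities, grouped by type (illustration only).
KNOWN_ENTITIES = {
    "company": ["AOL", "Time Warner", "Sony"],
}

def extract_entities(text, gazetteer=KNOWN_ENTITIES):
    """Return {type: {entity: [start offsets]}} for each known entity found in text."""
    found = {}
    for entity_type, names in gazetteer.items():
        for name in names:
            # Match the name as a whole word or phrase, not inside a longer word.
            pattern = r"\b" + re.escape(name) + r"\b"
            offsets = [m.start() for m in re.finditer(pattern, text)]
            if offsets:
                found.setdefault(entity_type, {})[name] = offsets
    return found

# Made-up sample text for illustration.
sample = "AOL agreed to merge with Time Warner. Analysts compared AOL with Sony."
result = extract_entities(sample)

# Each entity is listed with pointers to its occurrences, but with no ranking.
for name, offsets in result["company"].items():
    print(name, offsets)
```

The result lists all three companies in the same way, each with the character offsets of its occurrences, and carries no indication that Sony is less central to the text than AOL or Time Warner.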
These tools allow businesses to discover relevant information buried in massive volumes of text-based materials, e.g., documents, emails, letters, articles, and books, thereby making it possible for businesses or users of such tools to identify and group relevant information and to make knowledge-based decisions. Tools that extract information that may not otherwise be discernible benefit many entities. Such entities include media and other content-based concerns, information technology delivery concerns, professional services and resource providers, and searching concerns, and in particular the researchers, professionals, executives, marketing analysts, campaign strategists, and others involved with such concerns. For example, a news service can use intelligent agents to monitor feeds and to automatically perform IE functions on the processed information. Predefined search terms or schema may be applied to such information to rapidly identify and deliver new items or articles of interest that satisfy some search or other user-defined criteria.
Examples of Information Extraction software include OpenCalais from Thomson Reuters; AlchemyAPI; CRF++; LingPipe; TermExtractor; TermFinder; and TextRunner. IE may be a separate process or a component or part of a larger process or application, such as business intelligence software. For instance, IBM has a business intelligence solution, Intelligent Miner For Text, that includes an information extraction function which extracts terms from unstructured text. Additional features include clustering, summarization, and categorization. These features analyze, for example, data accessible online or stored in traditional files, relational databases, flat files, and data warehouses or marts. Functions may include statistical analysis and mining techniques such as factor analysis, linear regression, principal component analysis, univariate curve fitting, univariate statistics, bivariate statistics, and logistic regression.
One rudimentary method of determining relevancy within a single document is a simple count score, i.e., the number of times a term appears in the document. This is of limited value. What is needed is a more sophisticated way to determine the relevancy or importance of terms within a document based on the content of the document itself, and a more effective way to assign each individual term a degree of importance or relevancy to the document.
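A simple count score can be sketched as follows. The function name count_scores, the term list, and the sample text are assumptions made up for illustration; matching here is case-insensitive and on whole words or phrases, which is one reasonable choice among several.

```python
import re
from collections import Counter

def count_scores(text, terms):
    """Count case-insensitive whole-phrase occurrences of each term in text."""
    scores = Counter()
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        scores[term] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return scores

# Made-up sample text about a merger that also mentions a third company.
text = (
    "AOL and Time Warner announced a merger. AOL shareholders approved. "
    "Sony declined to comment; Sony shares rose."
)
scores = count_scores(text, ["AOL", "Time Warner", "Sony"])

# Print the terms from highest to lowest raw count.
for term, count in scores.most_common():
    print(term, count)
```

In this sample, Sony receives the same count as AOL and a higher count than Time Warner, even though the text concerns the merger of AOL and Time Warner. The raw count alone gives no way to tell these cases apart.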
Search engines retrieve documents in response to search terms. To this end, search engines may compare the frequency of terms that appear in one document against the frequency of those terms as they appear in other documents within the collection or corpus. This aids the search engine in determining the respective “importance” of the different terms within the document, and thus in determining the documents that best match a given query. One method for comparing terms appearing in a document against a collection of documents is called Term Frequency-Inverse Document Frequency (TFIDF). In this method, the term frequency (the count of a term as a fraction of all terms within a subject document) is multiplied by the inverse document frequency (the logarithm of the inverse of the fraction of documents in the corpus in which that term appears). More specifically, TFIDF assigns a weight, which is a statistical measure used to evaluate the importance of a word to a document in a collection of documents or corpus. The relative “importance” of the word increases proportionally to the number of times, or “frequency,” that the word appears in the document. That importance is offset by the frequency with which the word appears in the documents comprising the corpus. The inverse document frequency is expressed as log(N/n(q)), where q is the query term, N is the number of documents in the collection and n(q) is the number of documents containing q. TFIDF and variations of this weighting scheme are typically used by search engines, such as Google, as a way to score and rank a document's relevance given a user query. Generally, for each term included in a user query, the document may be ranked in relevance based on summing the scores associated with each term. The documents responsive to the user query may be ranked and presented to the user based on relevancy as well as other determining factors.
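The weighting and ranking described above can be sketched as follows, using the common formulation in which the term frequency within a document is multiplied by log(N/n(q)). This is one of many TFIDF variants and is not asserted to be the formula used by any particular search engine. The function names tf_idf, score_document and rank_documents, the natural-log base, and the tiny pre-tokenized corpus are all assumptions made for illustration.

```python
import math

def tf_idf(term, doc_tokens, corpus):
    """TFIDF weight of term in one document: (count / doc length) * log(N / n(q))."""
    if not doc_tokens:
        return 0.0
    tf = doc_tokens.count(term) / len(doc_tokens)
    n_q = sum(1 for tokens in corpus if term in tokens)  # documents containing the term
    if n_q == 0:
        return 0.0  # term absent from the corpus: no weight
    idf = math.log(len(corpus) / n_q)  # N / n(q), natural log (an assumption)
    return tf * idf

def score_document(query_terms, doc_tokens, corpus):
    """Score a document for a query by summing the TFIDF weight of each query term."""
    return sum(tf_idf(term, doc_tokens, corpus) for term in query_terms)

def rank_documents(query_terms, corpus):
    """Return document indices ordered from most to least relevant to the query."""
    scores = [score_document(query_terms, tokens, corpus) for tokens in corpus]
    return sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)

# Made-up, pre-tokenized corpus of three short documents.
corpus = [
    ["aol", "time", "warner", "merger"],
    ["sony", "merger", "news"],
    ["aol", "stock", "news"],
]

ranking = rank_documents(["sony", "merger"], corpus)
print(ranking)
```

A term that appears in every document of the corpus has n(q) equal to N, so its weight is log(1) = 0 and it contributes nothing to the ranking, which is the offsetting effect described above.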