Field of the Invention
The invention is related to information technology and, more particularly, to a search engine that utilizes the results of knowledge correlation to identify network and/or Internet resources significant to any given user question, subject, or topic of a digital information object.
Description of the Related Art
Search engines are widely acknowledged to be part of the Information Retrieval (IR) domain of knowledge. IR methods are directed to locating resources (typically documents) that are relevant to a question called a query. That query can take forms ranging from a single search term to a complex sentence composed in a natural language such as English. The collection of potential resources that are searched is called a corpus (body), and different techniques have been developed to search each type of corpus. For example, techniques used to search the set of articles contained in a digitized encyclopedia differ from the techniques used by a web search engine. Regardless of the techniques utilized, the core issue in IR is relevance—that is, the relevance of the documents retrieved to the original query. Formal metrics are applied to compare the effectiveness of the various IR methods. Common IR effectiveness metrics include precision, which is the proportion of relevant documents retrieved to all retrieved documents; recall, which is the proportion of relevant documents retrieved to all relevant documents in the corpus; and fall-out, which is the proportion of irrelevant documents retrieved to all irrelevant documents in the corpus. Post retrieval, documents deemed relevant are (in most IR systems) assigned a relevance rank, again using a variety of techniques, and results are returned. Although most commonly the query is submitted by—and the results returned to—a human being called a user, the user can be another software process.
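By way of illustration only, the three effectiveness metrics defined above can be computed from sets of document identifiers. The following is a minimal sketch; the function name `ir_metrics` and the sample identifiers are hypothetical, and the sketch assumes that every retrieved document belongs to the corpus.

```python
def ir_metrics(retrieved, relevant, corpus):
    """Return (precision, recall, fall_out) for one query.

    Assumes `retrieved` and `relevant` are subsets of `corpus`.
    """
    retrieved, relevant, corpus = set(retrieved), set(relevant), set(corpus)
    relevant_retrieved = retrieved & relevant      # relevant documents that were returned
    irrelevant = corpus - relevant                 # all irrelevant documents in the corpus
    irrelevant_retrieved = retrieved - relevant    # irrelevant documents that were returned

    # Each ratio falls back to 0.0 when its denominator would be zero.
    precision = len(relevant_retrieved) / len(retrieved) if retrieved else 0.0
    recall = len(relevant_retrieved) / len(relevant) if relevant else 0.0
    fall_out = len(irrelevant_retrieved) / len(irrelevant) if irrelevant else 0.0
    return precision, recall, fall_out


# Hypothetical ten-document corpus with four relevant documents.
corpus = [f"d{i}" for i in range(1, 11)]
relevant = ["d1", "d2", "d3", "d4"]
retrieved = ["d1", "d2", "d5"]

precision, recall, fall_out = ir_metrics(retrieved, relevant, corpus)
```

In this example two of the three retrieved documents are relevant (precision 2/3), two of the four relevant documents are retrieved (recall 1/2), and one of the six irrelevant documents is retrieved (fall-out 1/6).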
Text retrieval is a type of IR that is typically concerned with locating relevant documents which are composed of text, and document retrieval is concerned with locating specific fragments of text documents, particularly those documents composed of unstructured (or “free”) text.
The related knowledge domain of data retrieval differs from IR in that data retrieval is concerned with rapid, accurate retrieval of specific data items, such as records from a SQL database.
Information extraction (IE) is another type of IR, the purpose of which is the automatic extraction of information from unstructured (usually text) documents into data structures such as a template of name/value pairs. The information in such templates can subsequently be used to update, or be inserted into, a relational database.
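As a rough sketch of the idea, a sentence of free text can be reduced to a template of name/value pairs and the template then inserted into a relational table. The pattern, the field names (`name`, `year`, `city`), the `person` table, and the sample sentence are all hypothetical and chosen only for illustration; practical IE systems generally rely on far more robust linguistic analysis than a single regular expression.

```python
import re
import sqlite3

# Hypothetical pattern: "<First> <Last> was born in <year> in <City>"
BIRTH_PATTERN = re.compile(
    r"(?P<name>[A-Z][a-z]+ [A-Z][a-z]+) was born in "
    r"(?P<year>\d{4}) in (?P<city>[A-Z][a-z]+)"
)


def extract_template(text):
    """Return a dict of name/value pairs, or None if the pattern is absent."""
    match = BIRTH_PATTERN.search(text)
    if match is None:
        return None
    template = match.groupdict()
    template["year"] = int(template["year"])  # store the year as a number
    return template


def store_template(conn, template):
    """Insert one extracted template as a row of the (hypothetical) person table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS person (name TEXT, year INTEGER, city TEXT)"
    )
    conn.execute(
        "INSERT INTO person (name, year, city) VALUES (:name, :year, :city)",
        template,
    )


text = "According to the archive, Ada Lovelace was born in 1815 in London."
template = extract_template(text)

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
store_template(conn, template)
rows = conn.execute("SELECT name, year, city FROM person").fetchall()
```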
Search engines that have been described in the literature or released as software products use a number of forms of input, ranging from individual keywords, to phrases, sentences, paragraphs, concepts and data objects. Although the meanings of keyword, sentence, and paragraph conform to the common understanding of the terms, the meanings of phrase, concept, and data object vary by implementation. Sometimes, the word phrase is defined using its traditional meaning in grammar. In this use, types of phrases include Prepositional Phrases (PP), Noun Phrases (NP), Verb Phrases (VP), Adjective Phrases, and Adverbial Phrases. For other implementations, the word phrase may be defined as any proper name (for example “New York City”). Most definitions require that a phrase contain multiple words, although at least one definition permits even a single word to be considered a phrase. Some search engine implementations utilize a lexicon (a pre-canned list) of phrases. The WordNet Lexical Database is a common source of phrases.
When used in conjunction with search engines, the word concept generally refers to one of two constructs. The first construct is concept as a cluster of related words, similar to a thesaurus, associated with a keyword. In a number of implementations, this cluster is made available to a user via a Graphical User Interface (GUI) for correction and customization. The user can tailor the cluster of words until the resulting concept is most representative of the user's understanding and intent. The second construct is concept as a localized semantic net of related words around a keyword. Here, a local or public ontology and taxonomy is consulted to create a semantic net around the keyword. Some implementations of concept include images and other non-text elements.
Topics in general practice need to be identified or “detected” by applying a specific set of operations against a body of text. Different methodologies for identification and/or detection of topics have been described in the literature. Use of a topic as input to a search engine therefore usually means that a body of text is input, and a required topic identification or topic detection function is invoked. Depending upon the format and length of the resulting topic, an appropriate relevancy function can then be invoked by the search engine.
Data objects as input to a search engine can take forms ranging from a variable-length set of free-form sentences, to full-length text documents, to metadata documents such as XML documents. The Object Oriented (OO) paradigm dictates that OO systems accept objects as inputs. Some software function is almost always required to process the input object so that the subsequent relevance function of the search engine can proceed.
Ranked result sets have been the key to marketplace success for search engines. The current dominance of the Google search engine (a product of Google, Inc.) is due in large part to the PageRank system used in Google, which lets (essentially) the popularity of a given document dictate result rank. Popularity in the Google example applies to the number of links and to the preferences of Google users who input any given search term or phrase. These rankings permit Google to optimize searches by returning only those documents with ranks above a certain threshold (called k). Other methods used by web search engines to rank results include “Hubs & Authorities,” which counts links into and out of a given web page or document, Markov chains, and random walks.
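For illustration, link-based popularity ranking followed by a threshold cut-off can be sketched as below. This is a simplified, textbook-style power iteration over a hypothetical three-page link graph, not the ranking system actually deployed by any search engine. The damping factor of 0.85, the iteration count, the page names, and the threshold value assigned to `k` are all assumptions, and the sketch does not model user preferences.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Approximate link-popularity ranks by power iteration.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank equally to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


def above_threshold(rank, k):
    """Return pages whose rank exceeds k, highest rank first."""
    kept = [page for page, score in rank.items() if score > k]
    return sorted(kept, key=lambda page: rank[page], reverse=True)


# Hypothetical link graph: pages "a" and "b" link to "c", and "c" links to "a".
links = {"a": ["c"], "b": ["c"], "c": ["a"]}
rank = pagerank(links)
k = 0.1  # assumed threshold value
results = above_threshold(rank, k)
```

In this graph page "c" receives the most incoming links and therefore the highest rank, while page "b", which no page links to, falls below the assumed threshold and is omitted from the results.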