1. Field of the Invention
The present invention pertains generally to the field of computer-implemented Information Retrieval (IR) including web browsing and, more particularly, to information retrieval using metadata. The present invention further pertains to the indexing operation of a search engine.
2. Description of Related Art
IR is the science of searching for documents, for information within documents, and for metadata about documents, as well as searching relational databases and the World Wide Web. IR is interdisciplinary and is based on computer science, mathematics, cognitive psychology, linguistics and statistics, among other disciplines. Many universities and public libraries use IR systems to provide access to books, journals and other documents.
Web search engines are the most visible IR applications. A search engine generally operates in the order of web crawling, indexing, and searching. The purpose of storing an index is to optimize speed and performance in finding relevant documents in response to a search query. Without an index, the search engine would have to scan every document in the corpus, which would require considerable time and computing power. The additional computer storage required to store the index, as well as the considerable increase in the time required for an update to take place, are traded off for the time saved during information retrieval. Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval. Popular engines focus on the full-text indexing of online, natural language documents. Meta search engines reuse the indices of other services and do not store a local index, whereas cache-based search engines permanently store the index along with the corpus. Unlike full-text indices, partial-text services restrict the depth indexed in order to reduce index size. Larger services typically perform indexing at a predetermined time interval due to the required time and processing costs, whereas agent-based search engines index in real time.
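The trade-off described above can be sketched with a minimal inverted index, the data structure most full-text indexers build. The corpus, function names, and document identifiers below are illustrative, not part of any particular search engine's implementation:

```python
from collections import defaultdict

def build_inverted_index(corpus):
    """Map each term to the set of document ids containing it.
    `corpus` is a hypothetical dict of {doc_id: text}; production
    indices add positions, frequencies, and compression on top."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Intersect posting sets so only documents containing every query
    term are returned, avoiding a scan of the whole corpus."""
    postings = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*postings) if postings else set()

corpus = {
    1: "web search engines index documents",
    2: "an index speeds up search",
    3: "documents without an index require a full scan",
}
idx = build_inverted_index(corpus)
hits = search(idx, "index search")  # documents containing both terms
```

Building `idx` costs storage and update time up front; answering a query then touches only the posting sets for the query terms, which is the time saving the paragraph above refers to.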
While the search indices of most internet access and search engine providers are proprietary, some companies, such as Yahoo! Corporation of Santa Clara, Calif., permit access to a version of their search index. Build your Own Search Service (BOSS) is an initiative by Yahoo!™ to provide an open search web services platform. The goal of BOSS is to give developers free access to the search index of this internet access provider.
A related area is the field of Latent Semantic Indexing (LSI). LSI is an indexing and retrieval method that uses a mathematical technique called Singular Value Decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text. The method is called LSI because of its ability to correlate semantically related terms that are latent in a collection of text. The method, also called Latent Semantic Analysis (LSA), uncovers the underlying latent semantic structure in the usage of words in a body of text, and this structure can be used to extract the meaning of the text in response to user queries. Queries against a set of documents that have undergone LSI will return results that are conceptually similar in meaning to the search criteria even if the results do not share a specific word or words with the search criteria. Because LSI uses a strictly mathematical approach, it is inherently independent of language. This enables LSI to elicit the semantic content of information written in any language without requiring the use of auxiliary structures, such as dictionaries and thesauri.
LSI is not restricted to working only with words. It can also process arbitrary character strings. LSI uses common linear algebra techniques to learn the conceptual correlations in a collection of text. In general, the process involves constructing a weighted term-document matrix, performing an SVD on the matrix, and using the resulting decomposition to identify the concepts contained in the text. LSI begins by constructing a term-document matrix, X, to identify the occurrences of the m unique terms within a collection of n documents. In a term-document matrix, each term is represented by a row, and each document is represented by a column, with each matrix cell, xij, initially representing the number of times the associated term appears in the indicated document. This matrix is usually very large and very sparse. Once a term-document matrix is constructed, local and global weighting functions can be applied to it to condition the data. For example, the weighting functions transform each cell, xij of X, to be the product of a local term weight, which describes the relative frequency of a term in a document, and a global weight, which describes the relative frequency of the term within the entire collection of documents. Dynamic clustering based on the conceptual content of documents is one of the uses of LSI. Clustering is a way to group documents based on their conceptual similarity to each other without using example documents to establish the conceptual basis for each cluster. This is useful when dealing with an unknown collection of unstructured text.
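The first two steps of this process, constructing the term-document matrix X and conditioning it with local and global weights, can be sketched as follows. The documents and the particular weighting (logarithmic local weight times an inverse-document-frequency global weight) are illustrative choices; LSI systems often use log-entropy weighting instead, and the subsequent SVD step would require a linear-algebra library, so it is omitted here:

```python
import math
from collections import Counter

def term_document_matrix(docs):
    """Build the m x n matrix X described above: rows are the m unique
    terms, columns the n documents, and cell X[i][j] is the raw count
    of term i in document j."""
    counts = [Counter(d.lower().split()) for d in docs]
    terms = sorted({t for c in counts for t in c})
    X = [[c[t] for c in counts] for t in terms]
    return terms, X

def weight(X):
    """Replace each raw count x_ij with local_weight * global_weight.
    Local weight: log(1 + count), a damped within-document frequency.
    Global weight: log(n / df), reflecting the term's rarity across
    the whole collection (one common choice, not the only one)."""
    n = len(X[0])
    W = []
    for row in X:
        df = sum(1 for x in row if x > 0)  # documents containing the term
        g = math.log(n / df)               # global weight
        W.append([math.log(1 + x) * g for x in row])
    return W

docs = ["the cat sat", "the cat ran", "dogs ran fast"]
terms, X = term_document_matrix(docs)
W = weight(X)
```

Note that a term appearing in every document receives a global weight of zero, which is exactly the conditioning effect intended: ubiquitous terms carry little conceptual information.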
Both IR and LSI are established and actively researched fields.
In IR, retrieving documents from a corpus of data in response to a user query is a known process, around which a great deal of literature exists. Literature in the IR field can be broadly broken into four categories:
1. Text matching algorithms, which estimate the relevancy of a certain term to a certain document, treating each term and document in isolation. The term frequency-inverse document frequency (TF-IDF) algorithm, as described in Salton and McGill, 1983, “Introduction to modern information retrieval,” is one example of a text matching algorithm.
2. Document importance algorithms, which consider domain-specific heuristics to estimate the documents the user would find to be of higher value, independent of text matching algorithms. Sergey Brin and Larry Page's seminal paper, “The Anatomy of a Large-Scale Hypertextual Web Search Engine,” covers PageRank™, which is one algorithm in this category. PageRank™ is a registered trademark of Google, Inc. of Mountain View, Calif.
3. Semantic association algorithms, which infer associations between and amongst terms and documents in order to extract meaning from the corpus. U.S. Pat. No. 4,839,853, “Computer information retrieval using latent semantic structure,” presents one such method.
4. Search using user-generated metadata, which can be further subdivided into: (a) weighted search algorithms and (b) dynamic re-writing.
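Category 2 above can be illustrated with a minimal power-iteration sketch of a PageRank-style algorithm: a page's importance is the damped sum of importance flowing in from the pages that link to it, entirely independent of any text match. The toy link graph and function below are illustrative, not the algorithm as deployed by any search engine:

```python
def pagerank(links, damping=0.85, iters=50):
    """Power-iteration sketch of a link-based importance score.
    `links` maps page -> list of outbound links; a dangling page
    (no outbound links) distributes its score evenly to all pages."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        # every page keeps a baseline (1 - d)/n, plus damped inflow
        new = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            targets = outs if outs else pages  # dangling page
            share = damping * rank[p] / len(targets)
            for q in targets:
                new[q] += share
        rank = new
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
r = pagerank(links)
```

In this toy graph, page "c" is linked to by both "a" and "b" and accordingly ends up with the highest score, regardless of what text any of the pages contains.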
Weighted search algorithms, in category 4(a), pre-define a set of search topics, and assign each user a vector which designates the user's interest level in each of these topics. Searches are effectively performed in parallel across several vertical search engines, one per topic, and the user's interest vector determines the weight given to results from each of these topic-specific searches. Qiu and Cho's paper, “Automatic Identification of User Interest for Personalized Search,” describes this arrangement in detail. This paper studies how a search engine can learn a user's preference automatically based on her past click history and how it can use this preference to personalize the search results.
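The weighted-combination step of category 4(a) can be sketched as follows. The topics, documents, scores, and interest vector are all illustrative, chosen only to show how the same per-topic results re-rank differently for users with different interest vectors:

```python
def personalized_scores(topic_results, interest):
    """Blend per-topic vertical search results: a document's final
    score is the interest-weighted sum of its topic-specific scores.
    `topic_results` maps topic -> {doc: score}; `interest` maps
    topic -> weight (the user's interest vector)."""
    combined = {}
    for topic, results in topic_results.items():
        w = interest.get(topic, 0.0)
        for doc, score in results.items():
            combined[doc] = combined.get(doc, 0.0) + w * score
    # return documents ordered by blended score, best first
    return sorted(combined, key=combined.get, reverse=True)

topic_results = {
    "autos":   {"jaguar-cars.html": 0.9, "big-cats.html": 0.1},
    "animals": {"big-cats.html": 0.8, "jaguar-cars.html": 0.2},
}
wildlife_fan = personalized_scores(topic_results, {"autos": 0.2, "animals": 0.8})
car_fan = personalized_scores(topic_results, {"autos": 0.9, "animals": 0.1})
```

The same underlying vertical results produce opposite rankings for the two interest vectors, which is the personalization effect the paper studies.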
Dynamic re-writing, category 4(b), builds a richer model of user behavior, and uses this information first to disambiguate and refine search queries, and second to re-rank the search results. An intermediary layer is added in between the user and the search engine that adds differentiating terms to the query, and dynamically re-orders the search results as returned by the search engine. This query-time method is useful for query disambiguation at a coarse level, for example disambiguating between jaguar the animal and Jaguar™ the car manufacturer, but as it exists and operates outside the search engine, the method fails to capture, represent or use the rich index-time context. Jaguar™ is a registered trademark of Jaguar Land Rover North America LLC of Mahwah, N.J.
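The two operations of such an intermediary layer, query rewriting and result re-ranking, can be sketched as below. The user-context terms, result snippets, and overlap-based re-ranking heuristic are all illustrative assumptions, not any particular system's method:

```python
def rewrite_query(query, user_context):
    """Append differentiating terms from a (hypothetical) user model
    before the query reaches the search engine, e.g. turning the
    ambiguous 'jaguar' into 'jaguar animal wildlife' for a user whose
    model suggests an interest in wildlife."""
    extra = [t for t in user_context if t not in query.split()]
    return query + (" " + " ".join(extra) if extra else "")

def rerank(results, user_context):
    """Re-order the engine's returned results, promoting pages whose
    snippets overlap the user context (a simple overlap count stands
    in for a real behavioral model)."""
    ctx = set(user_context)
    return sorted(results,
                  key=lambda r: len(ctx & set(r["snippet"].split())),
                  reverse=True)

user_context = ["animal", "wildlife"]
q = rewrite_query("jaguar", user_context)
results = [
    {"url": "cars.example", "snippet": "jaguar car dealership"},
    {"url": "zoo.example",  "snippet": "jaguar animal wildlife habitat"},
]
top = rerank(results, user_context)
```

Because both steps operate on the query string and the returned result list, the sketch also makes the stated limitation concrete: nothing here can reach the index-time context held inside the search engine itself.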
Measuring user interest is another known area of technology. Both implicit tools, as summarized in D. Kelly and J. Teevan's “Implicit Feedback for Inferring User Preference: A Bibliography,” and explicit tools such as StumbleUpon, which allow users to manually rate pages, are used to measure user interest. StumbleUpon is a proprietary freeware service built around an Internet community that allows its users to discover and rate web pages, photos, and videos. It is a personalized recommendation engine that uses peer and social-networking principles. StumbleUpon chooses which web page to display based on the user's ratings of previous pages, ratings by his/her friends, and the ratings of users with similar interests. Users can rate a web page with a thumbs-up or thumbs-down.
Suggesting potential “friends” and contacts based on user interests is another known technique, employed by services such as Last.fm™, which is a registered trademark of CBS Interactive Inc. of San Francisco, Calif.
Inferring direct user needs and tracking goals is another known area, covered in papers such as Chi, Pirolli, Chen and Pitkow's “Using Information Scent to Model User Information Needs and Actions on the Web.” In this paper, the authors describe two computational methods for understanding the relationship between user needs and user actions. First, for a particular pattern of surfing, they seek to infer the associated need. Second, given an information need, and some pages as starting points, they attempt to predict the expected surfing patterns. The algorithms use a concept called “information scent,” which is the subjective sense of value and cost of accessing a page based on perceptual cues.