1. Field of the Invention
The embodiments of the invention generally relate to information retrieval systems and, more particularly, to techniques for data searching in full text inverted list information retrieval systems.
2. Description of the Related Art
A taxonomy is a classification of things. For example, the well-known directory structure in most operating systems is a method to organize individual files into groups. In a full text index, the indexing takes advantage of the fact that many documents share identical tokens (e.g., words or characters). An inverted list index generally stores each unique token only once, even though the token may occur several times in the original set of documents. Therefore, an inverted list index can generally be seen as a form of compressing the set of documents. Typically, the compression ratio depends on the scope of the index. Generally, a basic inverted index simply records whether a term occurs within a document, but not how many times or where it occurs. A full inverted index typically records every occurrence of every token within every document. While a basic inverted index is more compact in terms of storage, it generally cannot support searches for sequences of tokens, or for the existence of tokens within a certain window of tokens. A full inverted index, however, generally allows such sophisticated searches. Between a basic inverted index and a full inverted index, there are various levels of information that can be stored within an inverted list for a term.
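The difference between a basic and a full inverted index can be illustrated with a minimal sketch. The corpus, the whitespace tokenizer, and all function names below are assumptions made for illustration only and are not taken from any particular system.

```python
from collections import defaultdict

# Hypothetical toy corpus; document ids and texts are made up for illustration.
DOCS = {
    1: "the quick brown fox",
    2: "the brown dog chased the fox",
}

def build_basic_index(docs):
    """Basic index: map each token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():  # assumed whitespace tokenization
            index[token].add(doc_id)
    return index

def build_full_index(docs):
    """Full index: map each token to {doc_id: [positions]} for every occurrence."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, token in enumerate(text.split()):
            index[token][doc_id].append(position)
    return index

def phrase_search(full_index, first, second):
    """Return ids of documents in which `second` directly follows `first`."""
    hits = set()
    first_postings = full_index.get(first, {})
    second_postings = full_index.get(second, {})
    for doc_id, positions in first_postings.items():
        following = set(second_postings.get(doc_id, []))
        if any(p + 1 in following for p in positions):
            hits.add(doc_id)
    return hits

basic = build_basic_index(DOCS)
full = build_full_index(DOCS)
```

In this sketch, the basic index can report that both documents contain "brown" and "fox", but only the positional information in the full index lets `phrase_search` establish that the sequence "brown fox" occurs in document 1 alone.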
With respect to inverted lists, one of the most well-known forms of an index is an index in a book. Almost every book has a generally alphabetical listing of words or sequences of words (e.g., section and chapter headers) at the end of the book, along with the page numbers where they are discussed. Using an index, one can avoid doing a page-by-page scan to find pages that contain certain words. An inverted list index in the context of information retrieval applications such as web search engines does exactly that. Abstractly, the web can be analogized to a book, and individual web documents represent the pages in the book. Building an inverted list index is performed by scanning all documents to be indexed and splitting them into tokens. This process, called parsing or tokenization, produces tokens that can be words in an English text document, Chinese characters, 4-byte numbers, etc.
A query against a full text index is equivalent to the intersection or join (depending on the query operators, e.g., AND or OR) of the inverted lists of all the query terms. The query result is therefore an inverted list itself. For each term of the query, an inverted list generally has to be accessed. The process of data mining involves extracting information such as patterns, relationships, etc. from a large corpus of data. Data miners (so-called annotators) typically operate on the corpus, usually document-by-document, and add metadata to the corpus. An entity can be understood as something that one refers to by many names or descriptions. An entity can be a person, an institution, an organization, a building, or a country. All of these have in common the notion that the same thing can be described in different languages, with different names or nicknames, or with varying short forms of its name. Therefore, an entity can also be generally expressed as a search query.
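The combination of inverted lists described above can be sketched as follows, assuming that each inverted list is a sorted list of document ids and that OR corresponds to a merge (union) of the lists. The posting data and function names are hypothetical.

```python
def intersect(a, b):
    """AND: walk two sorted posting lists, keeping ids present in both."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

def union(a, b):
    """OR: merge two sorted posting lists, keeping ids present in either."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            result.append(a[i])
            i += 1
        else:
            result.append(b[j])
            j += 1
    result.extend(a[i:])  # leftover ids from whichever list is not exhausted
    result.extend(b[j:])
    return result

# Hypothetical inverted lists: token -> sorted document ids.
POSTINGS = {"brown": [1, 2, 5], "fox": [2, 3, 5, 8]}

and_result = intersect(POSTINGS["brown"], POSTINGS["fox"])
or_result = union(POSTINGS["brown"], POSTINGS["fox"])
```

Both functions return a sorted list of document ids, so the result of a query is itself an inverted list that can be combined with further terms.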
The above concepts allow users to search for bags of words or mined entities. However, this is often not sufficient. Computer users typically tend to organize and group things together. Examples are file systems, which use directories to group related files, and mailing lists, which group email addresses together. The basic idea is that an operation can be performed on a group of things by referring to a single alias (i.e., the directory name or the name of the mailing list).
In a search application, a similar functionality is desirable. Instead of searching for documents that contain a group of specific terms, it is generally more efficient to index and search for the group using an alias. For example, all occurrences of politicians' names in documents may be grouped using a single term “politicians”. That way, one can efficiently search a corpus of documents without having to list all politicians individually. When searching for a group of things, it is generally useful not only to find documents that match the group, but also to know which entity is “hidden” behind an occurrence of the group name.
A first conventional solution to this problem is to query for a group such as “politicians” by querying individually for each politician in the group. However, this is generally unacceptable since the group may contain thousands, millions, or in some cases hundreds of millions of entries (for example, the group of all people's names), and the processing time in such a case can move from fractions of a second to days.
A second conventional solution is to create a new token corresponding to the group. However, this solution generally fails to provide important functionality. The user knows that documents in the result set reference a politician, but does not know which politician. For certain analytic applications, this approach is therefore also unacceptable. Relational databases are well-established tools for storing relational data. The containment of an entity in a group is a relation as well. However, relational databases are generally not suitable for building large scale text indices. Accordingly, there remains a need for a novel indexing technique that is capable of finding documents that contain entities belonging to a group and of determining which entity is “hidden” behind an occurrence of the group name.
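The trade-off between the two conventional solutions can be illustrated with a minimal sketch. The entity names, posting lists, and function names are hypothetical and serve only to show the behavior described above.

```python
# Hypothetical postings: entity name -> sorted list of document ids.
ENTITY_POSTINGS = {
    "alice adams": [1, 4],
    "bob brown": [2, 4],
    "carol chen": [7],
}
POLITICIANS = ["alice adams", "bob brown"]  # hypothetical group members

def query_by_expansion(group_members, postings):
    """First approach: one lookup per member, so cost grows with group size.

    Returns {doc_id: [matching entity names]}, so the entity stays visible.
    """
    hits = {}
    for name in group_members:
        for doc_id in postings.get(name, []):
            hits.setdefault(doc_id, []).append(name)
    return hits

def build_group_token(group_members, postings):
    """Second approach: a single posting list stored under the group alias.

    A query needs only one lookup, but the entity behind each hit is lost.
    """
    doc_ids = set()
    for name in group_members:
        doc_ids.update(postings.get(name, []))
    return sorted(doc_ids)

expanded = query_by_expansion(POLITICIANS, ENTITY_POSTINGS)
group_postings = build_group_token(POLITICIANS, ENTITY_POSTINGS)
```

Here `expanded` retains which entity matched in each document at the price of one lookup per group member, whereas `group_postings` answers the query with a single lookup but no longer indicates which politician a document mentions.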