1. Field of the Invention
The invention relates to the field of interactive fuzzy-logic searches of string databases using a computer.
2. Description of the Prior Art
Traditional information systems return answers to a user after the user submits a complete query. Users often feel “left in the dark” when they have limited knowledge about the underlying data or the entities they are looking for, and have to use a try-and-see approach for finding information. A recent trend of supporting autocomplete in these systems is a first step towards solving this problem.
In information systems, users often cannot find the records or documents they are interested in because they do not know the exact representation of the entity in the backend repository. For example, a person named “John Lemmon” could be mistakenly represented as “Johnny Lemon”. A person called “William Gates” could be stored as “Bill Gates.” As a result, a user might not be able to find the records if he does not know the exact spelling of the name of the person he is looking for. The problem is especially acute when the data come from various Web sources, which tend to be heterogeneous and noisy.
What is needed is a method which allows information systems to answer queries even if there is a slight mismatch between the user query and the physical representation of the entities of interest.
What is needed is a method that can also allow users to interactively query the repository as they type in their query.
In order to give instant feedback when users formulate search queries, many information systems are supporting autocomplete search, which shows results immediately after a user types in a partial query. As an example, almost all the major search engines nowadays automatically suggest possible keyword queries as a user types in partial keywords. Both Google Finance and Yahoo! Finance support searching for stock information as users type in keywords. Most autocomplete systems treat a query with multiple keywords as a single string, and find answers with text that matches the string exactly. As an illustrative example, consider the search box on the home page of Apple Inc. that supports autocomplete search on its products. Although a keyword query “itunes” can find a record “itunes wi-fi music store”, a query with keywords “itunes music” cannot find this record (as of June 2009), simply because these two keywords appear at different places in the record.
To overcome this limitation, a new type-ahead search paradigm has emerged recently. A system using this paradigm treats a query as a set of keywords, and does a full-text search on the underlying data to find answers that include the keywords. An example is the CompleteSearch system on DBLP, which searches the underlying publication data “on the fly” as the user types in query keywords letter by letter. For instance, a query “music sig” can find publication records with the keyword “music” and a keyword that has “sig” as a prefix, such as “sigir”, “sigmod”, and “signature”. In this way, a user gets instant feedback after typing keywords, and can thus learn more about the underlying data and formulate a query more easily.
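The difference between matching a query as a single string and matching it as a set of keywords can be illustrated with a minimal Python sketch. The function names and the three sample records are assumptions made for illustration only; they are not taken from CompleteSearch or any other system mentioned above, and the linear scan stands in for a real index.

```python
def matches(record: str, query: str) -> bool:
    """Return True if every complete query keyword occurs in the record
    and the last (partial) keyword is a prefix of some record keyword."""
    record_words = record.lower().split()
    keywords = query.lower().split()
    if not keywords:
        return False
    # All keywords but the last are complete; the last is still being typed.
    *complete, partial = keywords
    if any(word not in record_words for word in complete):
        return False
    return any(word.startswith(partial) for word in record_words)


# Hypothetical sample data, for illustration only.
RECORDS = [
    "itunes wi-fi music store",
    "music retrieval sigir",
    "signature files for text",
]


def search(query: str) -> list:
    """Scan all records; a real system would use an index instead."""
    return [record for record in RECORDS if matches(record, query)]
```

Under these assumptions, `search("itunes music")` finds the record “itunes wi-fi music store” even though the two keywords are not adjacent in it, and `search("music sig")` finds only the record that contains “music” together with a keyword starting with “sig”.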
A main challenge in fuzzy type-ahead search is the requirement of high efficiency. To make search interactive, the delay from the time the user presses a key to the time the results computed by the server are displayed in the user's browser should be as small as possible. An interactive speed requires this delay to be within milliseconds. Notice that this time includes the network transfer delay, the execution time on the server, and the time for the browser to execute its JavaScript (which tends to be slow). High efficiency is especially important since the system needs to answer more queries than a traditional system, which answers a query only after the user finishes typing it.
The problem includes how to answer ranking queries in fuzzy type-ahead search on large amounts of data. Consider a collection of records such as the tuples in a relational table. As a user types in a keyword query letter by letter, we want to find, on the fly, records with keywords similar to the query keywords. We treat the last keyword in the query as a partial keyword the user is completing. We assume an index structure with a trie for the keywords in the underlying data, in which each leaf node has an inverted list of the records containing this keyword, together with the weight of this keyword in each record. As an example, Table IV shows a sample collection of publication records. For simplicity, we only list some of the keywords for each record. FIG. 9 shows the corresponding index structure.
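The assumed index structure can be sketched in Python as follows. This is a minimal illustration rather than the structure of FIG. 9: the class and function names, the record ids, and the weights are all hypothetical, and the inverted list is attached to the node where a keyword ends, which is a leaf only when the keyword is not a prefix of another keyword.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    children: dict = field(default_factory=dict)
    # (record_id, weight) pairs; non-empty only where a keyword ends.
    inverted_list: list = field(default_factory=list)


def build_index(records: dict) -> TrieNode:
    """records maps record_id -> {keyword: weight}."""
    root = TrieNode()
    for record_id, keywords in records.items():
        for keyword, weight in keywords.items():
            node = root
            for ch in keyword:
                node = node.children.setdefault(ch, TrieNode())
            node.inverted_list.append((record_id, weight))
    return root


def keywords_with_prefix(root: TrieNode, prefix: str) -> dict:
    """Return {keyword: inverted list} for all keywords starting with prefix."""
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return {}
    found = {}
    stack = [(node, prefix)]
    while stack:
        current, word = stack.pop()
        if current.inverted_list:
            found[word] = current.inverted_list
        for ch, child in current.children.items():
            stack.append((child, word + ch))
    return found


# Hypothetical records and weights, for illustration only.
SAMPLE = {
    1: {"music": 0.8, "icde": 0.6, "li": 0.9},
    2: {"music": 0.5, "icdt": 0.7, "lin": 0.4},
    3: {"liu": 0.6, "icde": 0.3},
}
INDEX = build_index(SAMPLE)
```

With this sample data, `keywords_with_prefix(INDEX, "li")` returns the inverted lists of “li”, “lin”, and “liu”, which is the exact-prefix case; the fuzzy case additionally has to consider trie nodes whose strings are only similar to the typed prefix.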
Suppose a user types in a query “music icde li”. We want to find records with keywords similar to these keywords, and rank them to find the best answers. To get the answers, for each query keyword, we can find keywords similar to the query keyword. For instance, both the keywords “icdt” and “icde” are similar to the second query keyword. The last keyword “li” is treated as a prefix condition, since the user is still typing at the end of this keyword. We find keywords that have a prefix similar to “li”, such as “lin”, “liu”, “lui”, “icde”, and “icdt”. We access the inverted lists of these similar keywords to find records and rank them to find the best answers for the user.
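One way to make “similar” concrete is edit distance, with the partial keyword compared against every prefix of a candidate keyword. The Python sketch below assumes an edit-distance threshold of 1 and a small hypothetical vocabulary; neither the threshold nor the function names are stated in the text above, and the linear scan over the vocabulary is for illustration only.

```python
def last_dp_row(query: str, keyword: str) -> list:
    """Return the row D[len(query)][*] of the edit-distance table, where
    D[i][j] is the edit distance between query[:i] and keyword[:j]."""
    previous = list(range(len(keyword) + 1))
    for i, qc in enumerate(query, start=1):
        current = [i]
        for j, kc in enumerate(keyword, start=1):
            cost = 0 if qc == kc else 1
            current.append(min(previous[j] + 1,          # delete from query
                               current[j - 1] + 1,       # insert into query
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous


def edit_distance(query: str, keyword: str) -> int:
    """Distance between the query keyword and the whole keyword."""
    return last_dp_row(query, keyword)[-1]


def prefix_edit_distance(query: str, keyword: str) -> int:
    """Smallest distance between the query keyword and any prefix of keyword."""
    return min(last_dp_row(query, keyword))


def similar_keywords(query_keyword, vocabulary, threshold=1, is_prefix=False):
    """Keywords within the (assumed) threshold of the query keyword."""
    measure = prefix_edit_distance if is_prefix else edit_distance
    return [w for w in vocabulary if measure(query_keyword, w) <= threshold]


# Hypothetical vocabulary, for illustration only.
VOCABULARY = ["music", "icde", "icdt", "lin", "liu", "lui"]
```

Under these assumptions, `similar_keywords("icde", VOCABULARY)` selects “icde” and “icdt”, and `similar_keywords("li", VOCABULARY, is_prefix=True)` selects “icde”, “icdt”, “lin”, “liu”, and “lui”. The keywords “icde” and “icdt” qualify for the partial keyword because their prefix “i” is within one edit of “li”.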
A key question is: how to access these lists efficiently to answer top-k queries? In the literature there are many algorithms for answering top-k queries by accessing lists. These algorithms share the same framework originally proposed by Fagin, in which we have lists of records sorted based on various conditions. An aggregation function takes the scores of a record from these lists and computes the final score of the record. There are two methods to access these lists: (1) Random Access: Given a record id, we can retrieve the score of the record on each list; (2) Sorted Access: We retrieve the record ids on each list following the list order.
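The two access methods can be illustrated with a sketch of the threshold algorithm, one well-known member of this family of top-k algorithms. The sketch assumes that the aggregation function is a sum, that a record missing from a list scores 0 on it, and that every list is sorted by descending score; the function name and the sample lists are hypothetical, and this is not presented as the method of the invention.

```python
import heapq


def threshold_top_k(lists, k):
    """lists: one list per keyword of (record_id, score) pairs, each sorted
    by descending score. Returns up to k (record_id, total_score) pairs."""
    # Random access: per-list lookup from record id to score.
    lookup = [dict(lst) for lst in lists]
    totals = {}  # aggregated scores of the records seen so far
    max_depth = max((len(lst) for lst in lists), default=0)
    for depth in range(max_depth):
        threshold = 0.0
        for lst in lists:
            if depth >= len(lst):
                continue  # exhausted list: unseen records score 0 here
            record_id, score = lst[depth]  # sorted access
            threshold += score
            if record_id not in totals:
                # Random access on every list to get the final score.
                totals[record_id] = sum(t.get(record_id, 0.0) for t in lookup)
        best = heapq.nlargest(k, totals.items(), key=lambda item: item[1])
        # No unseen record can score above the threshold, so stop early
        # once the k-th best seen record already reaches it.
        if len(best) == k and best[-1][1] >= threshold:
            return best
    return heapq.nlargest(k, totals.items(), key=lambda item: item[1])


# Hypothetical score lists, for illustration only.
LISTS = [
    [(1, 0.9), (2, 0.5), (3, 0.1)],
    [(2, 0.8), (1, 0.6), (3, 0.2)],
]
```

With these sample lists and k = 1, the loop stops after reading two entries from each list, without ever reaching record 3 under sorted access. The early stop is what makes the combination of sorted and random access attractive when the lists are long.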