1. Field of the Invention
This invention relates to the field of mining data from a distributed source of data.
2. Background
Some World Wide Web search engines currently index over a billion web pages. These pages contain information about almost anything of interest to humanity. However, an individual who is trying to make sense of an entire body of knowledge often has difficulty quickly locating the information of interest.
Web Mining systems, next-generation search engines, and Online Shopping tools are examples of attempts to bring relevant information to a user. There is related work in each of these three areas. Mapping vendor spaces using high-level relations, Doug Bryan, First SIAM Int'l Conference on Data Mining, Chicago, Apr. 7, 2001, pages 59-62, describes a Web Mining system that finds vendors (manufacturers) that appear to be related to a given manufacturer. Bryan's system sends queries to a variety of on-line services that produce lists of links or lists of company names (for example, the GOOGLE search engine and ALTAVISTA search engine have a related pages feature, and these and other search engines organize links into directories). In addition, a news story can be treated as a list of company names, where companies are considered related if they appear in the same story; related company names can thus be extracted from each story. Finally, finance portals like CNBC services, HOOVER'S services, and QUICKEN services provide written profiles that list related companies. Bryan's technique then combines this evidence that companies are related to form a list of companies that are most related to a given candidate phrase.
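By way of illustration only, the following minimal sketch shows one way that co-occurrence evidence of the kind described above might be combined. The function name, the sample company names, and the simple count-based score are assumptions made for this example and are not taken from Bryan's paper.

```python
from collections import Counter


def related_companies(sources, target, top_n=3):
    """Rank companies by the number of source lists they share with target.

    sources is a sequence of lists of company names. In this hypothetical
    setup each list stands for one news story, directory page, or profile.
    """
    counts = Counter()
    for names in sources:
        unique = set(names)  # count each company at most once per source
        if target in unique:
            for other in unique - {target}:
                counts[other] += 1
    return counts.most_common(top_n)


# Hypothetical evidence: each inner list is the set of names found in one source.
stories = [
    ["Acme", "Globex", "Initech"],
    ["Acme", "Globex"],
    ["Globex", "Umbrella"],
    ["Acme", "Initech", "Globex"],
]
ranked = related_companies(stories, "Acme")
```

In this sketch a company that never appears in the same source as the target receives no score at all, which is one simple way of treating shared appearance as evidence of relatedness.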
Our approach to finding brands, breeds or other search type selections of a category term is different from Bryan's approach in that we start with a search type selection (for example, breed, brand, or some other search type) and a category term (for example, a generic product), not a company name, and find candidate phrases (for example, a set of brands) related to the category term.
A paper, Learning to Understand Information on the Internet: An Example-Based Approach, Perkowitz et al., Journal of Intelligent Information Systems, Vol. 8, No. 2, pages 133-153, March 1997, describes the ShopBot and ILA programs. The ShopBot program learns how to use special-purpose search engines found at many on-line vendor sites. It then extracts information, such as selling price, for a user-specified product model from several vendor sites. The ShopBot program helps users find detailed information once they already know product models.
Aspects of the invention use general-purpose search engines to find candidate phrases, given a user-specified category term and search type selection. Aspects of the invention help users make sense of an entire product space.
Learning to extract symbolic knowledge from the World Wide Web, Craven et al., Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI 1998), pages 509-516, and Information extraction from HTML: Application of a general machine learning approach, Freitag, Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI 1998), pages 517-523, disclose a system that populates a knowledge base using information from the World Wide Web.
Freitag's approach is quite different from the approach disclosed herein as Freitag does not use search engine results. In addition, Freitag's approach requires labeled training data (labeled web pages and labeled links). The information-extraction system, SRV, described in these papers does include a sub-technique for placing constraints on phrases. For example, SRV does have rules requiring a word to be capitalized, numeric, all upper case, or all lower case. However, unlike aspects of the invention, no rules are described limiting what specific characters may or may not be present (except that numeric implies a set of digits), nor how many of them may be present.
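By way of illustration only, the distinction drawn above can be sketched as follows. The first function computes word-level features of the kind attributed to SRV (capitalized, numeric, all upper case, all lower case). The second applies a rule on which specific characters may appear in a phrase and how many times. The allowed character set and the per-character limits are hypothetical values chosen for this example and are not taken from the cited papers or from any embodiment.

```python
import string


def word_features(word):
    """Word-level features of the kind attributed to SRV."""
    return {
        "capitalized": word[:1].isupper(),
        "numeric": word.isdigit(),
        "all_upper": word.isupper(),
        "all_lower": word.islower(),
    }


# Hypothetical character-level rule: which characters a phrase may contain
# and how many times certain characters may occur.
ALLOWED_CHARACTERS = set(string.ascii_letters + string.digits + " &'-")
MAX_OCCURRENCES = {"-": 1, "&": 1}


def satisfies_character_rule(phrase):
    """Return True if the phrase uses only allowed characters within limits."""
    if not phrase or any(ch not in ALLOWED_CHARACTERS for ch in phrase):
        return False
    return all(phrase.count(ch) <= limit for ch, limit in MAX_OCCURRENCES.items())
```

Under these assumed limits, a phrase such as "Hewlett-Packard" passes the character rule, while a phrase with two hyphens or with an exclamation mark does not.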
Etzioni's KnowItAll search engine (under development at the University of Washington) uses a linguistic approach to find data on the World Wide Web and collates it in the form of a list. Because KnowItAll extracts phrases from sentences based on the linguistic role of the phrase and on surrounding words, it is unable to discover phrases in structures other than sentences (for example, phrases in a bulleted list or phrases in a table). In addition, KnowItAll does not account for punctuation between the words of the phrase nor does it verify the correctness of a phrase such as by performing a targeted-site network search.
U.S. Pat. No. 6,678,681 B1 issued to Brin on Jan. 13, 2004, entitled Information Extraction from a Database, discloses techniques for extracting information from a database. Tuples of information are searched for, the result of the search is analyzed for a pattern, and then additional tuples of information that follow the pattern are searched for in the database. Brin's technique starts with example strings and searches through a pre-determined collection of documents. Brin's technique looks for tuples of information, such as (author, title) pairs, and sorts found tuples based on what text occurs between the elements of each tuple and the order in which they occur. Brin's technique learns and discovers patterns in the text and tags that immediately precede, follow, and divide the tuples found so far. If good patterns are found, the algorithm will work well. If not, it can diverge. Brin's method evaluates each tuple based on the number of patterns that it matches, which is a fairly unstable and unreliable metric. While Brin's technique does examine the URL to see if its host name matches the candidate phrase, it does so only to see if the URL is similar to other URLs in which matching tuples have been found. Furthermore, Brin's technique does not accept any information about the desired category of items, but only examples from the category. So, even if it could accept 1-tuples, like “SONY” and “TOSHIBA”, it could not know whether to converge on brands of “DVD player” or “notebook computers”, for example. This makes Brin's algorithm susceptible to drift, for example from books into articles. Finally, Brin does not teach how to correct tuples based on additional evidence.
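By way of illustration only, the general search-then-generalize loop described above can be sketched in a greatly simplified form. This sketch is not Brin's algorithm: it learns only the text that divides the two elements of a seed tuple and ignores the preceding text, the following text, the URL, and the element order. The function names, the 20-character limit on the dividing text, and the sample documents are assumptions made for this example.

```python
import re


def learn_middles(documents, seeds):
    """Collect the text found between the two elements of each seed tuple."""
    middles = set()
    for doc in documents:
        for first, second in seeds:
            # Assumed limit: at most 20 characters may divide the two elements.
            pattern = re.escape(first) + r"(.{1,20}?)" + re.escape(second)
            match = re.search(pattern, doc)
            if match:
                middles.add(match.group(1))
    return middles


def find_tuples(documents, middles):
    """Return every (first, second) pair on a line that a learned middle splits."""
    found = set()
    for doc in documents:
        for line in doc.splitlines():
            for middle in middles:
                match = re.fullmatch(r"(.+?)" + re.escape(middle) + r"(.+)", line.strip())
                if match:
                    found.add((match.group(1), match.group(2)))
    return found


# Hypothetical collection of documents and a single seed tuple.
documents = [
    "Isaac Asimov, author of Foundation\nFrank Herbert, author of Dune",
    "Jane Austen, author of Emma",
]
seeds = {("Isaac Asimov", "Foundation")}
middles = learn_middles(documents, seeds)
found = find_tuples(documents, middles)
```

Because nothing in the sketch identifies the desired category, a dividing text that happens to occur in unrelated lines would pull in unrelated pairs, which illustrates the drift noted above.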
It would be advantageous to have an automatic algorithm for discovering lists of brands, breeds, and other classifications starting with no information other than a search type selection and a category term. In addition, it would be advantageous to be able to receive a partial list of candidate phrases (that may include incorrect data) and a category term and to expand and correct the list. Furthermore, it would be advantageous to be able to extract human readable lists from documents and to use the extracted lists.