The present invention relates to extracting information from a database. More specifically, the invention relates to searching for tuples of information in order to identify patterns in which the tuples were stored so that additional tuples can be extracted from the database.
The Internet, and more particularly the World Wide Web (“Web”), is a vast repository of information that is extremely distributed, both in the location and manner in which the information is stored. For example, a particular type of data such as restaurant lists may be scattered across thousands of independent information sources (e.g., hosts) in many different formats.
One way that this information is extracted from the Web is by individual users who traverse (“surf”) the Web to locate information of interest and manually extract the information. It should be quite evident that this method is very tedious and does not easily provide a comprehensive search. Although multiple users can be employed to perform this manual information extraction, the cost of mining the desired information from the Web in this way is extremely high, and the approach still does not provide adequate coverage of the Web.
There has also been considerable work on integrating a number of information sources using specially coded wrappers or filters. Although this work has met with some amount of success, the creation of wrappers can be quite time-consuming and thus is usually suited to only tens, not thousands (or more), of information sources. Considering the vast size of the Web and its continual growth, the manual creation of wrappers does not provide an efficient mechanism for extracting information from a database such as the Web.
Therefore, what are needed are innovative techniques for extracting information from databases. Additionally, it would be desirable if the relevant information were extracted from the numerous and distributed information sources automatically or with very minimal human intervention.
The present invention provides innovative techniques for extracting information and patterns from a database such as the Web. One can begin with one or more tuples of information that act as the initial seed for the search. The database (or databases) is searched for occurrences of the tuples and patterns are identified in which they are stored. These patterns are used to extract more tuples from the database and the process can be repeated for the new tuples. Information can be extracted from a database efficiently and accurately with little or no human interaction. Some specific embodiments of the invention are described below.
In one embodiment, the invention provides a computer implemented method of extracting information from a database. The database is searched for occurrences of at least one tuple of information. An occurrence of a tuple of information that was found is analyzed to identify a pattern in which the tuple of information was stored. Additional tuples of information are extracted from the database utilizing the pattern. In some embodiments, the process is repeated until a predetermined number of tuples are found or until no new patterns are identified.
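By way of illustration only, the search, analyze, extract and repeat steps of this embodiment might be sketched as follows. The sketch assumes that the database is an in-memory list of text documents, that each tuple has two fields, and that a pattern is simply the whole line surrounding an occurrence with the two fields replaced by wildcards. The names find_patterns, extract_tuples, bootstrap and max_tuples are hypothetical and do not appear in the specification.

```python
import re

def find_patterns(documents, tuples):
    """Search for occurrences of known tuples and turn each containing line
    into a regular expression with the two fields replaced by wildcards."""
    patterns = set()
    for doc in documents:
        for line in doc.splitlines():
            for first, second in tuples:
                i = line.find(first)
                j = line.find(second, i + len(first)) if i >= 0 else -1
                if j < 0:
                    continue  # this tuple does not occur on this line
                patterns.add(
                    re.escape(line[:i]) + "(.+?)"
                    + re.escape(line[i + len(first):j]) + "(.+?)"
                    + re.escape(line[j + len(second):])
                )
    return patterns

def extract_tuples(documents, patterns):
    """Apply each pattern to every line and collect the captured fields."""
    found = set()
    for doc in documents:
        for line in doc.splitlines():
            for pattern in patterns:
                match = re.fullmatch(pattern, line)
                if match:
                    found.add(match.groups())
    return found

def bootstrap(documents, seeds, max_tuples=100):
    """Repeat until enough tuples are found or no new pattern is identified."""
    tuples, patterns = set(seeds), set()
    while len(tuples) < max_tuples:
        new_patterns = find_patterns(documents, tuples) - patterns
        if not new_patterns:
            break
        patterns |= new_patterns
        tuples |= extract_tuples(documents, new_patterns)
    return tuples, patterns
```

In this sketch, a single seed tuple found in one document yields a pattern that extracts further tuples from that document, and any of those tuples that also occur in a differently formatted document yield a second pattern on the next pass. A practical implementation would likely need to filter out overly general patterns, which this sketch does not attempt.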
In another embodiment, the invention provides a computer implemented method of extracting information from a database. The database is searched for occurrences of tuples of information. Occurrences of the tuples of information that were found are analyzed to identify a pattern in which the tuples of information were stored. A pattern includes a prefix text, a middle text and suffix text, where the prefix text precedes desired information in the tuples of information, the middle text is between desired information in the tuples of information and the suffix text follows desired information in the tuples of information. Additional tuples of information are extracted from the database utilizing the pattern and the process is repeated for additional tuples of information.
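As a minimal sketch of how a pattern having a prefix text, a middle text and a suffix text might be represented and applied, consider the following. It assumes two-field tuples and a fixed number of context characters on each side of an occurrence. The names Pattern, pattern_from_occurrence, extract and the context parameter are hypothetical, chosen only for illustration.

```python
import re
from typing import NamedTuple

class Pattern(NamedTuple):
    prefix: str  # text preceding the first field
    middle: str  # text between the two fields
    suffix: str  # text following the second field

def pattern_from_occurrence(text, first, second, context=4):
    """Derive a pattern from one occurrence of the tuple (first, second).
    Assumes the occurrence has non-empty text on both sides."""
    i = text.find(first)
    if i < 0:
        return None
    j = text.find(second, i + len(first))
    if j < 0:
        return None
    end = j + len(second)
    return Pattern(
        prefix=text[max(0, i - context):i],
        middle=text[i + len(first):j],
        suffix=text[end:end + context],
    )

def extract(text, pattern):
    """Return every (first, second) pair in text that fits the pattern."""
    regex = re.compile(
        re.escape(pattern.prefix) + r"(.+?)"
        + re.escape(pattern.middle) + r"(.+?)"
        + re.escape(pattern.suffix)
    )
    return regex.findall(text)
```

For example, given the text `<li><b>Dune</b> by Frank Herbert</li>` followed on the next line by `<li><b>Emma</b> by Jane Austen</li>`, the occurrence of ("Dune", "Frank Herbert") gives the prefix `><b>`, the middle `</b> by ` and the suffix `</li`, and applying that pattern to the same text also recovers ("Emma", "Jane Austen"). The choice of four context characters is arbitrary; longer context makes a pattern more specific, and shorter context makes it more general.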