The present invention relates to extracting information from a database. More specifically, the invention relates to searching for tuples of information in order to identify patterns in which the tuples were stored so that additional tuples can be extracted from the database.
The Internet, and more particularly the World Wide Web (“Web”), is a vast repository of information that is extremely distributed, both in the location and manner in which the information is stored. For example, a particular type of data such as restaurant lists may be scattered across thousands of independent information sources (e.g., hosts) in many different formats.
One way that this information is extracted from the Web is by individual users who traverse (“surf”) the Web to locate information of interest and manually extract it. This method is very tedious and does not easily provide a comprehensive search. Although multiple users can be employed to perform this manual information extraction, the cost of mining the desired information from the Web is extremely high, and the approach still does not provide adequate coverage of the Web.
There has also been considerable work on integrating a number of information sources using specially coded wrappers or filters. Although this work has met with some success, the creation of wrappers can be quite time consuming and thus is usually suited to only tens, not thousands (or more), of information sources. Considering the vast size of the Web and its continual growth, the manual creation of wrappers does not provide an efficient mechanism for extracting information from a database such as the Web.
Therefore, what are needed are innovative techniques for extracting information from databases. Additionally, it would be desirable if the relevant information were extracted from the numerous and distributed information sources automatically or with very minimal human intervention.