1. Field of the Invention
This invention relates to data mining, and more particularly to a system that determines data characteristics for processing purposes and applications. Determining data characteristics facilitates the recognition and extraction of data for information processing purposes.
2. Description of the Related Art
The proliferation of online information sources has accentuated the need for tools that automatically validate and recognize data. As the size and diversity of online information grows, many of the common transactions between humans and online information systems are being delegated to software agents. Price comparison and stock monitoring are just two examples of the tasks that can be assigned to an agent. Many protocols and languages have been designed to facilitate information exchange, from primitive EDI protocols to more modern languages like XML and KQML. However, these schemes all require that the agents conform to syntactic and semantic standards that have been carefully worked out in advance. Humans, in contrast, get by with only informal conventions to facilitate communications, and rarely require detailed pre-specified standards for information exchange.
In examining typical web pages (at electronic catalogs, financial sites, etc.), we find that information is laid out in a variety of graphical arrangements, often with minimal use of natural language. People are able to understand web pages because they have expectations about the structure of the data appearing on the page and its physical layout. For instance, in looking at the online Zagat's restaurant guide, one finds a restaurant's name, an address, a review, etc. Though none of these are explicitly labeled as such, the page is immediately understandable because people expect this information to be on the page, and have expectations about the appearance of these fields (e.g., people know that a U.S. address typically begins with a street address and ends with a zip code). The layout of the page helps to demarcate the fields and their relationships (e.g., the restaurant name is at the top of the page; the phone and address appear close together, etc.).
Several researchers have addressed the problem of learning the structure of data. Grammar induction algorithms, for example, learn the common structure of a set of strings. Carrasco and Oncina (1994) propose ALERGIA, a stochastic grammar induction algorithm that learns a regular language given the strings belonging to the language. ALERGIA starts with a finite state automaton (FSA) that is initialized to be a prefix tree that represents all the strings of the language. The FSA is generalized by merging pairs of statistically similar subtrees. ALERGIA tends to merge too many states, even at a high confidence limit, leading to an over-general grammar. The resulting automaton frequently has loops in it, corresponding to regular expressions like a(b*)c. However, the data in a single field is seldom described by such repeated structures.
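The prefix-tree initialization described above can be sketched as follows. This is a minimal illustration only: the function names, the integer state numbering, and the count tables are assumptions made for this example and are not taken from Carrasco and Oncina. The statistical merging step is deliberately omitted.

```python
from collections import defaultdict

def build_prefix_tree(strings):
    """Build a prefix-tree acceptor; states are integers and 0 is the root.

    Hypothetical helper for illustration, not the ALERGIA implementation.
    """
    transitions = {}                  # (state, symbol) -> next state
    counts = defaultdict(int)         # (state, symbol) -> strings using it
    final_counts = defaultdict(int)   # state -> strings ending in that state
    next_state = 1
    for s in strings:
        state = 0
        for ch in s:
            key = (state, ch)
            if key not in transitions:
                transitions[key] = next_state
                next_state += 1
            counts[key] += 1
            state = transitions[key]
        final_counts[state] += 1
    return transitions, counts, final_counts

def accepts(transitions, final_counts, s):
    """Return True if s follows existing transitions and ends in a final state."""
    state = 0
    for ch in s:
        key = (state, ch)
        if key not in transitions:
            return False
        state = transitions[key]
    return final_counts.get(state, 0) > 0

transitions, counts, final_counts = build_prefix_tree(["abc", "abbc", "ac"])
print(accepts(transitions, final_counts, "abc"))    # a training string
print(accepts(transitions, final_counts, "abbbc"))  # not seen in training
```

Before any merging, the tree accepts exactly the training strings, so "abbbc" is rejected. Accepting it would require merging the states reached after each "b" into a loop, which is the kind of generalization that yields a(b*)c.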
Goan et al. (1996) proposed modifications to ALERGIA aimed at reducing the number of bad merges. They also introduced syntactic categories similar to the ones in the present invention. Each symbol can belong to one of these categories. Goan et al. added a new generalization step in which the transitions corresponding to symbols of the same category that are approximately evenly distributed over the range of that category (e.g., 0-9 for numerals) are replaced with a single transition. Though the proposed modifications make the grammar induction algorithm more robust, the final FSA is still sensitive to the merge order. Moreover, it does not allow for multi-level generalization, which has been found to be useful. The algorithm also requires dozens, if not hundreds, of examples in order to learn the correct grammar.
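The category-based generalization step might look roughly like the sketch below. The category table, the function name, and in particular the test for "approximately evenly distributed" (a coverage threshold plus a bound on the ratio of the largest to the smallest count) are assumptions made for illustration. The criterion actually used by Goan et al. is not given here.

```python
import string

# Hypothetical syntactic categories; each maps a name to its symbol range.
CATEGORIES = {
    "DIGIT": set(string.digits),
    "LOWER": set(string.ascii_lowercase),
    "UPPER": set(string.ascii_uppercase),
}

def generalize_transitions(symbol_counts, min_coverage=0.8, max_ratio=3.0):
    """Collapse the outgoing transitions of one state by category.

    symbol_counts maps a single-character symbol to its transition count.
    A category replaces its member symbols when enough of its range is
    observed and the counts are roughly even (both thresholds are assumed).
    """
    result = dict(symbol_counts)
    for name, members in CATEGORIES.items():
        seen = {s: c for s, c in symbol_counts.items() if s in members}
        if not seen:
            continue
        coverage = len(seen) / len(members)
        ratio = max(seen.values()) / min(seen.values())
        if coverage >= min_coverage and ratio <= max_ratio:
            for s in seen:
                del result[s]
            result[name] = sum(seen.values())
    return result

# Ten digit transitions with equal counts collapse into one DIGIT transition.
counts = {str(d): 10 for d in range(10)}
counts["-"] = 5
print(generalize_transitions(counts))
```

When only a few digits appear, or their counts are heavily skewed, the transitions are left unchanged.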
FOIL (Quinlan 1990) is a system that learns first order predicate logic clauses defining a class from a set of positive and negative examples of the class. FOIL finds a discriminating description that covers many positive and none of the negative examples.
FOIL.6 (Quinlan 1990) was used with the '-n' option (no negative literals) to learn data prototypes for several data fields. In all cases, the closed world assumption was used to construct negative examples from the known objects; thus, names and addresses were the negative examples for the phone number class for a white pages source. In most cases there were many similarities between the clauses learned by FOIL and the patterns learned by DataPro™ (the name of the system of the present invention); however, the descriptions learned by FOIL tended to be overly general. For example, FOIL may learn that a '(' is a sufficient description of phone numbers when the only negative examples are addresses and names. Such a description does not generalize, as a '(' alone will not be sufficient to recognize phone numbers on a random page. Such over-generalization may arise from the incompleteness of the set of negative examples presented to FOIL. Another problem arises when FOIL is given examples of a class with little structure, such as names and book titles. FOIL tends to create clauses that only cover a single example, or fails altogether to find any clauses.
The present invention provides an efficient algorithm that learns structural information about data from positive examples alone. Two Web wrapper maintenance applications may employ this algorithm. The first application detects when a wrapper is not extracting correct data. The second application automatically identifies data on Web pages so that the wrapper may be re-induced when the source format changes.
An important aspect of the present invention is that it focuses on generalizing token sequences according to a type hierarchy. Most previous work in the area has focused on generalizations that capture repeated patterns in a sequence (e.g., learning regular expressions). Though this direction has not yet attracted much attention, it is believed that it will prove useful for a wide variety of data validation tasks.
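The idea of generalizing token sequences along a type hierarchy can be sketched as follows. The hierarchy shown (literal token, then NUMBER, CAPITALIZED, ALLCAPS or LOWER, then ALPHA, ALNUM, PUNCT and TOKEN), the tokenizer, and all function names are hypothetical choices made for this example. The sketch should not be read as the DataPro™ algorithm itself; it only shows how several positive examples might be summarized by the most specific type shared at each starting position.

```python
import re

def tokenize(text):
    """Split text into alphabetic words, digit runs, and single punctuation marks."""
    return re.findall(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]", text)

def type_chain(token):
    """Return the token's types, most specific first (assumed hierarchy).

    The literal token is quoted so it can never collide with a type name.
    """
    chain = [f"'{token}'"]
    if token.isdigit():
        chain += ["NUMBER", "ALNUM"]
    elif token.isalpha():
        if token.isupper():
            chain.append("ALLCAPS")
        elif token[0].isupper():
            chain.append("CAPITALIZED")
        elif token.islower():
            chain.append("LOWER")
        chain += ["ALPHA", "ALNUM"]
    else:
        chain.append("PUNCT")
    chain.append("TOKEN")
    return chain

def most_specific_common_type(tokens):
    """Return the most specific type shared by every token."""
    chains = [type_chain(t) for t in tokens]
    for candidate in chains[0]:
        if all(candidate in chain for chain in chains[1:]):
            return candidate
    return "TOKEN"  # unreachable in practice: every chain ends in TOKEN

def start_pattern(examples, length):
    """Generalize the first `length` tokens of the examples, position by position."""
    token_lists = [tokenize(e) for e in examples]
    n = min(length, min(len(tokens) for tokens in token_lists))
    return [most_specific_common_type([tokens[i] for tokens in token_lists])
            for i in range(n)]

phones = ["(310) 555-1234", "(213) 555-9876", "(818) 444-0000"]
print(start_pattern(phones, 4))
```

For the three phone numbers, the literal "(" and ")" are kept because every example shares them, while the differing area codes are generalized one level up to NUMBER. No negative examples are needed, and the result is more specific than a lone "(".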
Future possibilities for the DataPro™ algorithm of the present invention include cross-site extraction, e.g., learning the author, title and price fields for the Amazon site, and using them to extract the same fields on the Barnes and Noble web site. Preliminary results show that this is feasible, though challenging. Additionally, the DataPro™ approach may be used to learn characteristic patterns within data fields (as opposed to just start and end patterns). The algorithm is efficient enough to be used in an "unanchored" mode, and there are many applications where this would be useful.
The present art is advanced by the present invention, which provides a machine method for acquiring expectations about the content of data fields. The method learns structural information about data to recognize restaurant names, addresses, phone numbers, etc. The present invention provides two applications related to wrapper induction that utilize structural information. A web page wrapper is a program that extracts data from a web page. For instance, Zagat's restaurant web site presents information regarding restaurants, including reviews, and a Zagat's wrapper might extract the address, phone number, review, etc., from Zagat's pages. The first application involves verifying that an existing wrapper is extracting correctly. The second application involves re-inducing a wrapper when the underlying format of a site changes. Although the present invention may focus on web applications, the learning technique is not web-specific, and can be used to learn about text and numeric fields in general. It is contemplated that the present invention is a step towards true agent interoperability, where agents can exchange and aggregate data without needing to know in advance about the detailed syntactic and semantic conventions used by their partners.
It is an object of the present invention to provide a method for learning the structure of data fields.
It is an object of the present invention to provide a process by which data in a semi-structured form may be made subject to extraction.
It is an object of the present invention to facilitate wrapper generation for web pages.
It is an object of the present invention to provide a process for verifying the integrity and reliability of information extracted by a wrapper.
It is an object of the present invention to provide a process to re-induce wrapper rules when the structure of underlying data has been changed.