The present invention relates generally to data acquisition and structuring systems and methods, and more particularly to a system and method for generating structured data outputs from semi-structured data inputs.
The general field of this invention relates to generating structured data outputs from semi-structured data inputs. A particular application of the invention is acquiring and structuring data to form virtual internet databases. Virtual internet databases are databases whose content is owned, stored and managed on servers distributed across a computer network.
Recently, internet usage and access has increased markedly. The availability and quantity of information on the internet has also increased. Many software products that can produce printed reports can now produce WEB reports. These products produce reports that may be displayed on a WEB page. This is accomplished by embedding the text of the report within the computer language called HTML. Although posted reports and information appear as data on the WEB page, this HTML representation is not a data representation. Rather, the WEB browser serves as a vehicle to display information much like that of a page in a textbook. This presents the problem of incompatibility between the HTML representation and the PC desktop and server applications. Ultimately, the current practice of employing WEB browsers has reduced PCs back to "dumb" terminals. The graphics may be exciting, but functionally all the computing power is limited to providing users with little more than a sophisticated data viewing window.
Several methods have been developed to address the problem of moving semi-structured data from the internet to a PC or server application. These methods include ad hoc engineering methods, Graphical User Interface (GUI) methods, and machine learning methods.
Ad hoc methods entail writing specialized parsing programs in a language such as PERL or LEX to extract the necessary information. These types of programs are called wrappers. A wrapper is a software method that converts data such as HTML code into structured data for further processing. These programs use regular expressions in the parsing process. Unfortunately, these ad hoc methods are labor intensive. Depending on the skill of the programmer and the complexity of the particular job, a wrapper can take days to develop. Also, these methods are not an option for an average internet user with no formal training or knowledge of HTML and programming methods.
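By way of illustration only, an ad hoc wrapper of the kind described above might resemble the following minimal sketch. It is written in Python rather than PERL or LEX, and the page markup, the field names (`item`, `price`), and the function name `wrap` are all hypothetical; they are not drawn from any particular prior system.

```python
import re

# Hypothetical report page; the markup and values are invented for illustration.
HTML_PAGE = """
<html><body>
<table>
<tr><td>Widget</td><td>4.50</td></tr>
<tr><td>Gadget</td><td>12.00</td></tr>
</table>
</body></html>
"""

# Hand-written regular expression tied to the exact tag layout of the page.
ROW_PATTERN = re.compile(r"<tr><td>([^<]+)</td><td>([0-9.]+)</td></tr>")


def wrap(html):
    """Convert the table rows embedded in the HTML into structured records."""
    return [
        {"item": name, "price": float(price)}
        for name, price in ROW_PATTERN.findall(html)
    ]


records = wrap(HTML_PAGE)
print(records)
```

Even in this small case, the programmer must inspect the HTML source and encode its exact tag layout into the pattern, which is the labor described above.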
Due to the tedious nature of custom wrapper design, further methods have been developed that employ GUIs to facilitate wrapper generation. The GUI hides all the engineering details beyond the extracted data pattern definitions. Like the ad hoc methods discussed above, these packages implement regular expression parsing algorithms. In general, these methods require some knowledge of both HTML and regular expressions; therefore, they may not be suitable for some internet users.
Due to the use of regular expressions, both ad hoc methods and GUI methods can result in what are called brittle parses. A brittle parse is one that fails when the format of the HTML page changes. A single format change is not guaranteed to break the parse, but the likelihood is sufficiently high as to prevent any guarantees of robust behavior.
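The brittleness described above can be seen in a minimal sketch, here in Python. The pattern and both page fragments are hypothetical. The second fragment carries the same data as the first, with only cosmetic markup added.

```python
import re

# Hypothetical pattern written against the original page layout.
ROW_PATTERN = re.compile(r"<tr><td>([^<]+)</td><td>([0-9.]+)</td></tr>")

# The same data before and after a purely cosmetic change by the content provider.
original = "<tr><td>Widget</td><td>4.50</td></tr>"
reformatted = '<tr><td class="name">Widget</td><td><b>4.50</b></td></tr>'


def extract(html):
    """Return the (item, price) string pairs the pattern finds in the HTML."""
    return ROW_PATTERN.findall(html)


original_rows = extract(original)        # the pattern matches the original layout
reformatted_rows = extract(reformatted)  # the added attribute and <b> tag defeat it
print(original_rows)
print(reformatted_rows)
```

The data values are unchanged in the second fragment, yet the parse returns nothing, because the expression depends on the literal tag sequence rather than on the data itself.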
Recently, machine learning methods have been developed to address the need for engineering skills in the development of wrappers. Given a set of similar WEB pages and an example of the data to be parsed from each page, these methods automatically generate a wrapper. Unfortunately, these methods require a large number of examples to reliably produce wrappers. An example of such a method can be found in "A Hierarchical Approach to Wrapper Induction," Muslea, et al. (1999). This method may require 8-10 examples to produce the wrappers. The generated wrappers are based on regular expression techniques and are brittle. Although these wrappers may work for format changes known prior to wrapper generation, they may fail on empirical format changes, as do the regular expression based methods discussed above.
It is desirable to develop a method that gives a user access to semi-structured data for a PC or server application without requiring the user to have previous knowledge of HTML or regular expressions. In addition, it is advantageous if the method does not require the enumeration of examples covering possible format changes.
The present invention provides a system and method for acquiring and structuring data from semi-structured data sources that substantially eliminates or reduces disadvantages and problems associated with previously developed systems and methods used for developing structured data sources from on-line sources such as the Internet, intranets, or other network systems.
More specifically, the present invention provides a system and method for generating structured data outputs from semi-structured data sources. The steps of this method include generating an example output from an example generator. The example output is generated in response to the acquisition of a sequence of annotated strings. The annotated strings are generated in response to the acquisition and modification of as few as one data example and a corresponding coarse structure from a predetermined input source. Also, a second sequence of annotated strings is generated from input from a semi-structured data source. Both the example output and the second sequence of annotated strings are input to an acquisition engine that implements a grammar layer incorporating a top-down parsing method and a comparison layer. The structured data outputs are generated through the cooperation of the comparison layer and the grammar layer.
The present invention provides an important technical advantage in that it does not require the user to have knowledge of HTML or knowledge of pattern matching languages. The graphical interface guides the user through a set-up phase and completely hides all technical details.
The present invention provides another important technical advantage in that it requires only a single data example. Once the set-up process is complete, the acquisition engine can be pointed to related WEB pages, as well as updated versions of the same page, and it will automatically extract data and route it to applications.
The present invention provides yet another technical advantage in that the system is able to cope with format changes in the source pages, including changes in the order of data values. Thus, the technology produces reliable results even when the data sources are re-formatted, updated or amended by the content providers.