1. Field of the Invention
The present invention relates generally to a network data extraction system, and in particular to a system for automatically extracting data from semi-structured web sites.
2. Discussion of Prior Art
Previous approaches to web data extraction fall primarily into two categories. First, “wrapper induction” systems learn site-specific rules for extracting data using positional and content cues. These systems may be trained to recognize and extract data from specific page types on a web site. Second, other systems crawl through a web site looking for particular types of content, such as job postings or seminar announcements. These systems are site-independent, but they must generally be trained to recognize the targeted content.
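By way of illustration only, a site-specific rule of the kind a wrapper induction system might learn can be sketched as a pair of left and right delimiters for each field. The Python sketch below applies such rules; the rules are written by hand rather than induced from training pages, and the page fragment, field names, and function names are hypothetical and are not taken from any particular prior system.

```python
import re

def extract_between(page, left, right):
    """Return every substring found between a left and a right delimiter."""
    pattern = re.escape(left) + r"(.*?)" + re.escape(right)
    return re.findall(pattern, page, flags=re.DOTALL)

def apply_wrapper(page, rules):
    """Apply one (left, right) delimiter pair per field and align the
    extracted values into records by their order of appearance."""
    columns = {field: extract_between(page, left, right)
               for field, (left, right) in rules.items()}
    return [dict(zip(columns, values)) for values in zip(*columns.values())]

# Hypothetical page fragment and delimiter rules, for illustration only.
PAGE = ("<li><b>Widget</b> <i>$4.00</i></li>"
        "<li><b>Gadget</b> <i>$7.50</i></li>")
RULES = {"name": ("<b>", "</b>"), "price": ("<i>", "</i>")}

records = apply_wrapper(PAGE, RULES)
```

Because the delimiters are tied to the markup of one page layout, rules of this kind generally do not carry over to a differently formatted site, which is why such systems are described as site-specific.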