Traditionally, Information Extraction (IE) has focused on satisfying precise, narrow, pre-specified requests from small homogeneous corpora (e.g., extracting the location and time of seminars from a set of announcements). Shifting to a new domain requires the user to name the target relations and to manually create new extraction rules or hand-tag new training examples, and this manual labor scales linearly with the number of target relations. Thus, IE has traditionally relied on extensive human involvement, and the user has been required to explicitly pre-specify each relation of interest. While IE has become increasingly automated over time, enumerating all potential relations of interest for extraction by an IE system is highly problematic for corpora as large and varied as the Web. To make it possible for users to issue diverse queries over heterogeneous corpora, IE systems must move away from architectures that require relations to be specified prior to query time in favor of those that aim to discover all possible relations in the text.
In the past, IE has been used on small, homogeneous corpora such as newswire stories or seminar announcements. As a result, traditional IE systems are able to rely on “heavy” linguistic technologies tuned to the domain of interest, such as dependency parsers and Named-Entity Recognizers (NERs). These systems were not designed to scale relative to the size of the corpus or the number of relations extracted, as both parameters were fixed and small.
The problem of extracting information from the Web violates all of these assumptions. Corpora are massive and heterogeneous, the relations of interest are unanticipated, and their number can be large. These challenges are discussed below in greater detail.
The first step in automating IE was moving from knowledge-based IE systems to trainable systems that took as input hand-tagged instances or document segments and automatically learned domain-specific extraction patterns. Certain prior approaches, including Web-based question answering systems, further reduced the manual labor needed for relation-specific text extraction by requiring only a small set of tagged seed instances or a few hand-crafted extraction patterns per relation to launch the training process. Still, the creation of suitable training data required substantial expertise as well as non-trivial manual effort for every relation extracted, and the relations had to be specified in advance.
Previous approaches to relation extraction have employed kernel-based methods, maximum-entropy models, graphical models, and co-occurrence statistics over small, domain-specific corpora and limited sets of relations. The use of NERs, as well as syntactic or dependency parsers, is a common thread that unifies most previous work. But this rather “heavy” linguistic technology runs into problems when applied to the heterogeneous text found on the Web. While the parsers of the prior approaches work well when trained and applied to a particular genre of text, such as financial news data in the Penn Treebank, they make many more parsing errors when confronted with the diversity of Web text. Moreover, the number and complexity of entity types on the Web mean that existing NER systems are inapplicable.
Recent efforts by others seeking to undertake large-scale extraction indicate a growing interest in the problem. This year, other researchers proposed a paradigm for “on-demand information extraction,” which aims to eliminate the customization involved in adapting IE systems to new topics. Using unsupervised learning methods, this earlier system automatically creates patterns and performs extraction based on a topic that has been specified by a user.
In addition, another research group described an approach to “unrestricted relation discovery” that was tested on a collection of 28,000 newswire articles. This early work contains the important idea of avoiding relation-specificity, but does not scale to the magnitude of the problem of extracting information from the entire Web, as explained below. Given a collection of documents, the prior system first clusters the entire set of newswire articles, partitioning the corpus into sets of articles believed to discuss similar topics. Within each cluster, named-entity recognition, co-reference resolution, and deep linguistic parse structures are computed and are then used to automatically identify relations between sets of entities. This use of “heavy” linguistic machinery would be problematic if applied to the Web, since the time required to extract information would be too great.
This earlier approach uses pairwise vector-space clustering and initially requires O(D^2) effort, where D is the number of documents. Each document assigned to a cluster is then subject to linguistic processing, potentially resulting in another pass through the entire set of input documents. This approach is far more expensive for large document collections than would be desirable and would likely not be practical for extracting information from a corpus of text the size of the Web.
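To make the quadratic growth concrete, the following minimal sketch counts the document pairs that a pairwise clustering step would have to compare, assuming one similarity computation per unordered pair. The function name pairwise_comparisons and the two larger corpus sizes are hypothetical and chosen only for illustration; only the 28,000-article figure comes from the discussion above.

```python
def pairwise_comparisons(num_docs: int) -> int:
    """Count unordered document pairs: D * (D - 1) / 2, which grows as O(D^2)."""
    return num_docs * (num_docs - 1) // 2


# 28,000 is the newswire collection size mentioned above; the larger values
# are hypothetical corpus sizes used only to show how the cost grows.
for d in (28_000, 1_000_000, 1_000_000_000):
    print(f"D = {d:>13,}: {pairwise_comparisons(d):,} pairwise comparisons")
```

Under this assumption, the 28,000-article collection already implies roughly 392 million comparisons, and each thousandfold increase in D multiplies the count by about a million.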
KNOWITALL is a previously developed Web extraction system that addresses the automation challenge by learning to label its own training examples using a small set of domain-independent extraction patterns. KNOWITALL also addresses corpus heterogeneity by relying on a part-of-speech tagger instead of a parser, and by not requiring an NER. However, KNOWITALL requires large numbers of search engine queries and Web page downloads to extract the desired information from a corpus such as the Web. As a result, experiments using KNOWITALL can take weeks to complete. Finally, KNOWITALL takes relation names as input. Thus, the extraction process has to be run, and re-run, each time a relation of interest is identified. It would therefore be desirable to employ a novel paradigm that retains KNOWITALL's benefits while substantially avoiding its inefficiencies.