Given a large collection of documents, various applications benefit from filtering and identifying a relatively small subset of those documents in an appropriate way, from extracting certain entity information (e.g., words or phrases) from only the relevant documents, or from both. By way of example, consider that the collection to be processed comprises the documents on the web, which number on the order of billions. An example entity extraction task may be to identify mentions of book titles within the web pages, given a prepared list of the desired book titles.
The task of extracting entities is difficult when some of the entities in the provided list overlap significantly with entities in other domains, with the underlying language of the documents, or with both. For example, consider the movie “seven” (ignoring uppercase versus lowercase) among a list of movie titles to extract. Many documents contain the term “seven” yet have nothing to do with the movie, e.g., there are seven days in a week, the distance to a location is seven miles, and so on. This overlap makes it very difficult to distinguish relevant (“true”) mentions of such entities with respect to the domain from irrelevant (“false”) mentions.
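The overlap problem can be illustrated with a minimal sketch of naive list-based matching. The function name `find_mentions` and the sample documents are hypothetical, chosen only for illustration, and the case-insensitive whole-word matching is an assumption rather than a method described here.

```python
import re


def find_mentions(titles, documents):
    """Return (document_index, title) pairs for every case-insensitive,
    whole-word occurrence of a listed title in a document."""
    # Hypothetical helper: one compiled pattern per title in the entity list.
    patterns = {
        title: re.compile(r"\b" + re.escape(title) + r"\b", re.IGNORECASE)
        for title in titles
    }
    mentions = []
    for index, text in enumerate(documents):
        for title, pattern in patterns.items():
            if pattern.search(text):
                mentions.append((index, title))
    return mentions


# Illustrative inputs (assumed): only the last document refers to the movie.
titles = ["seven"]
documents = [
    "There are seven days in a week.",
    "The distance to the station is seven miles.",
    "We watched the movie Seven last night.",
]

mentions = find_mentions(titles, documents)
print(mentions)
```

All three sample documents are reported as mentions, although only the third is a true mention with respect to the movie domain. Matching against the list alone cannot separate true mentions from false ones.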
Further, there is generally very limited domain-based information, whether in terms of available training data, available classifiers for entity extraction tasks, or both. In general this is because extraction is desired for a wide variety of entity lists and over differing entity domains, and each domain requires a classifier trained with knowledge of that specific domain. Indeed, such data may be entirely absent for an entity list or domain. By way of example, there may not be a classifier available for an entity list comprising romantic movies. Even if one exists, running such a classifier over such a large document collection may not be practical, as a classifier tends to have a large amount of performance overhead.
Another difficulty arises from the large size of the underlying document collection, which limits the time that can be spent on each document for extraction purposes. The large size makes it impractical to first identify all mentions of entities over the entire document collection as an intermediate step and then remove false mentions in a subsequent step. The problem is even worse for entities that overlap with the underlying language of the documents, e.g., materializing mentions of “man” over web pages can yield millions of web page URLs, only a small fraction of which refer to a movie named “man.”