This specification relates to extracting semantic classes and corresponding instances from a collection of text.
A conventional knowledge base is a type of database used to collect, organize, and retrieve particular types of information. Knowledge base generation has increased with the availability of large collections of text (e.g., Web documents) and data sources (e.g., query logs). Conventional extraction processes for unstructured text require a particular class of knowledge (e.g., a category of information) to be manually specified in advance. Typically, a small set of manually selected instances representative of the class of interest is input to train an extraction process for the collection of text. The extracted knowledge includes, for example, instances within the same class, attributes of the class, or relations that involve the class.
An instance in the text collection represents text (e.g., of Web documents) identified as being associated with the respective class. Instance-class relationships can be extracted using particular text patterns as templates. For example, an instance X and a class Y can follow the templates “X is a Y” or “Y such as X, X′, and/or X″”. Thus, “a beagle is a dog” indicates that “beagle” is an instance of the class “dog”. Similarly, “a dog such as a beagle, corgi, or keeshond” indicates that “beagle”, “corgi”, and “keeshond” are instances of the class “dog”. The manual selection of classes typically results in a small number of coarse-grained classes, e.g., Location, Person, or Organization. The knowledge base is generated using the extracted content corresponding to the manually specified classes.
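The template-based extraction described above can be illustrated with a minimal Python sketch. It is not the extraction process of this specification: the function name extract_pairs, the regular expressions, and the restriction to single-word class names are assumptions made only for illustration, and a practical system would likely rely on part-of-speech tagging or noun-phrase chunking rather than plain regular expressions.

```python
import re

# Template "X is a Y": optional article, single-word instance X, single-word class Y.
# Single-word matching is an illustrative assumption, not part of the specification.
IS_A = re.compile(r"\b(?:an?\s+)?([a-z]+)\s+is\s+an?\s+([a-z]+)", re.IGNORECASE)

# Template "Y such as X, X', and/or X''": single-word class Y, then a list of
# instances that runs until the next period or semicolon.
SUCH_AS = re.compile(r"\b(?:an?\s+)?([a-z]+)\s+such\s+as\s+([^.;]+)", re.IGNORECASE)

# Separators between listed instances: commas, optionally followed by a
# conjunction, or a bare conjunction surrounded by whitespace.
LIST_SEP = re.compile(r",\s*(?:and/or\s+|and\s+|or\s+)?|\s+(?:and/or|and|or)\s+")

# Leading article to strip from each listed instance.
ARTICLE = re.compile(r"^(?:an?|the)\s+", re.IGNORECASE)


def extract_pairs(text):
    """Return a set of (instance, class) pairs matched by the two templates."""
    pairs = set()
    for match in IS_A.finditer(text):
        pairs.add((match.group(1).lower(), match.group(2).lower()))
    for match in SUCH_AS.finditer(text):
        class_name = match.group(1).lower()
        for part in LIST_SEP.split(match.group(2)):
            instance = ARTICLE.sub("", part.strip()).lower()
            if instance:
                pairs.add((instance, class_name))
    return pairs


if __name__ == "__main__":
    sample = "A beagle is a dog. A dog such as a beagle, corgi, or keeshond."
    for instance, class_name in sorted(extract_pairs(sample)):
        print(f"{instance} -> {class_name}")
```

Applied to the two example sentences above, the sketch pairs “beagle”, “corgi”, and “keeshond” with the class “dog”. Because it matches surface patterns only, it would also produce spurious pairs on text that happens to fit a template without expressing an instance-class relationship.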