This disclosure relates to information extraction.
The Internet provides access to a wide variety of resources, e.g., video and/or audio files, web pages for particular subjects, news articles, and so on. Resources of particular interest to a user can be identified by a search engine in response to a user query. The user query includes one or more search terms, and the search engine uses these terms to identify documents that are responsive to the user query.
The semantics related to the words in a language can be used to derive semantic relations among the words. A search engine can use these semantic relations as an aid in finding documents or terms that are related to the terms of the user query. In one framework, semantic concepts are labeled according to classes, with each class representing a particular semantic concept. For example, the semantic concept of painkillers can be represented by the class of the same name. Each class has one or more instances, where an instance is an object that belongs to the class. For example, the class “painkillers” includes the instance vicodin and other drugs that are typically classified as painkillers. Each instance, in turn, can have one or more attributes, each of which describes a quality or characteristic of the instance. Knowing what attributes are associated with the instance described by the search term (e.g., whether “cost” or “side effects” is associated with “vicodin”) can help the search engine in the search process.
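The class, instance, and attribute relationship described above can be pictured as a small data structure. The following is a minimal sketch, not a structure defined by this disclosure; the names `SemanticClass`, `Instance`, `add_instance`, and `has_attribute` are hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Instance:
    """An object that belongs to a class, e.g. "vicodin" (hypothetical type)."""
    name: str
    attributes: set[str] = field(default_factory=set)


@dataclass
class SemanticClass:
    """A labeled semantic concept, e.g. "painkillers" (hypothetical type)."""
    label: str
    instances: dict[str, Instance] = field(default_factory=dict)

    def add_instance(self, name, attributes=()):
        # Register an instance together with its known attributes.
        self.instances[name] = Instance(name, set(attributes))

    def has_attribute(self, instance_name, attribute):
        # True only if the instance is known and carries the attribute.
        instance = self.instances.get(instance_name)
        return instance is not None and attribute in instance.attributes


# Example values taken from the text above.
painkillers = SemanticClass("painkillers")
painkillers.add_instance("vicodin", ["cost", "side effects"])
```

With such a structure, a search engine could check whether a query term such as “side effects” is a known attribute of an instance before deciding how to interpret the query.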
Various information retrieval frameworks derive attributes from text. Such a framework identifies information instances (e.g., “cloxacillin”), which are semantic objects that belong to specific semantic concepts, as well as information classes (e.g., “antibiotics”). To extract attributes, a conventional method can submit list-seeking queries that describe the instance or class (e.g., “cloxacillin” or “antibiotics”) as search terms to general-purpose web search engines and analyze the documents retrieved in response to the queries. Common structural patterns (e.g., HyperText Markup Language (HTML) structures) in the retrieved documents are then used to extract the attributes of the information instance or class.
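The structural-pattern step of such a conventional method might look roughly like the sketch below. It rests on assumptions that the text does not state: that the patterns of interest are table header cells (`<th>`) and definition terms (`<dt>`), and that a label counts as a candidate attribute when it appears in at least `min_documents` retrieved documents. The names `LabelCollector` and `candidate_attributes` are hypothetical, the sample documents are made up, and the query submission and retrieval steps are omitted.

```python
from collections import Counter
from html.parser import HTMLParser


class LabelCollector(HTMLParser):
    """Collects the text of <th> and <dt> elements from one HTML document."""

    LABEL_TAGS = {"th", "dt"}  # assumed structural patterns

    def __init__(self):
        super().__init__()
        self.labels = []
        self._open_tag = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.LABEL_TAGS:
            self._open_tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._open_tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            # Collapse whitespace and lowercase so variants compare equal.
            text = " ".join("".join(self._buffer).split()).lower()
            if text:
                self.labels.append(text)
            self._open_tag = None


def candidate_attributes(documents, min_documents=2):
    """Return labels that occur in at least min_documents documents."""
    counts = Counter()
    for html in documents:
        collector = LabelCollector()
        collector.feed(html)
        collector.close()
        counts.update(set(collector.labels))  # count each label once per document
    return sorted(label for label, n in counts.items() if n >= min_documents)


# Made-up stand-ins for documents retrieved for the query "cloxacillin".
docs = [
    "<table><tr><th>Side effects</th><td>nausea</td></tr>"
    "<tr><th>Cost</th><td>low</td></tr></table>",
    "<dl><dt>Side  Effects</dt><dd>rash</dd>"
    "<dt>Brand names</dt><dd>example</dd></dl>",
]
print(candidate_attributes(docs))
```

Here only “side effects” appears as a label in both sample documents, so it is the only candidate returned at the default threshold.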