Much information exists in an unstructured form. For example, information may be found in the form of prose, contained in a text document or in the text portion of a Hypertext Markup Language (HTML) document. One type of information-processing task is to extract structured information, such as unary relations, from an unstructured document. Examples of unary relations are “is_a_painter,” “is_a_researcher,” “is_a_camera,” etc. A document might contain words and phrases suggesting that the entity “John_Smith” is a painter (e.g., “We admired the oil paintings by John Smith . . . ”). Based on the existence of these words, a classifier might extract the unary relation is_a_painter(John_Smith) from the document. Intuitively, a unary relation is like a label. To extract the relation is_a_painter(John_Smith) from a document is to say that “painter” is (or may be) an appropriate label for the entity “John_Smith.” Being able to mine this type of information from document collections may help to answer complex queries, such as “painters whose work adorns the halls of the Metropolitan Museum.” Answering this query might involve obtaining a list of painters, so unary relation extraction can assist with identifying those entities that are painters.
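The idea that a unary relation acts like a label can be illustrated with a minimal sketch. It assumes extracted relations are stored as (relation, entity) pairs; the function name index_by_relation and the entities other than John_Smith are hypothetical and serve only as illustration.

```python
from collections import defaultdict

# Hypothetical output of a relation extractor: (relation, entity) pairs.
# Only is_a_painter(John_Smith) comes from the text; the rest are made up.
extracted = [
    ("is_a_painter", "John_Smith"),
    ("is_a_researcher", "Jane_Doe"),
    ("is_a_painter", "Mary_Jones"),
]

def index_by_relation(pairs):
    """Group entities under each unary relation, treating the relation as a label."""
    index = defaultdict(set)
    for relation, entity in pairs:
        index[relation].add(entity)
    return index

labels = index_by_relation(extracted)

# First step toward answering a query about painters: list the entities
# that carry the "is_a_painter" label.
painters = sorted(labels["is_a_painter"])
```

A query such as the museum example would then filter this list of painters further, using whatever additional information is available about each entity.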
One way to extract unary relations from documents is to recognize an entity name in the document and to make inferences about the entity from some context, such as the words surrounding the entity name. So if “John Smith” is recognized as an entity name, then the phrase “paintings by John Smith are oil-on-canvas” strongly suggests that John Smith is a painter. A classifier might extract the unary relation is_a_painter(John_Smith) from a single document based on the presence of this statement in that document. However, some statements suggest only weakly that John Smith is a painter. For example, the statement “John Smith's work is in the museum” might mean that John Smith is a painter, but might also mean that he is a sculptor, or a paleontologist, or the museum's accountant. Similarly, other documents might say “John Smith works with oil” (which might imply a painter or an auto mechanic), or “John Smith works with brushes” (which might imply a painter or a hairdresser). While each of these statements says something about whether John Smith is a painter, it is difficult to conclude that John Smith actually is a painter from a document that contains one of these weak statements but no stronger statement.
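The difference between strong and weak contextual statements can be sketched as follows. This is an illustration only, not a description of any particular classifier: the cue lists, the "{e}" placeholder convention, and the function names cue_strength and extract_is_a_painter are all assumptions, and the entity name is taken as already recognized.

```python
# Hypothetical context cues; "{e}" stands for the recognized entity name.
STRONG_CUES = ["paintings by {e}"]
WEAK_CUES = [
    "{e}'s work is in the museum",
    "{e} works with oil",
    "{e} works with brushes",
]

def cue_strength(text, entity):
    """Return "strong", "weak", or "none" for the painter evidence found in text."""
    lowered = text.lower()

    def matches(cues):
        # Case-insensitive substring match of each cue with the entity filled in.
        return any(cue.format(e=entity).lower() in lowered for cue in cues)

    if matches(STRONG_CUES):
        return "strong"
    if matches(WEAK_CUES):
        return "weak"
    return "none"

def extract_is_a_painter(text, entity):
    """Single-document rule: extract the relation only on strong evidence."""
    return cue_strength(text, entity) == "strong"
```

Under this rule, a document containing only “John Smith works with oil” yields no extraction, which mirrors the difficulty described above: the weak statement carries some evidence, but not enough on its own to support the label.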