1. Field
The present application relates generally to associating metadata with content, and, in particular, to automatic tagging of documents using tagging tools that may be trained in subject matter domains different from that of the documents to be tagged.
2. Related Art
Annotation of documents with semantic information, such as tags that categorize words and phrases in the documents into types, is known in the art. Existing computer-based systems such as search engines and databases generally focus on the literal text content of documents and data. However, semantic information, i.e., meaning, may also be attached to specific portions of text in those documents. For example, the word “Paris” in a document may be tagged, i.e., associated or annotated, with a tag that has the value “Location”. The tag may then be used in searches, e.g., so that the document can be presented as a search result in a search for locations.
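By way of illustration only, the “Paris” example can be sketched as a tag attached to a character span of a document, with a search that returns documents carrying a given tag. The names Annotation, Document, and search_by_tag are hypothetical and are not taken from any existing system; the sketch assumes tags are plain text strings.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    """A tag attached to the character span [start, end) of a document."""
    start: int
    end: int
    tag: str


@dataclass
class Document:
    text: str
    annotations: list = field(default_factory=list)


def search_by_tag(documents, tag):
    """Return the documents that have at least one annotation with the given tag."""
    return [d for d in documents if any(a.tag == tag for a in d.annotations)]


# "Paris" occupies characters 13 through 17 of the text below.
doc_a = Document("She moved to Paris last year.", [Annotation(13, 18, "Location")])
doc_b = Document("The meeting starts at noon.")

results = search_by_tag([doc_a, doc_b], "Location")
```

Here results contains only doc_a, because doc_b has no portion of text tagged “Location”, even though a literal text search for the word “Location” would match neither document.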
The documents to be annotated may be, for example, web pages and other types of textual content, e.g., the Wikipedia® online encyclopedia. The tags, which may be text strings, may be selected from a set of predefined tags such as categories, e.g., Person, Organization, Location, or any other descriptive label. Tagging may be performed by a human, but such hand-tagging is a slow and labor-intensive process. A tagging tool, referred to herein as a “tagger”, processes textual content by selecting tags that correspond to portions of the content. The term “tag”, used as a verb, refers both to the operation of selecting a tag to be associated with a portion of content, e.g., with one or more words, and to the operation of associating the tag with that portion of content. The tagger may be, for example, a statistical tagger that is trained using text documents that have associated typed tags. A tagger trained on particular content and associated tags may be used to tag other content. However, training data for statistical taggers is relatively scarce and tends to be concentrated in particular subject-matter domains such as news. The documents to be tagged are often in domains other than that of the training data. Therefore, it would be desirable to have an automatic system for tagging documents in domains other than those covered by the hand-generated training data.
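The domain limitation described above can be illustrated with a deliberately simple sketch. It assumes a most-frequent-tag baseline, which is a stand-in chosen for brevity and not any particular statistical tagger; the functions train and tag, the sample data, and the “O” label for untagged words are all hypothetical.

```python
from collections import Counter, defaultdict


def train(tagged_sentences):
    """Learn, for each word, the tag it most often carries in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag_value in sentence:
            counts[word][tag_value] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag(model, words, default="O"):
    """Tag each word with its learned tag; unseen words get the default label."""
    return [(w, model.get(w, default)) for w in words]


# Hand-tagged training data from a news-like domain (illustrative only).
news_training = [
    [("Obama", "Person"), ("visited", "O"), ("Paris", "Location")],
    [("Google", "Organization"), ("opened", "O"), ("in", "O"), ("Paris", "Location")],
]

model = train(news_training)
in_domain = tag(model, ["Google", "visited", "Paris"])
out_of_domain = tag(model, ["Aspirin", "inhibits", "COX-1"])
```

In this sketch, in_domain receives the tags Organization and Location for words seen in training, whereas every word in out_of_domain falls back to the default label because none of them appeared in the news-like training data.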