The invention relates generally to computational semantic analysis. More specifically, the invention relates to a method and system that automatically maps text content to entities defined in an ontology.
It has become increasingly difficult to locate desired information on the Internet or in content management systems. With an ever-increasing amount of content, search engines are heavily relied on to search for information and documents. However, existing search tools struggle: keyword-based searches often return results with low precision and recall.
With the emergence of the Social Web, user-generated tags can be used by search engines or users to improve search results. Tagging helps users to describe, find and organize content. Well-defined tags give brief information about the content of tagged documents. Therefore, tags provide a practical way to determine whether a document is of interest without reading it. Users or search engines can use tags to mitigate the issues of low precision and recall.
Content tagging or collective tagging can improve search results by allowing search engines to exploit tags from the wisdom of crowds. However, the improvement is limited because tags are: (1) free from context and form, (2) user generated, (3) used for purposes other than description, and (4) often ambiguous. Since tagging is a voluntary action, some documents are not tagged at all. Furthermore, interpretation of the tags associated with tagged documents also remains a challenge.
To facilitate better content management and search, current content tagging systems need improvement. Current tags usually fail to capture the exact meanings and contexts of keywords because of polysemy. Human language is ambiguous: words may have different meanings according to the context in which they are used. Moreover, content taggers may use noisy and misleading tags, such as tags that are subjective (e.g., cool, nice), misspelled, or unrelated to the content. In addition, tagging requires extra effort, which makes the process expensive and time-consuming. Therefore, most content authors do not assign metadata to their documents. It is estimated that 80% of business documents are maintained in an unstructured format.
The above limitations motivated research into automated semantic tagging systems. Automatic tagging systems can analyze given documents and offer significant terms as tags without user intervention. Automatic tagging offers advantages such as accuracy, consistency, standardization, convenience and decreased cost. Unless tags are represented in a computer understandable and processable way, automatic tagging systems return errors.
Ontologies are a key enabling technology for the Semantic Web. The assignment of ontological entities (terms linked to one another by defined relationships) to content is called Semantic Tagging or Semantic Annotation. Semantic tags give content well-defined meaning, and can automate content tagging and search with more accurate and meaningful results. Semantic tagging advances automatic tagging by providing more meaningful tags in the form of ontological entities instead of context-less keywords, and by making content and tags computer-understandable.
As a formal declarative knowledge representation model, an ontology provides a foundation upon which machine-understandable knowledge can be obtained and tagged, and as a result, it makes semantic tagging and search possible. The performance of a semantic tagging and search application is highly dependent on its ontology. A term within content can be semantically tagged and retrieved if it is properly defined in the ontology. Common ontological knowledge bases such as WordNet and Wikipedia can be used for this purpose, but they have limitations.
UNIpedia, developed by Siemens Corporate Research, serves as a high quality, comprehensive, up-to-date, domain independent resource for semantic applications. UNIpedia uses WordNet as its backbone ontology, and maps instances from other knowledge bases to WordNet concepts by introducing an isA relationship between them. By combining WordNet, Wikipedia and OpenCyc, the current version of UNIpedia consists of 2,242,446 terms, 74,390 concepts and 1,491,902 instances.
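The idea of attaching instances to backbone concepts through an isA relationship can be sketched as follows. This is a minimal illustration only and is not UNIpedia's actual schema or API: the `Ontology` class, its method names, and the sample concepts and instance are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    # concept -> parent concept (hypothetical backbone standing in for WordNet)
    concept_parent: dict = field(default_factory=dict)
    # instance -> the concept it is attached to through an isA link
    instance_isa: dict = field(default_factory=dict)

    def add_concept(self, concept, parent=None):
        self.concept_parent[concept] = parent

    def add_instance(self, instance, concept):
        # An instance may only be mapped to a concept the backbone already defines.
        if concept not in self.concept_parent:
            raise KeyError(f"unknown concept: {concept}")
        self.instance_isa[instance] = concept

    def concepts_of(self, instance):
        """Return the concept the instance isA, followed by that concept's ancestors."""
        chain = []
        concept = self.instance_isa[instance]
        while concept is not None:
            chain.append(concept)
            concept = self.concept_parent[concept]
        return chain


# Illustrative data only (assumed, not taken from any real knowledge base).
onto = Ontology()
onto.add_concept("entity")
onto.add_concept("organization", parent="entity")
onto.add_concept("company", parent="organization")
onto.add_instance("Siemens", "company")
```

In this sketch, `onto.concepts_of("Siemens")` walks the isA link and then the concept hierarchy, so an instance drawn from one knowledge base inherits every broader concept of the backbone.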
There are three classes of semantic tagging systems: (1) manual, (2) semi-automatic, and (3) automatic.
In manual tagging systems, users tag documents with a controlled vocabulary defined in an ontology. Manual tagging is a time-consuming process that requires deep domain knowledge and expertise, and it is also prone to inconsistencies among human annotators. SemaLink is a manual semantic tagging system.
Semi-automatic systems analyze documents and offer ontological terms from which annotators may choose. Semi-automatic systems may use humans to disambiguate terms. Faviki is a semi-automatic tagging system that brings together social bookmarking and Wikipedia. It offers users DBpedia entities with which to tag web pages.
Automated semantic tagging systems analyze documents and automatically tag them with ontological concepts and instances. Zemanta is an automatic semantic tagging system that suggests content from various sources such as Wikipedia, YouTube, Flickr and Facebook. Zemanta disambiguates terms and maps them to a Common Tag ontology.
SemTag is another automatic tagging system. SemTag uses Taxonomy-Based Disambiguation (TBD) to disambiguate terms and maps documents to entities defined in an experimental knowledge base. That knowledge base is not comprehensive and consists of only 72,000 concepts.
Automatically mapping a polysemous word to an appropriate sense (meaning) according to its context is called Word Sense Disambiguation (WSD). WSD is a challenge in semantic tagging. There are three main approaches to WSD: (1) supervised, (2) unsupervised, and (3) knowledge-based.
Supervised approaches use sense-annotated data sets as training data for learning disambiguation patterns. Support Vector Machines (SVMs), Decision Trees and Neural Networks are widely used supervised WSD algorithms. In contrast, unsupervised systems use a raw corpus as training data to learn disambiguation patterns. Word Clustering and Co-occurrence Graphs are examples of unsupervised techniques. Both approaches require training data and are computationally expensive.
Knowledge-based approaches use structured resources such as Machine Readable Dictionaries (MRDs) and ontologies. These methods are preferred because of their wider coverage, despite their lower performance in comparison to machine learning approaches. There are three knowledge-based techniques in WSD: (1) sense definitions, (2) selectional restrictions, and (3) structural approaches.
Sense definitions disambiguate senses by comparing and counting the number of overlapping words between sense descriptions. Sense definitions are very sensitive to the words used in sense descriptions and perform poorly when compared to other knowledge-based algorithms.
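The overlap counting described above can be sketched as follows. This is a simplified illustration in the spirit of the classic "pine cone" example, not an implementation from any particular system; the sense identifiers, glosses, stopword list, and function names are all assumptions made for the example.

```python
from itertools import product

# Hypothetical sense descriptions (glosses), written for this example only.
GLOSSES = {
    "pine": {
        "pine#tree": "kind of evergreen tree with needle-shaped leaves",
        "pine#yearn": "waste away through sorrow or illness",
    },
    "cone": {
        "cone#shape": "solid body which narrows to a point",
        "cone#fruit": "fruit of certain evergreen tree",
    },
}

# Small assumed stopword list so that function words do not count as overlap.
STOPWORDS = {"a", "an", "the", "of", "to", "or", "and", "with"}


def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, and drop stopwords."""
    words = (w.strip(".,;:!?()").lower() for w in text.split())
    return {w for w in words if w and w not in STOPWORDS}


def gloss_overlap(gloss_a, gloss_b):
    """Count the distinct content words shared by two sense descriptions."""
    return len(tokenize(gloss_a) & tokenize(gloss_b))


def disambiguate_pair(word_a, word_b, glosses=GLOSSES):
    """Return ((sense_a, sense_b), overlap) for the best-overlapping sense pair."""
    candidates = product(glosses[word_a].items(), glosses[word_b].items())
    scored = [((sa, sb), gloss_overlap(ga, gb)) for (sa, ga), (sb, gb) in candidates]
    return max(scored, key=lambda item: item[1])


best_pair, best_score = disambiguate_pair("pine", "cone")
```

Here the pair `pine#tree` and `cone#fruit` wins because the two glosses share "evergreen" and "tree". If the second gloss were reworded to say "conifer" instead, the overlap would drop to zero, which illustrates how sensitive this technique is to the exact wording of the sense descriptions.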
Selectional restrictions disambiguate senses by restricting possible meanings of senses according to their surrounding words. Selectional restrictions also exhibit low performance.
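A minimal sketch of this idea follows, assuming a lexicon in which each noun sense carries a semantic type and each verb lists the types it accepts for its direct object. The tables, type names, and the `allowed_senses` function are hypothetical and exist only to illustrate the technique.

```python
# Hypothetical lexicon: each sense of a noun is labeled with a semantic type.
SENSE_TYPES = {
    "dish": {"dish#food": "food", "dish#plate": "artifact"},
}

# Hypothetical restrictions: semantic types a verb accepts for its direct object.
OBJECT_RESTRICTIONS = {
    "eat": {"food"},
    "wash": {"artifact"},
}


def allowed_senses(verb, noun):
    """Return the senses of `noun` that satisfy the restriction imposed by `verb`."""
    senses = SENSE_TYPES[noun]
    allowed_types = OBJECT_RESTRICTIONS.get(verb)
    if allowed_types is None:
        # No restriction is known for this verb, so no sense can be ruled out.
        return sorted(senses)
    return sorted(s for s, t in senses.items() if t in allowed_types)
```

With these assumed tables, `allowed_senses("eat", "dish")` keeps only `dish#food`, while a verb that has no entry leaves both senses in place. That second case hints at why coverage of the restriction tables limits performance.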
Structural approaches disambiguate senses based on the semantic interrelationships of concepts. In a local context, semantic similarities between pairs of words are calculated according to similarity measures. The performance of structural approaches depends on the richness of the knowledge base in terms of its defined semantic interrelationships. Their performance is higher than that of the other knowledge-based approaches, but lower than that of supervised methods.
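One way to picture a structural approach is a path-based similarity over an isA taxonomy, where a candidate sense is preferred when it lies close to the concepts found in the local context. The sketch below assumes a toy taxonomy and a simple measure, 1 / (1 + path length); both the taxonomy and the measure are illustrative assumptions rather than the measure used by any specific system.

```python
# Hypothetical toy taxonomy as child -> parent (isA) links.
PARENT = {
    "organization": "entity",
    "financial_institution": "organization",
    "bank_finance": "financial_institution",
    "geographic_feature": "entity",
    "landform": "geographic_feature",
    "bank_river": "landform",
    "valley": "landform",
    "body_of_water": "geographic_feature",
    "river": "body_of_water",
}


def ancestors(node, parent=PARENT):
    """Return the path from `node` up to the root, including `node` itself."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def path_length(a, b, parent=PARENT):
    """Count isA edges between `a` and `b` through their lowest common ancestor."""
    depth_in_b = {n: i for i, n in enumerate(ancestors(b, parent))}
    for i, n in enumerate(ancestors(a, parent)):
        if n in depth_in_b:
            return i + depth_in_b[n]
    raise ValueError("concepts share no common ancestor")


def similarity(a, b):
    """Assumed measure: shorter taxonomy paths give higher similarity."""
    return 1.0 / (1.0 + path_length(a, b))


def pick_sense(candidate_senses, context_concepts):
    """Choose the candidate sense with the highest total similarity to the context."""
    return max(
        candidate_senses,
        key=lambda s: sum(similarity(s, c) for c in context_concepts),
    )


chosen = pick_sense(["bank_finance", "bank_river"], ["river", "valley"])
```

In this toy setting the river sense of "bank" is chosen because it sits four edges from "river" and two from "valley", whereas the financial sense is six edges from each. If the taxonomy lacked the links under `geographic_feature`, the two senses would be much harder to separate, which reflects the dependence on the richness of the knowledge base noted above.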
The limitations discussed above provide the motivation for an improved automated semantic tagging method and system.