Two aspects of contemporary search engines are a natural language user interface for entering flexible and user-friendly queries and semantic search, which interprets user intent and recognizes the contextual meaning of search terms. Semantic search and question answering systems and methodologies in which queries may be entered as natural language phrases have been developed over the past decades and have resulted in numerous general purpose, as well as content specific and information source specific, desktop and mobile semantic search and question answering engines.
Semantic search portals include Bing, Google Search with Knowledge Graph, Facebook Graph Search, International Digital Media Archive, Legal Intelligence, SILVIA (for images), Thinkglue (for video), Wolfram Alpha, etc., while mobile and desktop semantic search utilities include the embedded search in Windows Explorer, Apple Siri, Google Search for Android, Amazon Alexa, Copernic and other engines. For example, searching a Documents folder on a Windows PC with the natural language search option enabled allows a user to find files that satisfy natural language queries, such as “images last week” or “large pdf”, which pertain to the type of personal content, the size, the creation and update time of items and other content parameters.
Semantic search methodologies include operations with advanced types of metadata, such as RDF path traversal or OWL inference using the World Wide Web Consortium's specifications for the Resource Description Framework and the Web Ontology Language (both are considered elements of the Semantic Web), Keyword to Concept Mapping, various methods of fuzzy logic, Explicit Semantic Analysis (ESA), the Generalized Vector Space Model (GVSM), etc. Expanding the boundaries of semantic search, improving its efficiency and adapting semantic search technologies to a growing set of applications remain pressing tasks for academic and industry researchers and for technology companies.
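As an illustration of one of the listed methodologies, the following is a minimal sketch of the idea behind Explicit Semantic Analysis: a text is represented as a weighted vector over named concepts, and two texts are compared by the cosine of their concept vectors. The three-entry concept corpus, the whitespace tokenizer and all function names are assumptions made for this example only; a practical system would derive concepts from a large collection such as an encyclopedia and would use a more careful weighting scheme.

```python
import math
from collections import Counter

# Toy concept corpus, invented for illustration: concept name -> describing text.
CONCEPTS = {
    "Jaguar (animal)": "jaguar big cat predator jungle animal",
    "Jaguar (car)": "jaguar car engine luxury vehicle",
    "Python (language)": "python programming language code interpreter",
}


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_index(concepts):
    """Build an inverted index: word -> {concept: TF-IDF weight}."""
    n = len(concepts)
    term_counts = {c: Counter(tokenize(t)) for c, t in concepts.items()}
    # Document frequency: in how many concept texts each word occurs.
    doc_freq = Counter(w for counts in term_counts.values() for w in counts)
    index = {}
    for concept, counts in term_counts.items():
        for word, tf in counts.items():
            idf = math.log(n / doc_freq[word])
            index.setdefault(word, {})[concept] = tf * idf
    return index


def concept_vector(text, index):
    """Represent a text as a weighted vector over concepts."""
    vec = Counter()
    for word, tf in Counter(tokenize(text)).items():
        for concept, weight in index.get(word, {}).items():
            vec[concept] += tf * weight
    return vec


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dict-like objects."""
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


index = build_index(CONCEPTS)
query = concept_vector("luxury car engine", index)
print(max(query, key=query.get))  # the concept with the largest weight for the query
```

In this sketch, two texts that share no words can still be judged related when their words map to the same concepts, which is the property that distinguishes a concept-based representation from plain keyword matching.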
Three requirements for efficient creation and functioning of general purpose and specialized semantic search engines are building comprehensive and reliable training datasets, creating adequate models of extracted semantic information, and extracting semantic information from the datasets. Frequently used candidates for training datasets at the start of building semantic engines include WordNet, the Open Directory Project and other resources satisfying RDF standards, as well as the Lexical Markup Framework, the UNL Programme, etc.
Another popular source of comprehensive text collections is Wikipedia, currently available in 291 languages. Wikipedia provides a unified structure for articles and for the internal links within them. For the most part, Wikipedia articles have high quality content; articles of questionable quality, objectivity or completeness are normally marked with editorial prefixes, which makes it easy to automatically identify and exclude such articles from a dataset. Additionally, the history of creation and editing of an article may offer supplementary evidence for assessing the validity of the article. Expanding the corpora of reference materials to Wikipedia articles, known as the wikification technique, has already proven fruitful in various Natural Language Processing studies. One recent example included supervised learning on anchor texts in Wikipedia for the Named Entity Disambiguation task under the entity linking approach.
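The exclusion of articles marked with editorial prefixes can be sketched as follows. The sketch assumes that such prefixes appear as template tags of the form `{{Name}}` or `{{Name|parameters}}` at the very start of an article's wiki markup. The set of flagged template names, the sample articles and the function names are hypothetical and chosen only for illustration; an actual pipeline would need a maintained list of relevant templates and a proper markup parser, since this regular expression does not handle nested templates.

```python
import re

# Hypothetical set of editorial template names that mark an article as questionable.
FLAG_TEMPLATES = {"pov", "unreferenced", "cleanup", "disputed"}

# One simple (non-nested) template at the current position: {{Name}} or {{Name|params}}.
_LEADING_TEMPLATE = re.compile(r"\s*\{\{([^{}|]+)(?:\|[^{}]*)?\}\}")


def leading_templates(wikitext):
    """Return the lowercased names of the templates that open the article."""
    names = []
    pos = 0
    while True:
        match = _LEADING_TEMPLATE.match(wikitext, pos)
        if not match:
            break
        names.append(match.group(1).strip().lower())
        pos = match.end()
    return names


def is_flagged(wikitext):
    """True if any leading template is in the flagged set."""
    return any(name in FLAG_TEMPLATES for name in leading_templates(wikitext))


def filter_articles(articles):
    """Keep only the articles that carry no flagged editorial prefix."""
    return {title: text for title, text in articles.items() if not is_flagged(text)}


# Sample articles invented for this example.
articles = {
    "A": "{{POV|date=May 2015}}\nSome contested text.",
    "B": "'''B''' is a plain article.",
    "C": "{{Infobox thing}}\n{{Unreferenced}}\nBody without sources.",
}
print(sorted(filter_articles(articles)))
```

Only templates at the start of the markup are inspected here, which matches the notion of a prefix; tags placed further down in the body of an article would need a separate pass.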
Notwithstanding significant progress in utilizing various linguistic corpora for training, the problem of isolating semantic textual units in training datasets remains largely open. For example, systematic usage of wikification for semantic search has been limited to superficial studies that treated Wikipedia articles as a whole and ignored the significant noise created by this approach.
Accordingly, it is desirable to develop mechanisms for building large training datasets for semantic search using various information sources.