Search engines are the current user interface to the Internet. Users often rely heavily on the ability of search engines to provide relevant results. Semantic search aims to improve upon traditional search algorithms, such as Google's PageRank, by incorporating semantics (meaning) to produce more relevant search results through an understanding of search intent and contextual meaning.
It is known to classify documents by their contents, if the contents are structured. For example, U.S. Pat. No. 5,676,710 to Lewis teaches a method and apparatus for training a text classifier. Documents are classified with respect to pre-defined classes in a supervised setting, where the documents are first machine-annotated and then classified using a combination of supervised and unsupervised learning. Similarly, U.S. Pat. No. 7,756,800 to Chidlovskii teaches a method and system for classifying documents based on instances of various structured elements within them.
However, to enable semantic search for unstructured documents, it is essential to have tools that can extract structured data from these documents. Unfortunately, extracting meaning from documents that do not provide annotations is an extremely challenging task. This task is particularly challenging, for example, when extracting semantic information from a restaurant menu provided as a PDF document or an image. Without semantic annotations, it is difficult to determine which text entries refer to section titles, dish names, descriptions, or specific annotations.
Previous work in this area known to the inventors has relied on supervised learning techniques that attempt to create models that can classify items based on carefully annotated data sets. U.S. Pat. No. 7,756,807 to Komissarchik et al. teaches methods that extract facts from unstructured documents, such as web pages. These facts include the title of the page, an article body, section headers, names of people and companies, and so on. Undesirably, this approach suffers from many false positives and false negatives (e.g., misclassifying items as sections) because it relies solely on the content and context provided by the extracted text. In fact, due to the varying nature of documents such as menus, approaches that rely solely on automated machine learning suffer from some form of false positives and false negatives.
To the extent that information can be extracted from such documents, it may be stored in an intermediate representation. For example, U.S. Pat. No. 7,685,083 to Fairweather describes a system for converting unstructured data into a normalized form. The data are tied to a system ontology that can be ‘mined’ for information.