Search engines are the current user interface to the Internet. Users often rely heavily on the ability of search engines to provide relevant results. Semantic search techniques aim to improve upon traditional search algorithms by incorporating semantics (meaning), that is, by understanding search intent and contextual meaning, in order to produce more relevant search results.
It is known to classify documents by their contents, if the contents are structured. Documents are classified with respect to pre-defined classes in a supervised setting, where the documents are first machine-annotated and then classified using a combination of supervised and unsupervised learning. Similarly, U.S. Pat. No. 7,756,800 to Chidlovskii teaches a method and system for classifying documents based on instances of various structured elements within them.
However, to enable semantic search for unstructured documents, it can be necessary to have tools that can extract structured data from these documents. Unfortunately, extracting meaning from documents that do not provide annotations is an extremely challenging task. This task is particularly challenging, for example, when extracting semantic information from a company's price list (e.g., a restaurant menu) provided as a PDF document or an image. Without semantic annotations, it is difficult to determine which text entries refer to section titles, dish names, descriptions, or specific annotations.
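By way of illustration only, the difficulty can be seen in a minimal sketch of a naive rule-based labeler applied to text lines extracted from a menu. The rules, the label names, the function `label_line`, and the sample lines are all assumptions introduced for this example; they do not represent any method described in the prior art cited above.

```python
import re

# Hypothetical price pattern: optional currency symbol, digits, and an
# optional two-digit decimal part, anchored at the end of the line.
PRICE_RE = re.compile(r"[$€£]?\s*\d+(?:[.,]\d{2})?\s*$")


def label_line(line: str) -> str:
    """Guess a semantic label for one line of extracted menu text."""
    text = line.strip()
    if not text:
        return "blank"
    if PRICE_RE.search(text):
        # Assumption: a line ending in a number is a dish with its price.
        return "dish"
    if text.isupper():
        # Assumption: section titles are set in all capitals.
        return "section_title"
    # Everything else is treated as a description.
    return "description"


# Illustrative lines; not taken from any real menu.
SAMPLE_MENU = [
    "STARTERS",
    "Tomato Soup 6.50",
    "Served with toasted sourdough",
    "Chef's Specials",
    "Open daily until 10",
]

if __name__ == "__main__":
    for line in SAMPLE_MENU:
        print(f"{label_line(line):14s} {line}")
```

In this sketch the first three lines receive plausible labels, but "Chef's Specials" (a title not set in capitals) is labeled as a description, and "Open daily until 10" (an opening-hours notice) is labeled as a dish because it ends in a number. Such surface rules break as soon as layout conventions vary between documents, which is why unannotated price lists are hard to interpret.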