To return accurate results that match the user's intent, a search system must first ascertain the meaning underlying the input query. A tagger, such as one based on a Markov model or a conditional random field (CRF), can be used to tag each word in an unstructured query with a type. The CRF tagger annotates n-grams (e.g., individual words, substrings, or phrases) in the query with labels. For example, a CRF may label {u2 desire} as {BAND:u2 SONG:desire}. Individual words are then stitched together to form a canonical entity (e.g., a person, place, or thing), and each canonical entity is assigned a string value that gives the entity its value or meaning. The canonical entities derived from the input query are forwarded to a downstream infrastructure, which searches a content index using the values of the fields. Candidate documents obtained from the downstream infrastructure are surfaced to the user as search results.
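The tag-then-stitch flow described above can be illustrated with a minimal Python sketch. It is not a CRF: a small hand-written lookup table (`TOY_LABELS`, a hypothetical stand-in) plays the role of a trained tagger's per-token predictions, and the function names `tag_tokens` and `stitch_entities` are assumptions introduced only for illustration. The sketch shows how adjacent tokens with the same label might be merged into one entity with a string value.

```python
from itertools import groupby

# Hypothetical toy lookup standing in for a trained CRF tagger's predictions.
TOY_LABELS = {
    "u2": "BAND",
    "desire": "SONG",
    "pink": "BAND",
    "floyd": "BAND",
    "money": "SONG",
}


def tag_tokens(query):
    """Label each whitespace-separated token; unknown tokens get 'O' (outside)."""
    return [(tok, TOY_LABELS.get(tok, "O")) for tok in query.lower().split()]


def stitch_entities(tagged):
    """Merge runs of adjacent tokens that share a label into (label, value) pairs."""
    entities = []
    for label, run in groupby(tagged, key=lambda pair: pair[1]):
        if label == "O":
            continue  # untagged tokens do not form an entity
        entities.append((label, " ".join(tok for tok, _ in run)))
    return entities


print(stitch_entities(tag_tokens("u2 desire")))
# [('BAND', 'u2'), ('SONG', 'desire')]
print(stitch_entities(tag_tokens("pink floyd money")))
# [('BAND', 'pink floyd'), ('SONG', 'money')]
```

In this sketch the stitched `(label, value)` pairs correspond to the entities that would be forwarded to the downstream infrastructure; a real tagger would derive the labels from a trained model rather than a fixed table.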
Returning accurate and user-intended results can be difficult when the query contains implicit entity references (i.e., the entity is inferred) rather than explicit entity references (i.e., the entity is specified). Often, as with implicit entity references, the entity is embedded within the query. Misspelled entity references, as well as extraneous words, synonyms, nicknames, and alternate forms of a word, cause additional difficulties in returning user-intended results. It is estimated that well over half of all input queries are altered in some way from the correct name or description. Misspellings of named entities account for the most frequently altered queries. In other cases, the primary information is not even present in the actual query. In such cases, a conventional CRF-based entity tagger cannot identify the intended entity or retrieve content based on it. Generally, if relevant explicit terms are absent from a query, or the CRF has tagged irrelevant terms, the downstream infrastructure will have difficulty ascertaining which entity should be retrieved from web search indexes. An alternative approach is therefore needed for tagging words, correcting misspelled words, stitching words together and filling in the gaps, and canonicalizing words or entities. An improved system for processing entities, including implicit, non-canonical, and/or misspelled entity references, is desirable regardless of how the entity is referenced in the input query.