Parsing unstructured local web queries is often tackled using simple syntactic rules that tend to be limited and brittle. Many search systems employ field-based query forms to support complex user needs and the underlying search algorithms are designed to utilize individual values in each field. Unstructured web queries therefore need to be parsed into field-based queries before being fed into the search systems. Semantic parsing in computational linguistics aims to convert natural language sentences into semantic frames consisting of a list of name and value pairs, as is discussed in Gildea, et al., “Automatic Labeling of Semantic Roles,” Computational Linguistics, 28(3):245-288 (2002) (“Gildea”) and Pradhan, et al., “Semantic Role Parsing: Adding Semantic Structure to Unstructured Text,” Proc. of ICDM (2003) (“Pradhan”). However, most approaches, if adopted for query parsing, require query level grammars or labeled data that are not always available.
For example, field-based search systems usually do not have labeled unstructured queries. In addition, supporting additional semantic classes requires nontrivial work to revisit previously labeled data. In practice, it is more common to have a large amount of logs for particular form fields or semantic classes. For example, one might want to extract search terms and geographic locations from web queries, but no such data set is available without nontrivial categorization and human labeling. Instead, most local search web sites (e.g., yellowpages.com and citysearch.com), as well as others, e.g., local search engines like Yahoo! Local Search and social local sites like Yelp.com or Qype, have query logs for each semantic class. A major challenge relates to building robust parsers while using field-based logs to overcome the lack of query level grammars and labels.
Geographic queries constitute a large portion of general web queries. Although correctly parsing geographic queries is useful for query formulation in both general web search and local search, most previous work, such as is discussed in Martins, et al., “Handling Locations in Search Engine Queries,” Proc. of GIR (2006) (“Martins”) and Guillen, “GeoCLEF2007 Experiments in Query Parsing and Cross Language GIR,” Working Notes of CLEF (2007) (“Guillen”), has used simple syntactic rules that tend to be limited and brittle.
Web search queries using natural language present problems for both natural language processing (“NLP”) and information retrieval (“IR”). Natural language researchers have developed various semantic parsers, including those discussed in Gildea and Pradhan noted above, as more semantic resources such as FrameNet and PropBank have become available, as discussed, respectively, in Baker, et al., “The Berkeley FrameNet Project,” Proc. of COLING/ACL (1998) and Kingsbury, et al., “Adding Semantic Annotation to the Penn Treebank,” Proc. of HLT (2002). Most semantic parsers focus on general domains or dialogue systems, such as are discussed, respectively, in Feng, et al., “Semantics-Oriented Language Understanding with Automatic Adaptability,” Proc. of NLP-KE (2003) and Bhagat, et al., “Shallow Semantic Parsing Despite Little Training Data,” Proc. of IWPT (2005).
However, such parsers cannot be directly applied to geographical web queries. Most IR research on query formulation has focused on developing interactive interfaces to facilitate query formulation, as discussed in Trigoni, “Interactive Query Formulation in Semi-Structured Databases,” Proc. of FQAS (2002), or on strategies that help refine queries, such as is discussed in Chen, et al., “Online Query Refinement on Information Retrieval Systems: A Process Model of Searcher/System Interactions,” Proc. of SIGIR (1990) and Hofstede, et al., “Query Formulation as an Information Retrieval Problem,” The Computer Journal, 39(4):255-274 (1996).
From the perspective of application, natural language queries/questions have been mainly used as interfaces for database systems, as discussed in Kupper et al., “NAUDA: A Cooperative Natural Language Interface to Relational Databases,” SIGMOD Record, 22(2):529-533 (1993), Androutsopoulos et al., “Natural Language Interfaces to Databases—An Introduction,” Journal of Language Engineering, 1(1):29-81 (1995), Popescu et al., “Modern Natural Language Interfaces to Databases: Composing Statistical Parsing with Semantic Tractability,” Proc. of COLING (2004), Li et al., “NaLIX: An Interactive Natural Language Interface for Querying XML,” Proc. of SIGMOD (2005), and Delden, et al., “Retrieving NASA Problem Reports: A Case Study in Natural Language Information Retrieval,” Data & Knowledge Engineering, 48(2):231-246 (2004) or for automatic question answering systems, as discussed in Chu-Carroll et al. “A Hybrid Approach to Natural Language Web Search,” Proc. of EMNLP (2002).
Recently, interest in geographical query parsing, driven especially by the high demand for mobile search, has resulted in the development of the geographic query parsing track in GeoCLEF. Most of the reported work concentrates on pattern analysis using simple syntactic rules, as is discussed in Gravano, et al., “Categorizing Web Queries According to Geographical Locality,” Proc. of CIKM (2003), Jones, et al., “Geographic Intention and Modification in Web Search,” International Journal of Geographical Information Science (IJGIS), Vol. 22, pp. 229-246 (2008), Gan, et al., “Analysis of Geographic Queries in a Search Engine Log,” Proc. of the First International Workshop on Location and the Web (2008), and Martins and Guillen. Semantic tagging of web queries, as discussed in Manshadi, et al., “Semantic Tagging of Web Search Queries,” Proc. of ACL-IJCNLP (2009) and Li, et al., “Extracting Structured Information from User Queries with Semi-Supervised Conditional Random Fields,” Proc. of SIGIR (2009), most closely relates to the disclosed subject matter of the present application.
A so-called local search can involve specialized Internet search engines that allow users to submit geographically constrained searches, usually against a structured database of local business listings. Typical local search queries include not only information about “what” the site visitor is searching for (such as keywords, a business category, or the name of a consumer product) but also “where” information, such as a street address, city name, postal code, or geographic coordinates like latitude and longitude. Examples of local searches include “Hong Kong hotels”, “Manhattan restaurants”, and “Dublin Hertz.”
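By way of illustration only, the kind of simple syntactic rule criticized above can be sketched as follows. This is a minimal sketch under stated assumptions, not a parser from any of the cited works: the gazetteer KNOWN_LOCATIONS and the function parse_local_query are hypothetical names introduced here. The rule looks for a known location at the start or end of the query and treats the remainder as the “what” field.

```python
# Hypothetical, deliberately tiny gazetteer of location strings (assumption).
KNOWN_LOCATIONS = {"hong kong", "manhattan", "dublin"}


def parse_local_query(query):
    """Split a query into 'what' and 'where' fields with a naive rule.

    The rule only recognizes a known location that appears as a prefix or
    suffix of the query; longer candidate locations are tried first.
    """
    tokens = query.lower().split()
    for size in range(len(tokens) - 1, 0, -1):
        head = " ".join(tokens[:size])
        if head in KNOWN_LOCATIONS:
            return {"what": " ".join(tokens[size:]), "where": head}
        tail = " ".join(tokens[-size:])
        if tail in KNOWN_LOCATIONS:
            return {"what": " ".join(tokens[:-size]), "where": tail}
    # No rule fired: everything falls into the 'what' field.
    return {"what": " ".join(tokens), "where": ""}


if __name__ == "__main__":
    for q in ["Hong Kong hotels", "Manhattan restaurants", "Dublin Hertz",
              "hotels near Paris"]:
        print(q, "->", parse_local_query(q))
```

The three example queries above are split as expected, but a query such as “hotels near Paris” yields an empty “where” field because the location is not in the list, which is one way such rules prove limited and brittle.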
As discussed in Lafferty, et al., “Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data,” Proc. ICML (2001), conditional random fields can be used as a framework for building probabilistic models to segment and label sequence data, offering advantages over hidden Markov models and stochastic grammars, which have been used in linguistics for a wide variety of problems in text and speech processing, including topic segmentation, part-of-speech (POS) tagging, information extraction, and syntactic disambiguation, such as is discussed in Manning, et al., “Foundations of Statistical Natural Language Processing,” Cambridge Mass.: MIT Press (1999).
Conditional random fields can relax strong independence assumptions made in those models, and also avoid limitations of maximum entropy Markov models (“MEMMs”) and other discriminative Markov models based on directed graphical models, which can, e.g., be biased towards states having few successor states. Hidden Markov models (“HMMs”) and stochastic grammars are generative models that, e.g., assign a joint probability to paired observation and label sequences and are typically trained to maximize the joint likelihood of training examples. Representing multiple interacting features or long-range dependencies of the observations is impractical in such models, since the inference problem for them becomes intractable.
Maximum entropy Markov models (“MEMMs”) are conditional probabilistic sequence models that also attain all of the above-noted advantages, as discussed in McCallum, et al., “Maximum entropy Markov models for information extraction and segmentation,” Proc. ICML 2000 (pp. 591-598), Stanford, Calif. (2000). However, MEMMs and other non-generative finite-state models based on next-state classifiers, such as discriminative Markov models, discussed in Bottou, L., “Une Approche Theorique de L'apprentissage Connexionniste: Applications a la Reconnaissance de la Parole,” Doctoral Dissertation, Universite de Paris XI (1991), suffer from a label bias problem, a so-called “conservation of score mass,” as stated by Bottou, which biases the model toward states with fewer outgoing transitions.
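The label bias problem can be illustrated with a toy numerical sketch, loosely patterned after the “rib”/“rob” illustration in Lafferty. The states, transitions, and scores below are hypothetical values chosen only for illustration. The sketch compares per-state (local) normalization, as in a next-state classifier, with normalization over whole paths (global), as in a conditional random field.

```python
import math

# Hypothetical toy model: two paths through the states, one meant to read
# "rib" (A1, A2) and one meant to read "rob" (B1, B2).
SUCCESSORS = {
    "start": ["A1", "B1"],
    "A1": ["A2"],   # only one outgoing transition
    "B1": ["B2"],   # only one outgoing transition
    "A2": ["end"],
    "B2": ["end"],
}
PATH_A = ["start", "A1", "A2", "end"]
PATH_B = ["start", "B1", "B2", "end"]
PATHS = [PATH_A, PATH_B]

# Raw scores for (previous state, next state, observed character).
# Entries not listed default to 0.0. All values are assumptions.
SCORES = {
    ("start", "B1", "r"): 1.0,   # path B is favored before any evidence
    ("A1", "A2", "i"): 2.0,
    ("A1", "A2", "o"): -2.0,
    ("B1", "B2", "o"): 2.0,
    ("B1", "B2", "i"): -2.0,
}


def score(prev, nxt, obs):
    return SCORES.get((prev, nxt, obs), 0.0)


def local_prob(path, word):
    """Path probability with per-state normalization (next-state classifier)."""
    prob = 1.0
    for prev, nxt, ch in zip(path, path[1:], word):
        z = sum(math.exp(score(prev, cand, ch)) for cand in SUCCESSORS[prev])
        prob *= math.exp(score(prev, nxt, ch)) / z
    return prob


def global_prob(path, word):
    """Path probability with a single normalization over all paths."""
    def total(p):
        return sum(score(a, b, ch) for a, b, ch in zip(p, p[1:], word))
    z = sum(math.exp(total(p)) for p in PATHS)
    return math.exp(total(path)) / z


if __name__ == "__main__":
    for word in ("rib", "rob"):
        print(word, "local A:", round(local_prob(PATH_A, word), 3),
              "global A:", round(global_prob(PATH_A, word), 3))
```

Under local normalization, states A1 and B1 each have a single successor, so they pass all of their probability mass forward regardless of the observed character; the path probability is fixed by the first transition alone, and path B is preferred even for “rib.” Under global normalization, the low score for the mismatched character reduces the whole path's weight, so path A is preferred for “rib” and path B for “rob.”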