The exemplary embodiment relates to natural language processing and finds particular application in the processing of queries used to retrieve information from a database.
Query processing is widely performed as an initial step in various domains, such as information access and data analytics, to improve retrieval performance. Typically, the processing involves simple operations such as stop word removal and stemming. Many Natural Language Processing (NLP) tools, while widely used for processing complete sentences in a given natural language, are inappropriate for processing queries. Consider, for example, the French language query “coupe apollon.” With standard text processing techniques, “coupe” would likely be identified as a verb (corresponding to “cut” in English), whereas in the context of this query it should be tagged as a noun (“cup”). Full sentence analysis methods, while useful for sentences whose grammar and syntactic structure follow those of the given language, are not designed to cope with the freeform structure and misspellings often associated with queries.
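By way of illustration only, the simple preprocessing mentioned above (stop word removal followed by stemming) may be sketched as follows. The stop word list, the suffix list, and the function names naive_stem and preprocess_query are hypothetical and chosen for brevity; the suffix stripping shown is a naive stand-in and is not any particular published stemming algorithm.

```python
# Hypothetical, deliberately tiny stop word list; practical systems use
# larger, language-specific lists.
STOP_WORDS = {"a", "an", "the", "of", "in", "for", "to", "and"}

# Illustrative suffixes only; this is not a published stemming algorithm.
SUFFIXES = ("ing", "es", "s")


def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess_query(query):
    """Lowercase, split on whitespace, drop stop words, then stem."""
    tokens = query.lower().split()
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


if __name__ == "__main__":
    print(preprocess_query("the history of paintings"))
    print(preprocess_query("coupe apollon"))
```

Such preprocessing operates on each token in isolation. Applied to the query “coupe apollon,” it returns both tokens unchanged and provides no indication of whether “coupe” is a noun or a verb.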
To address the processing of queries, several approaches have been proposed for adapting some of the processing components to the structure of queries. Advanced parsing techniques that are able to treat queries as a collection of phrases rather than single terms have been proposed (see, for example, Josiane Mothe, et al., “Linguistic Analysis of Users' Queries: towards an adaptive Information Retrieval System,” Int'l Conf. on Signal-Image Technology & Internet-Based Systems, Shanghai, China, 2007). Morphological analyzers, chunkers, and named entity recognizers are also regarded as potentially useful tools in the development of a successful information access application.
However, prior attempts at query processing, often referred to as structural query annotation, have generally considered capitalization, named entity detection, part-of-speech (PoS) tagging, and query segmentation independently, and have each addressed only one of these issues. For example, named entity recognition has been considered independently of other query processing steps (see, Jiafeng Guo, et al., “Named entity recognition in query,” in Proc. 32nd Int'l ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR'09) pp. 267-274 (2009); Marius Pasca, “Weakly-supervised discovery of named entities using web search queries,” in Proc. 16th ACM Conf. on Information and Knowledge Management (CIKM '07) pp. 683-690 (2007); and Dou Shen, et al., “Personal name classification in web queries,” in Proc. Int'l Conf. on Web search and web data mining (WSDM '08) pp. 149-158 (2008)).
Work on query segmentation has been based on the statistical interaction between a pair of query words to identify the border between the segments in the query (see, Jones, et al., “Generating query substitutions,” in Proc. 15th Int'l Conf. on World Wide Web (WWW '06), pp. 387-396 (2006); and Guo, et al., “A Unified and Discriminative Model for Query Refinement,” Proc. SIGIR'08, pp. 379-386 (2008)). The approach proposed by Bergsma and Wang uses a machine-learned query segmentation system trained on a small, manually annotated set of queries (see, Shane Bergsma, et al., “Learning Noun Phrase Query Segmentation,” Proc. 2007 Joint Conf. on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 819-826 (2007)).
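As a minimal sketch of segmentation driven by the statistical interaction between adjacent query words, the following example places a segment border wherever the association between two neighboring words is weak. Pointwise mutual information (PMI) is assumed here as the association statistic; it is one common choice and is not necessarily the measure used in the cited works. The function names, the threshold value, and the counts in the usage example are hypothetical and invented purely for illustration.

```python
import math


def pmi(pair_count, left_count, right_count, total):
    """Pointwise mutual information of two adjacent words, from raw counts."""
    if pair_count == 0:
        return float("-inf")  # never seen together: weakest association
    return math.log((pair_count * total) / (left_count * right_count))


def segment(tokens, unigram, bigram, total, threshold=0.0):
    """Start a new segment wherever adjacent-word PMI is at or below threshold."""
    if not tokens:
        return []
    segments = [[tokens[0]]]
    for left, right in zip(tokens, tokens[1:]):
        score = pmi(bigram.get((left, right), 0),
                    unigram[left], unigram[right], total)
        if score > threshold:
            segments[-1].append(right)   # strong association: same segment
        else:
            segments.append([right])     # weak association: segment border
    return [" ".join(s) for s in segments]


if __name__ == "__main__":
    # Toy counts, invented for illustration only.
    unigram = {"new": 100, "york": 40, "hotel": 60}
    bigram = {("new", "york"): 30, ("york", "hotel"): 1}
    print(segment(["new", "york", "hotel"], unigram, bigram, total=1000))
```

With these toy counts, “new” and “york” are kept together while “hotel” forms its own segment. In practice, the counts would be estimated from a query log or other corpus, and the threshold would be tuned on held-out data.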
PoS tagging is used in many tasks in information analytics, such as query reformulation, query segmentation, and the like. However, PoS tagging has generally not been adapted for queries themselves. Allan and Raghavan consider that PoS tagging may be too ambiguous for short queries and propose interacting with the user for disambiguation (see, James Allan and Hema Raghavan, “Using part-of-speech patterns to reduce query ambiguity,” in Proc. SIGIR'02, pp. 307-314 (2002)). In the method of Barr, et al., a set of manually annotated queries is produced and a Brill tagger is then trained on this set in order to create an adapted PoS tagger for search queries (see, Barr, et al., “The Linguistic Structure of English Web-Search Queries,” Proc. Conf. on Empirical Methods in Natural Language Processing (EMNLP'08), pp. 1021-1030, October 2008).
Bendersky, et al., propose applying probabilistic models for capitalization, PoS tagging, and query segmentation, independently. The models rely on a document corpus rather than on the query itself (Michael Bendersky, et al., “Structural Annotation of Search Queries Using Pseudo-Relevance Feedback,” Proc. CIKM'10, pp. 1537-1540 (2010)). Such an approach is not generally applicable, since most content providers do not provide access to their document collections. Moreover, the query expansion, which is central to this approach, is not possible for most digital libraries that are organized in a database.
A system and method for processing queries are disclosed which enable natural language processing tools to be utilized more effectively in query processing.