Search queries are a rich source of information because they capture the collective wisdom of the crowd. People submit queries to search engines on a wide range of topics in the hope of finding useful information in web documents. Since most web vendors currently support keyword searches for their products, many users also submit queries to web vendors in order to find relevant products. Examples of queries include “hashbrown potato casserole recipe”, “batik beaded dress”, “problem with brake on Toyota truck”, “lg scoop slate”, and “Gone With The Wind”. These queries belong to the broad classes of food, apparel, transportation, electronics, and books, respectively, since they contain terms or groups of terms that are related to these classes; e.g., ‘hashbrown potato casserole’ and ‘recipe’ are related to the food class, ‘batik’ and ‘beaded dress’ are related to the apparel class, and ‘brake’ and ‘Toyota truck’ are related to the transportation class.
This disclosure presents a system and method for automatically recognizing terms or groups of terms in queries that belong to a particular class. For example, the system and method presented herein can be used to recognize members of the food class such as casserole, chicken meat, and broke noodle from the queries “cassarole potluck”, “chicken meat cuts ffa skillithon”, and “TGI Fridays broke noodle recipe”. One of the features described herein is the ability to distinguish between fine grained classes. For example, for the queries “cassoulet recipe” and “calories in cassata”, the method is able to recognize that ‘cassoulet’ is a French entree and ‘cassata’ is an Italian dessert. Another feature described herein is the ability to recognize entities belonging to multiple classes (also called multi-class entities) and to carry out context sensitive recognition of such entities. For example, for the query “apple”, the entity recognition system will recognize apple as both a fruit and a technology company. But for the queries “apple tree” and “apple phone”, the entity recognition system will carry out context sensitive recognition and recognize apple as a fruit in the first query and as a technology company in the second query.
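The context sensitive behavior described for the “apple” queries can be illustrated with a minimal sketch. It assumes that each class comes with a small set of context cue words; the function name `resolve_classes` and the cue sets are hypothetical and hand-written for this illustration, and the passage above does not state how context is actually modeled.

```python
from typing import Dict, List, Set


def resolve_classes(query: str, entity: str, candidates: List[str],
                    context_cues: Dict[str, Set[str]]) -> List[str]:
    """Return the candidate classes best supported by the other query terms."""
    # Context = query terms that are not part of the entity itself.
    context = set(query.lower().split()) - set(entity.lower().split())
    # Score each candidate class by how many of its cue words appear in the context.
    scores = {c: len(context & context_cues.get(c, set())) for c in candidates}
    best = max(scores.values(), default=0)
    if best == 0:
        # No disambiguating context: keep every candidate class.
        return sorted(candidates)
    return sorted(c for c, s in scores.items() if s == best)


# Hypothetical cue words, written by hand for this illustration only.
CUES = {
    "fruit": {"tree", "pie", "orchard"},
    "technology company": {"phone", "laptop", "store"},
}
CANDIDATES = ["fruit", "technology company"]

for q in ("apple", "apple tree", "apple phone"):
    print(q, "->", resolve_classes(q, "apple", CANDIDATES, CUES))
```

With no surrounding terms the sketch returns both classes, and with “tree” or “phone” in the query it narrows the result to a single class, mirroring the three example queries.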
Recognition of class specific information in search queries can benefit a large number of search engines and web vendors, since they can use this information to improve their search results, provide targeted advertising, extract domain specific information from query logs, and improve query expansion techniques.
The system described herein is termed an entity recognizer. The entity recognizer takes as input a search query or a set of queries, together with a list of user defined domain classes. The user defined classes are used to identify and tag groups of terms in queries that belong to the specified classes. For example, if the user provides a dessert class containing a list of dessert items (e.g., cake, ice cream, custard, etc.) as input, the system will tag and return all the groups of terms in queries belonging to the dessert class.
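The input and output of the entity recognizer described above can be sketched as follows. This is only an illustration of the interface under stated assumptions: the function name `tag_query`, the `max_len` parameter, and the greedy longest-match lookup against the user-supplied lists are hypothetical stand-ins, not the recognition method of this disclosure, which is meant to tag entities beyond the items the user lists.

```python
from typing import Dict, List, Set, Tuple


def tag_query(query: str, classes: Dict[str, Set[str]],
              max_len: int = 3) -> List[Tuple[str, List[str]]]:
    """Tag groups of terms in a query with the user defined classes they match."""
    tokens = query.lower().split()
    tagged: List[Tuple[str, List[str]]] = []
    i = 0
    while i < len(tokens):
        step = 1
        # Try the longest group of terms first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            labels = sorted(c for c, members in classes.items() if phrase in members)
            if labels:
                tagged.append((phrase, labels))
                step = n  # skip past the tagged group
                break
        i += step
    return tagged


# User defined class from the dessert example in the text.
dessert_classes = {"dessert": {"cake", "ice cream", "custard"}}
print(tag_query("vanilla ice cream recipe", dessert_classes))
```

For the query “vanilla ice cream recipe”, the sketch tags the group of terms “ice cream” with the dessert class and leaves the remaining terms untagged.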
There are three areas that are related to the topic of entity recognition in queries, namely (a) Named Entity Recognition, (b) Feature Selection and Document Classification, and (c) Query Classification using Web Knowledge.
Named entity taggers seek to locate and classify atomic elements in text into predefined categories, so they seem to provide the capabilities of an entity recognizer. However, named entity taggers are limited in several ways. First, current named entity taggers use syntactic, semantic, pattern-based, dictionary-based, linguistic grammar-based, and statistical model-based techniques to recognize named entities in text. They cannot be used to recognize entities in queries, since queries do not follow the prescribed grammar of sentences (i.e., they do not have a subject and a predicate). Moreover, named entity taggers that depend on lexicons to recognize the instances of a particular class are limited by the size of the lexicon. Also, named entity taggers that use hand-crafted grammar-based systems require months of work by experienced linguists. In addition, statistical model-based named entity taggers require large amounts of annotated training data to construct a model for domain specific classes.
Unlike named entity taggers, the entity recognition system and method described herein is able to recognize entities belonging to user defined classes in queries. Moreover, the system and method described herein does not require large amounts of training data from the user. For example, the user need only provide 10 seed instances for each class. As such, the system and method described herein significantly reduces the manual labor required to collect large amounts of training data.
Feature selection is a process commonly used in machine learning, wherein a subset of the features available from the data is selected for the application of a learning algorithm. It is an important stage of pre-processing, is one of the two ways of avoiding the curse of dimensionality, and is very useful for the classification of text documents containing a large number of features. With respect to entity recognition, careful selection of the set of features for a bag-of-words entity model is important. However, the problem addressed herein is fundamentally different from the problem addressed by feature selection for document classification.
In feature selection for document classification, a set of training documents belonging to user defined classes is carefully selected by the user. These documents contain information pertaining to the classes to which they belong. Additionally, the number of training documents provided for feature selection is usually very large. On the other hand, the input to the entity recognition system and method described herein is a set of 10 seed instances per class. Models that help recognize the class of unknown entities are constructed from this limited training data, and the web is used as a surrogate to create the models. However, unlike the training documents provided in feature selection, the web documents obtained using the seed instances usually contain a lot of information that is not related to the user defined entity classes. In order to include features pertaining to the entity classes in the entity models, the information related to the entity classes must be carefully separated from the rest of the content of the web documents. Existing feature selection methods do not help in isolating this information, because these methods assume that the document features are independent. In contrast, the inter-dependence between features is key to the method of model construction used in embodiments of the present invention.
Moreover, feature selection methods are used for classifying unknown documents containing a large number of features pertaining to one of the user defined classes. As such, these methods cannot be used for the recognition of unknown entities since, for performance reasons, the system and method described herein only addresses the web snippets for these unknown entities, and the web snippets contain very limited features.
Further, feature selection methods such as Branch and Bound, Decision Tree, and Minimum Description Length are prohibitively data intensive for selecting features from text documents containing thousands of features, since they construct the set of discriminative features by adding or removing one feature at a time and checking whether the feature set has improved according to some criterion.
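The cost of adding one feature at a time can be made concrete with a generic greedy forward selection loop. This is a minimal sketch and not an implementation of Branch and Bound, Decision Tree, or Minimum Description Length; the function `forward_select` and the toy criterion are hypothetical and chosen only to count how often the criterion must be evaluated.

```python
from typing import Callable, FrozenSet, List, Tuple


def forward_select(features: List[str],
                   criterion: Callable[[FrozenSet[str]], float]) -> Tuple[List[str], int]:
    """Greedily add one feature at a time while the criterion improves."""
    selected: List[str] = []
    remaining = list(features)
    best_score = criterion(frozenset())
    evaluations = 1
    while remaining:
        # Every remaining feature is tried once per round.
        scored = []
        for f in remaining:
            scored.append((criterion(frozenset(selected + [f])), f))
            evaluations += 1
        top_score, top_feature = max(scored)
        if top_score <= best_score:
            break  # no candidate improves the feature set
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, evaluations


# Toy criterion (an assumption): reward useful features, slightly penalize set size.
USEFUL = {"recipe", "calories"}


def toy_criterion(subset: FrozenSet[str]) -> float:
    return len(subset & USEFUL) - 0.01 * len(subset)


chosen, n_evals = forward_select(["recipe", "calories", "login", "copyright"], toy_criterion)
print(chosen, n_evals)
```

For n candidate features this loop performs up to 1 + n(n+1)/2 criterion evaluations, so the number of evaluations grows quadratically with the vocabulary. In practice each evaluation involves scoring a feature set against the training data, which is what makes such methods costly for documents with thousands of features.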
One feature selection method called Relief initially appears efficient enough to collect the set of discriminative features from text documents. However, the Relief method ultimately fails for entity recognition, since it determines whether or not a feature is discriminative based on its presence or absence in a hit or miss instance. As a result, the Relief method will miss discriminative features that are unique to an instance. For example, if an instance of the food class contains features such as “diet” and “cholesterol” that are not shared by any other instance, the Relief algorithm will not assign high scores to these features (even though they are discriminative), since they are not shared across instances of the same class. In contrast, the techniques described herein enable these features to be found if they are present in well formed sentences containing high quality food features such as ‘fruit’. The model of the system and method described herein needs to be rich in discriminative features, because only the limited features present in a few web snippets are used to determine the class of an unknown entity. Since Relief will miss many discriminative features, it is not a good candidate for use in entity recognition.
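The behavior of Relief on a feature that is unique to one instance can be reproduced with a small sketch. It follows the usual nearest-hit and nearest-miss weight update for binary features, but iterates over every instance instead of sampling at random so that the result is deterministic; the function `relief_scores` and the toy data are assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Instance = Tuple[Sequence[int], str]  # (binary feature vector, class label)


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions in which two binary vectors differ."""
    return sum(abs(x - y) for x, y in zip(a, b))


def relief_scores(data: List[Instance], names: List[str]) -> Dict[str, float]:
    """Relief weights; assumes at least two classes and two instances per class."""
    totals = [0] * len(names)
    for i, (x, label) in enumerate(data):
        hits = [v for j, (v, l) in enumerate(data) if j != i and l == label]
        misses = [v for v, l in data if l != label]
        hit = min(hits, key=lambda v: hamming(x, v))     # nearest same-class instance
        miss = min(misses, key=lambda v: hamming(x, v))  # nearest other-class instance
        for k in range(len(names)):
            # Reward differing from the miss, penalize differing from the hit.
            totals[k] += abs(x[k] - miss[k]) - abs(x[k] - hit[k])
    return {name: total / len(data) for name, total in zip(names, totals)}


FEATURES = ["fruit", "diet", "engine"]
DATA: List[Instance] = [
    ((1, 1, 0), "food"),            # the only instance containing "diet"
    ((1, 0, 0), "food"),
    ((1, 0, 0), "food"),
    ((0, 0, 1), "transportation"),
    ((0, 0, 1), "transportation"),
]
relief = relief_scores(DATA, FEATURES)
print(relief)
```

On this toy data the shared features “fruit” and “engine” each receive a weight of 1.0, while “diet” receives 0.0 even though it occurs only in the food class, which is the shortcoming described above.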
Another popular method of document classification uses Support Vector Machines (SVMs). Rennie et al. demonstrated that SVMs substantially outperform Naive Bayes for the task of multiclass text classification. However, SVM processing suffers from the same limitation as the feature selection methods in that it requires a large set of carefully selected training documents belonging to user defined classes. Such a refined training set cannot be constructed from the web documents retrieved using the seed instances, as discussed above.
There is evidence of using the Web as a surrogate for the classification of queries as well as query terms. In Chuang et al., unknown query terms are classified into a predefined taxonomy by using the documents returned by searching for the query term on the web. However, this method depends on the manual categorization of query terms in a taxonomy tree, and this pre-categorized vocabulary is used to construct the set of features that help in determining the class of a new query term. In order to obtain a large set of features for each class, a large set of query terms would have to be manually categorized. Using the approach described herein, however, very little training data is required to automatically find a rich set of entity class features from the web, thereby reducing the manual labor involved in collecting the set of features for each class.
Broder et al. determine the topic of a query by classifying the web search results retrieved by the query. However, their methods are not practical for the problem addressed herein for several reasons. Broder et al. use the web documents retrieved by the entire query as opposed to individual query terms. Classifying documents using the entire query can help in determining the high level class to which a query belongs, but will not help in recognizing the low level classes for the entities in the query. Also, since they retrieve information from web documents, as opposed to web snippets, their methods are only feasible for a commercial search engine that has access to indexed web pages that can be pre-classified and retrieved quickly so as to reduce the overhead of online query classification.