When using a search engine, a user describes the information that he or she wishes to retrieve in the form of a text-based query. Typically, the search engine searches a database according to the information described in the query and returns one or more search results to the user. Statistical analysis shows that, on average, a query (e.g., “silk one-piece dress”, “mobile phone”) input by a user consists of 2.4 words. In general, the query input by the user is in the form of natural or informal text (e.g., incomplete sentences, sentences without correct punctuation) rather than a structured statement built with operators such as “and”, “or”, and “not”. Therefore, the search engine has to determine the actual intent of the user based on the content of the query, perform a search, and return the search result to the user.
As used herein, word information entropy refers to the measurement of correlation between the length of certain text content and its certainty in describing a user's intent. For example, a significant amount of information is usually needed to clarify an uncertain concept or something about which little is known, and less information is usually needed to clarify something that is already known to some extent. In this respect, it can be said that the measurement of information content is equivalent to the extent of uncertainty. Therefore, information content in a query may be represented by the concept of word information entropy, such that the real intent of the user may be determined according to the word information entropy associated with the query to aid in performing a search based on that query.
Typically, word information entropy is calculated with the formula of TF/IDF, in which TF represents the total number of times that a word occurs in a set of documents and IDF represents the number of documents in the set that include the word. A larger value of TF/IDF calculated for a word indicates that the word is of relatively higher importance, and a smaller value of TF/IDF calculated for a word indicates that the word is of relatively lower importance.
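The calculation described above may be sketched as follows. This is a minimal illustration only, assuming whitespace tokenization and a hypothetical function name `tf_over_idf`; it follows the definitions given in this passage (total occurrences divided by the number of documents that include the word) and is not intended to represent any particular existing implementation.

```python
from collections import Counter


def tf_over_idf(documents):
    """Return {word: TF / IDF} using the definitions in the passage above.

    TF  = total number of times the word occurs across all documents.
    IDF = number of documents that include the word (as defined in the text).
    Tokenization by whitespace is an assumption made for illustration.
    """
    total_occurrences = Counter()   # TF per word
    containing_documents = Counter()  # IDF per word, per the passage's definition
    for text in documents:
        words = text.lower().split()
        total_occurrences.update(words)
        containing_documents.update(set(words))  # count each document once
    return {
        word: total_occurrences[word] / containing_documents[word]
        for word in total_occurrences
    }


# Hypothetical three-document set used only to exercise the function.
docs = [
    "phone phone phone review",
    "new phone",
    "new review",
]
scores = tf_over_idf(docs)
```

In this hypothetical set, "phone" occurs four times in two documents and therefore receives a larger value than "new", which occurs twice in two documents.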
TF/IDF may be used to calculate word information entropy for a long text (e.g., a document with a large number of words). A query, however, typically comprises a short text. Since a query contains only 2.4 words on average and seldom includes more than one occurrence of a word, the words in the query are unlikely to be distinguished in terms of importance by the word information entropies calculated with the formula of TF/IDF. For example, for a query of “new mobile phone”, the common modifier word “new” and the words “mobile phone” cannot be adequately distinguished in terms of importance according to the word information entropies calculated with the existing formula of TF/IDF.
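The limitation may be illustrated with a short sketch. It assumes, for illustration only, that the query itself is the text being scored, that the query has already been segmented into the words “new” and “mobile phone”, and that a hypothetical helper named `query_word_scores` applies the TF/IDF ratio as defined above.

```python
from collections import Counter


def query_word_scores(query_words):
    """Score each word of a single query with the ratio TF / IDF.

    The query is treated as the only text available, so the number of
    texts that include any given word (IDF, as defined above) is 1.
    This reading of the passage is an assumption made for illustration.
    """
    occurrences = Counter(query_words)  # TF within the query
    texts_containing_word = 1           # only one text: the query itself
    return {word: count / texts_containing_word
            for word, count in occurrences.items()}


# The segmentation into "new" and "mobile phone" is assumed, not computed.
scores = query_word_scores(["new", "mobile phone"])
all_equal = len(set(scores.values())) == 1
```

Because each word occurs exactly once in the query, both words receive the same value, and the modifier “new” is not ranked below “mobile phone”.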