With the rapid development of the Internet, the problem of “information overload” has become more and more serious. While people enjoy the convenience brought by the Internet, they are also flooded with masses of information on it. How to extract useful information from massive Internet data more effectively and accurately is a problem that urgently needs to be solved.
Currently, there are various kinds of Internet platforms, which provide large amounts of data to users. Among them are familiar search engines, e.g. Google, Baidu and Soso; interactive Q&A platforms, e.g. Zhidao, Wenwen and Answers; and popular blog platforms, e.g. Qzone and Sina blog.
All of these Internet platforms require a natural language text processing technique, i.e. a technique for extracting useful information from massive data for processing. Natural language text processing analyzes the syntax of a document, e.g. through categorization, clustering, summarization or similarity analysis. Since each document is composed of words, each specific technique in natural language text processing requires an understanding of words. Therefore, how to accurately evaluate the importance of a word in a sentence becomes an important research problem.
For example, in the sentence “China has a long history, great wall and terracotta army are pride of China”, the words “China”, “great wall”, “terracotta army” and “history” are obviously more important than the others.
Word quality mining and evaluating is to determine a proper quality level for a candidate word. For example, there may be three levels: important, common and constantly-used. Important words are selected first, followed by common words and constantly-used words. Thus, when a document is analyzed, important words may be considered first, common words may be used as a supplement, and constantly-used words may be filtered out completely.
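The three-level scheme described above can be illustrated with a minimal sketch. The names `Quality` and `split_by_quality` are hypothetical, and the level assigned to each word is hand-labelled for illustration only (the important words follow the example sentence given earlier; the "common" and "constantly-used" labels are assumptions). Words without a label are simply dropped in this sketch.

```python
from enum import Enum


class Quality(Enum):
    """Hypothetical labels for the three quality levels named in the text."""
    IMPORTANT = 1
    COMMON = 2
    CONSTANTLY_USED = 3


def split_by_quality(words, levels):
    """Return (primary, supplementary) word lists for a document.

    Important words are considered first, common words serve as a
    supplement, and constantly-used (or unlabelled) words are filtered out.
    """
    primary = [w for w in words if levels.get(w) is Quality.IMPORTANT]
    supplementary = [w for w in words if levels.get(w) is Quality.COMMON]
    return primary, supplementary


# Hand-labelled levels, for illustration only (an assumption, not mined data).
levels = {
    "China": Quality.IMPORTANT,
    "history": Quality.IMPORTANT,
    "great wall": Quality.IMPORTANT,
    "terracotta army": Quality.IMPORTANT,
    "long": Quality.COMMON,
    "pride": Quality.COMMON,
    "has": Quality.CONSTANTLY_USED,
    "a": Quality.CONSTANTLY_USED,
    "and": Quality.CONSTANTLY_USED,
    "are": Quality.CONSTANTLY_USED,
    "of": Quality.CONSTANTLY_USED,
}

# Pre-segmented form of the example sentence.
sentence = ["China", "has", "a", "long", "history", "great wall", "and",
            "terracotta army", "are", "pride", "of", "China"]

primary, supplementary = split_by_quality(sentence, levels)
```

In practice the levels would come from the mining and evaluating step itself rather than from a hand-written table.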
Currently, a word quality mining and evaluating method based on mass data is usually implemented by calculating a Document Frequency (DF) and an Inverse Document Frequency (IDF) of a word. That is to say, a word that does not appear often, i.e. a low-frequency word, is regarded as an unimportant word. However, the importance of a word cannot be determined accurately based on the calculated DF or IDF. For example, a result calculated on a corpus is as follows: the IDF of the word “lighten” is 2.89, whereas the IDF of the word “ha ha” is 4.76. In addition, in unstructured data, e.g. Q&A platform data and blog data, a low-frequency word may be an erroneous word, e.g. a mistyped string “asfsdfsfda” input by a user, or “Gao Qi also” (wrongly segmented from the sentence “Gao Qi also has hope for the new dynasty”).
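As a rough illustration of the DF and IDF calculation discussed above, the following sketch computes both values on a tiny, made-up corpus. The text does not state which IDF variant or logarithm base was used, so the common form IDF(w) = ln(N / DF(w)) is assumed here; the function names and the corpus are hypothetical.

```python
import math
from collections import Counter


def document_frequency(docs):
    """Count, for each word, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # set(): each word counts once per document
    return df


def inverse_document_frequency(docs):
    """IDF(w) = ln(N / DF(w)), where N is the number of documents.

    The natural logarithm without smoothing is an assumption of this sketch.
    """
    n_docs = len(docs)
    return {w: math.log(n_docs / count)
            for w, count in document_frequency(docs).items()}


# Hypothetical, pre-segmented toy corpus (not the corpus used in the text).
corpus = [
    ["china", "has", "a", "long", "history"],
    ["the", "great", "wall", "is", "a", "pride", "of", "china"],
    ["asfsdfsfda", "a"],
    ["ha", "ha", "a", "long", "day"],
]

df = document_frequency(corpus)
idf = inverse_document_frequency(corpus)
```

On this toy corpus the mistyped string “asfsdfsfda” occurs in a single document and therefore receives a higher IDF than “china”, which shows why a frequency statistic alone cannot separate rare meaningful words from rare noise.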
In addition, during document categorization, feature value methods such as Information Gain (IG) and χ² are usually used to evaluate the contribution of a word to a category. However, only the features whose values rank in the top n are selected as effective features, wherein n is an integer that may be selected according to the word quality mining and evaluating requirement. Then, a category weight is calculated based on TF-IDF, wherein TF denotes Term Frequency. The methods based on IG and χ² are used only for selecting feature words. They work well on small amounts of structured data. For massive unstructured data, however, an evaluation from a single aspect cannot fully reflect the importance of a word and cannot calculate that importance effectively. For example, based on the same corpus, the χ² of the word “of” is 96292.63382, whereas the χ² of “Jingzhou” is only 4445.62836. However, the word “Jingzhou”, whose χ² is lower, is obviously the more important one.
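For reference, the χ² score of a word with respect to a category and the top-n selection step can be sketched as follows. The text does not give the formula it used, so this sketch assumes the usual 2×2 contingency-table form of the χ² statistic; the function names and all document counts are hypothetical.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic of a word for one category.

    a: documents in the category that contain the word
    b: documents outside the category that contain the word
    c: documents in the category that do not contain the word
    d: documents outside the category that do not contain the word
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: the word or category never varies
    return n * (a * d - c * b) ** 2 / denom


def top_n_features(scores, n):
    """Return the n words with the highest scores, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Made-up document counts, for illustration only.
scores = {
    "jingzhou": chi_square(40, 10, 60, 890),
    "of": chi_square(95, 850, 5, 50),
    "weather": chi_square(20, 180, 80, 720),
}
selected = top_n_features(scores, 2)
```

The score measures only how strongly a word's presence depends on the category, which is the single-aspect limitation described above.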