Languages can be divided into two types according to whether or not they have word dividers. One type has word dividers, such as English, German, and many other European languages. Generally, spaces between the words serve as word dividers. The other type has no word dividers for marking the words in a sentence. Many East Asian languages such as Chinese, Japanese, and Korean are non-divider-marked languages.
Search engine, machine translation, and speech synthesis applications involve text processing problems that often require segmenting text written in a non-divider-marked language, that is, dividing each sentence into a series of segments. The segmentation process typically relies on a word segmentation lexicon, which is a database or dictionary containing a considerable number of pre-stored entries. During word segmentation, a given text is matched against the entries of the lexicon according to a particular strategy (e.g., a forward maximum matching method from left to right, a backward maximum matching method from right to left, a minimum segmentation method, etc.). For example, in a maximum matching method, the longest lexicon entry that matches the input text at the current position is identified as a word, and the identified word is regarded as a segment. Proceeding in this manner, one can segment the given text into a segment series. The segments may include successfully matched words as well as single characters or dynamically identified words.
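The forward maximum matching strategy described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the function name `forward_maximum_match`, the toy lexicon, and the use of a space-free English string as a stand-in for non-divider-marked text are all hypothetical, and falling back to a single character when no entry matches is one possible way to handle unmatched text.

```python
def forward_maximum_match(text, lexicon, max_len=None):
    """Segment text left to right, always taking the longest matching entry."""
    if max_len is None:
        # Assumption: the longest lexicon entry bounds the search window.
        max_len = max((len(entry) for entry in lexicon), default=1)
    max_len = max(1, max_len)
    segments = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first and shrink until an entry matches.
        for length in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            if length == 1 or candidate in lexicon:
                # A single unmatched character is emitted as its own segment.
                segments.append(candidate)
                pos += length
                break
    return segments


# Hypothetical toy lexicon; spaces are omitted to imitate divider-free text.
lexicon = {"stainless", "steel", "stainlesssteel", "tube", "steeltube"}
print(forward_maximum_match("stainlesssteeltube", lexicon))
# ['stainlesssteel', 'tube']
```

A backward maximum matching variant would scan from the end of the text instead, and with this lexicon it would likely yield a different segment series, which is one reason the choice of strategy matters.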
For a given piece of text, the longer the segments in a resulting word segment series (i.e., the smaller the number of segments contained in the segment series), the greater the word segmentation granularity. Conversely, the larger the number of segments in a resulting word segment series, the smaller the word segmentation granularity. For example, for a given text meaning “The People's Republic of China was established”, the fine-grained word segmentation result is “China-People's-Republic-establish-ed”, and the coarse-grained word segmentation result is “People's Republic of China-establish-ed”.
Different applications have different requirements concerning the granularity levels of segmentation results. For example, in machine translation, granularity should be somewhat larger, e.g., the phrase meaning “business management” should be a single segment. But in the index system of a search engine, the same phrase would generally be divided into two segments (“business” and “management”).
Granularity level requirements concerning segmentation results can vary even for the same type of application. Search engine applications are used below as an example. Search engines require different word segmentation granularities for different fields. For example, for search engines used in the field of e-commerce (e.g., for product searches), both sellers and buyers demand higher recall rates in their searches. To accomplish this, the search system needs smaller index granularity and accordingly requires finer-grained segmentation results. For search engines used for general web page searches, search precision becomes particularly important to users because of the vast quantity of Internet web pages. To accomplish this, the search system requires coarser-grained segmentation results. Search recall rate and search precision are thus important measures for evaluating search quality. The search recall rate, which measures how completely the system finds the relevant information, is the ratio of relevant documents found to the total number of relevant documents. Search precision, which measures how much of what the system finds is actually relevant, is the ratio of relevant documents found to all documents found. Word segmentation granularity affects both measures. Generally speaking, the smaller the word segmentation granularity, the higher the search recall rate; the larger the word segmentation granularity, the higher the search precision.
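The two ratios defined above can be made concrete with a short sketch. The function name `recall_and_precision` and the document identifiers are hypothetical, and the two result sets are invented purely to illustrate the trade-off; they are not measurements from any real search system.

```python
def recall_and_precision(retrieved, relevant):
    """Return (recall, precision) for a set of retrieved documents."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were found
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical data: four documents are truly relevant to the query.
relevant = {"d1", "d2", "d3", "d4"}

# Assumed outcome with a fine-grained index: more matches, more noise.
fine_grained_hits = {"d1", "d2", "d3", "d5", "d6", "d7"}
# Assumed outcome with a coarse-grained index: fewer but cleaner matches.
coarse_grained_hits = {"d1", "d2"}

print(recall_and_precision(fine_grained_hits, relevant))    # (0.75, 0.5)
print(recall_and_precision(coarse_grained_hits, relevant))  # (0.5, 1.0)
```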
Granularity level requirements concerning segmentation results vary even across different usage stages within the same field of the same type of application. Web search engine applications are again used as an example. In order to meet user requirements with respect to both search recall rate and search precision, granularity level requirements differ between the index stage and the sequencing (i.e., ranking) stage of search. In the index stage, finer-grained segmentation results are required so that a sufficient number of web pages can be located. In the sequencing stage, coarser-grained segmentation results are required so as to meet the need for search precision and to avoid providing the user with irrelevant web pages.
To solve the problems described above, the prior art mainly employs two schemes for providing segmentation results having multiple levels of granularity:
FIG. 1A illustrates a typical scheme for providing segmentation results with multiple levels of granularity. First, minimal-grained word segmentation is performed. Then, a bottom-to-top dynamic merge is conducted. Specifically, a finer-grained word segmentation lexicon A is used to perform word segmentation on a given text. Different segment series can be generated in the word segmentation process. For example, the text S1S2S3S4S5S6S7 (where Sn represents a character) can be divided into S1S2-S3S4-S5-S6S7 or S1S2S3-S4S5-S6S7. Then one of the segment series (let us assume here that it is S1S2-S3S4-S5-S6S7) can be selected as the optimal segment series according to a preset selection algorithm. The preset algorithm can be an algorithm based on a statistical model.
In order to provide coarser-grained segmentation results, a merge is performed on the series S1S2-S3S4-S5-S6S7. The merge process assesses whether a combination of two adjacent segments in the series S1S2-S3S4-S5-S6S7 matches an entry in word segmentation lexicon B, which contains longer entries. If it does, the two segments are merged, and a merged, coarser-grained segment series results. Let us assume here that S1S2 and S3S4 can be merged and that S5 and S6S7 can be merged, in which case the merged, coarser-grained segment series will be S1S2S3S4-S5S6S7.
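The bottom-to-top merge can be sketched as below, reusing the symbolic series from the text. The function name `merge_adjacent` is hypothetical, and the single greedy left-to-right pass over adjacent pairs is an assumption; the text only states that combinations of two segments are checked against lexicon B.

```python
def merge_adjacent(segments, coarse_lexicon):
    """Merge neighbouring segment pairs whose concatenation is a lexicon entry."""
    merged = []
    i = 0
    while i < len(segments):
        pair = segments[i] + segments[i + 1] if i + 1 < len(segments) else None
        if pair is not None and pair in coarse_lexicon:
            # The pair matches a longer entry, so emit it as one segment.
            merged.append(pair)
            i += 2
        else:
            merged.append(segments[i])
            i += 1
    return merged


# The fine-grained series and the assumed contents of lexicon B.
fine_series = ["S1S2", "S3S4", "S5", "S6S7"]
lexicon_b = {"S1S2S3S4", "S5S6S7"}
print(merge_adjacent(fine_series, lexicon_b))
# ['S1S2S3S4', 'S5S6S7']
```

Because the merge only ever joins segments that already exist in the fine-grained series, any word that straddles a boundary of that series can never appear in the merged output.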
If this method is used, some semantic items will be lost during word segmentation. For example, the semantic elements S1S2S3 and S4S5 will be lost. An actual example illustrates this. In a text meaning “This stainless steel tube is cast using grade 1 steel”, the phrase “stainless steel tube” in fact contains two semantic items: “stainless steel” and “steel tube”. If “stainless steel tube” is segmented at the minimum granularity into “stainless steel-tube” (where “-” separates two adjacent segments) and the segments are then merged again to form “stainless steel tube”, then the semantic item “steel tube” is lost. Consequently, the term “steel tube” will not be found during the search for this text. If “stainless steel tube” is instead segmented at the minimum granularity into “none-stain-steel tube” and the segments are then merged again to form “stainless steel tube”, then the semantic item “stainless steel” is lost. “Stainless steel” is therefore not found during the search for this text.
In addition, it is difficult to ensure merging precision. Assuming that the segment series obtained from minimum-granularity word segmentation of the given text is “this-stainless steel-tube-using-grade 1-steel-cast”, ambiguities will be encountered during the merge. The segment “tube” may be merged with the preceding segment to give “stainless steel tube” or with the following segment “using” to give “useful”. If the segment series obtained from minimum-granularity word segmentation of the given text is “this-stainless steel-useful-grade 1-steel-cast”, then it cannot be merged again to obtain the semantic item “stainless steel tube”.
FIG. 1B illustrates another typical scheme for providing segmentation results with multiple levels of granularity. First, maximal-grained word segmentation is performed. Then, segmentation from top to bottom is performed. In particular, a coarser-grained word segmentation lexicon C is used, and a model and algorithm are used to perform dynamic word segmentation of a given text S1S2S3S4S5S6S7 (i.e., to select the optimal segment series), yielding the segment series S1S2S3S4-S5S6S7.
To obtain a finer-grained word segmentation result, each semantic element in S1S2S3S4-S5S6S7 is segmented again. Specifically, each segment in the series S1S2S3S4-S5S6S7 is assessed to determine whether it contains two or more other finer-grained entries in word segmentation lexicon C. If it does, then this segment is sub-divided into those entries. Let us assume that S1S2S3S4 can be sub-divided into S1S2 and S3S4 and that S5S6S7 can be sub-divided into S5 and S6S7, in which case the finer-grained word segmentation result obtained after sub-division would be S1S2-S3S4-S5-S6S7.
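The top-to-bottom sub-division can be sketched in the same symbolic notation. The names `split_segment` and `split_series` are hypothetical, and two details are assumptions not stated in the text: each segment is sub-divided only one level deep, and when several sub-divisions are possible the one with the longest leading entry is taken.

```python
def split_segment(segment, lexicon):
    """Return a division of segment into two or more lexicon entries, or None."""

    def cover(rest):
        # Return a list of entries that exactly covers rest, or None.
        if not rest:
            return []
        for cut in range(len(rest), 0, -1):  # prefer a longer leading entry
            head = rest[:cut]
            if head in lexicon:
                tail = cover(rest[cut:])
                if tail is not None:
                    return [head] + tail
        return None

    # The leading part must be shorter than the whole segment, which
    # guarantees that the result has at least two parts.
    for cut in range(len(segment) - 1, 0, -1):
        head = segment[:cut]
        if head in lexicon:
            tail = cover(segment[cut:])
            if tail is not None:
                return [head] + tail
    return None


def split_series(segments, lexicon):
    """Sub-divide every segment that can be covered by finer entries."""
    result = []
    for segment in segments:
        parts = split_segment(segment, lexicon)
        result.extend(parts if parts else [segment])
    return result


# Assumed contents of lexicon C: coarse entries plus their finer parts.
lexicon_c = {"S1S2S3S4", "S5S6S7", "S1S2", "S3S4", "S5", "S6S7"}
print(split_series(["S1S2S3S4", "S5S6S7"], lexicon_c))
# ['S1S2', 'S3S4', 'S5', 'S6S7']
```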
If this method is used, a greater number of coarse-grained entries will need to be recorded in the lexicon in order to solve the problem of ambiguities occurring during maximal-grained word segmentation. For example, given a text meaning “business management science and technology”, if the coarser-grained entries “business management” and “management science” are recorded in the lexicon, then “business management science” may be segmented into “business management-science” or “business-management science”. One solution to this ambiguity is to record an even longer entry, “business management science”, in the lexicon. However, “business management science” will in turn give rise to a segmentation ambiguity with respect to “science and technology”. Thus, such a set composed of coarse-grained entries is not a closed set. Expansion of the lexicon will create difficulties for lexicon maintenance.
As can be seen, the greater the granularity of the entries in a word segmentation lexicon, the greater the number of different segment series that will be generated during word segmentation. That is, there will be more word segmentation paths and thus more ambiguity problems, which makes it difficult to ensure the precision of maximal-grained segmentation.
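The growth in segmentation paths can be illustrated by enumerating every segment series that a lexicon permits. This is a hedged sketch: the function name `all_segmentations`, the three toy lexicons, and the space-free English string standing in for the “business management science and technology” example are all hypothetical.

```python
from functools import lru_cache


def all_segmentations(text, lexicon):
    """Enumerate every way to cover text exactly with lexicon entries."""
    lexicon = frozenset(lexicon)

    @lru_cache(maxsize=None)
    def paths(start):
        if start == len(text):
            return [[]]  # one way to segment the empty remainder
        found = []
        for end in range(start + 1, len(text) + 1):
            word = text[start:end]
            if word in lexicon:
                found.extend([word] + rest for rest in paths(end))
        return found

    return paths(0)


# Hypothetical stand-in text with the word dividers removed.
text = "businessmanagementsciencetechnology"
fine = {"business", "management", "science", "technology"}
coarser = fine | {"businessmanagement", "managementscience"}
coarsest = coarser | {"businessmanagementscience", "sciencetechnology"}

for name, lex in [("fine", fine), ("coarser", coarser), ("coarsest", coarsest)]:
    print(name, len(all_segmentations(text, lex)))
# fine 1
# coarser 3
# coarsest 6
```

Each added coarse entry opens new paths rather than closing old ones, so a selection model has more candidates to choose among as the lexicon grows.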
When there are maximal-grained segmentation results, the fine-grained words of these segments can be obtained by checking the lexicon. However, as a lexicon expands, manual maintenance of these entries and the fine-grained words of these entries, while maintaining the quality of entries, can be costly.
In summary, the prior art for providing segmentation results with multiple granularity levels typically suffers either from low recall rates, caused by lost semantic items, or from excessively large word segmentation lexicons and low word segmentation precision.