Word segmentation is one of the basic problems of natural language processing. It must be performed for any language that has no explicit word boundary markers (such as Chinese, Japanese, and Arabic). Word segmentation systems are widely used in fields such as information retrieval, machine translation, and question answering.
Different applications place different requirements on the output of a word segmentation system. For example, an information retrieval system requires relatively high segmentation speed and consistency but tolerates relatively low segmentation correctness, such as a relatively low recognition rate for unrecorded words (words that are not recorded in the word segmentation system). In contrast, a machine translation system requires relatively high segmentation correctness but tolerates relatively low segmentation consistency. Consider the string “Jiang Wenyuan”, which is an unrecorded word. In an information retrieval application, if the word segmentation system segments “Jiang Wenyuan” into the two words “Jiang” and “Wenyuan” instead of the single word “Jiang Wenyuan”, the information retrieval system can still find related documents, provided that the word segmentation system segments every occurrence of “Jiang Wenyuan”, in the documents and in the query, in the same manner. In a machine translation system, however, if the string “Jiang Wenyuan” is segmented into “Jiang” and “Wenyuan”, the word “Jiang” may be incorrectly translated into the English word “ginger”, making the translation result inaccurate.
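The retrieval half of this example can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the unspaced Latin strings stand in for text without word boundary marks, the two lexicons and the sample documents are hypothetical, and greedy forward maximum matching is used only as a simple stand-in segmenter, not as the method of any particular system.

```python
from collections import defaultdict

# Hypothetical lexicons: BASE does not record the name, FULL does.
BASE = {"Jiang", "Wenyuan", "visited", "Beijing"}
FULL = BASE | {"JiangWenyuan"}

def segment(text, lexicon):
    """Greedy forward maximum matching over text with no word boundaries.

    Characters that match no lexicon entry become single-character tokens.
    """
    tokens, i = [], 0
    max_len = max(len(word) for word in lexicon)
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])
            i += 1
    return tokens

def build_index(docs, lexicon):
    """Inverted index: token -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in segment(text, lexicon):
            index[token].add(doc_id)
    return index

def search(index, query, lexicon):
    """Return ids of documents that contain every token of the query."""
    postings = [index.get(token, set()) for token in segment(query, lexicon)]
    return set.intersection(*postings) if postings else set()

# Hypothetical documents, written without spaces.
docs = {1: "JiangWenyuanvisitedBeijing", 2: "visitedBeijing"}
index = build_index(docs, BASE)  # the name is split into two tokens

consistent = search(index, "JiangWenyuan", BASE)    # query split the same way
inconsistent = search(index, "JiangWenyuan", FULL)  # query kept as one token
```

Under these assumptions, the consistently segmented query still finds document 1 even though the name was split, whereas the query segmented differently from the documents finds nothing. This is why consistency matters more than correctness in the retrieval setting.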
Current word segmentation systems can each meet the requirements of only one particular application and are difficult to reuse across application scenarios. Because some companies and organizations in the industry need word segmentation in several application scenarios, they usually customize a separate word segmentation system for each application. This approach wastes resources and makes the systems difficult to maintain.