Many languages, such as English, separate the words in text with white spaces. In these languages, identifying words in text is fairly straightforward for any technology that requires it, because the white spaces are known delimiters between adjacent words. These languages are referred to as space-delimited languages or segmented languages.
However, other languages, such as Chinese, Japanese, Korean and Vietnamese, are written simply as a sequence of evenly spaced characters. These languages have no clear separation between words, in that they do not place spaces between the words. These languages are referred to as non-segmented languages. The lack of a known delimiter in non-segmented languages makes precise detection of, for example, keywords quite difficult.
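The contrast between the two kinds of languages can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the Chinese string is a hypothetical example chosen for the sketch, and `whitespace_tokens` is an assumed helper name, not something defined in this document.

```python
def whitespace_tokens(text):
    """Split text into tokens on runs of white space (hypothetical helper)."""
    return text.split()


# Space-delimited text: white space marks the word boundaries.
english = "Input credit card number."
english_tokens = whitespace_tokens(english)

# Hypothetical non-segmented text (an assumed example): it contains no white
# space, so the same approach returns the whole sentence as a single token.
chinese = "输入信用卡号。"
chinese_tokens = whitespace_tokens(chinese)

print(english_tokens)
print(chinese_tokens)
```

Splitting on white space recovers four tokens from the English sentence, but it yields only one token for the non-segmented sentence, so it gives no information about where the words begin and end.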
Similarly, in non-segmented languages, the exact same characters can mean different things, based upon the surrounding context. By way of example, the following text:

has the word segmentation and translation shown in Table 1 below:
TABLE 1
Input | Credit-card | Number | 。 (end-of-sentence punctuation)
However, the following text

has the translation shown in Table 2 below:
TABLE 2
Zhou Xing (a person's name) | drove | his truck | to haul goods | 。 (end-of-sentence punctuation)
It can be seen that the text in Table 2 contains the same character sequence (highlighted) that is translated in the first example as “credit-card”, but here it has a completely different meaning and has nothing to do with credit cards.
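The ambiguity described above can be sketched as follows. The two Chinese sentences and their hand-written segmentations are hypothetical stand-ins chosen to mirror Tables 1 and 2, not the source text itself, and the function names are assumptions made for this illustration. The sketch compares a plain substring search with a check against already segmented words. It does not show how to obtain the segmentation.

```python
KEYWORD = "信用卡"  # "credit card"

# Hypothetical sentences with hand-written word segmentations (assumptions).
card_text = "输入信用卡号。"
card_words = ["输入", "信用卡", "号", "。"]

truck_text = "周信用卡车去拉货。"
truck_words = ["周信", "用", "卡车", "去拉货", "。"]


def substring_match(text, keyword):
    """Report a hit whenever the keyword characters appear contiguously."""
    return keyword in text


def word_match(words, keyword):
    """Report a hit only when the keyword is a whole segmented word."""
    return keyword in words


# The substring search flags both sentences, including the unrelated one.
print(substring_match(card_text, KEYWORD), substring_match(truck_text, KEYWORD))

# The word-level check flags only the sentence that is about credit cards.
print(word_match(card_words, KEYWORD), word_match(truck_words, KEYWORD))
```

In the second sentence the keyword characters straddle three different words, which is why a character-level search reports a false positive while a word-level check does not.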
In addition, in non-segmented languages, line breaks can occur in various places, which makes it even more difficult to identify keywords in the character sequence.
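The effect of a line break can be illustrated with a short sketch. The wrapped string is a hypothetical example, the function names are assumptions, and removing line breaks before matching is shown only as one possible workaround, not as a technique described in this document.

```python
KEYWORD = "信用卡"  # "credit card"

# Hypothetical text in which a line break falls inside the keyword.
wrapped = "输入信用\n卡号。"


def naive_contains(text, keyword):
    """Search the raw text, line breaks included."""
    return keyword in text


def contains_ignoring_line_breaks(text, keyword):
    """Remove line-break characters first, then search (assumed workaround)."""
    unwrapped = text.replace("\r", "").replace("\n", "")
    return keyword in unwrapped


print(naive_contains(wrapped, KEYWORD))
print(contains_ignoring_line_breaks(wrapped, KEYWORD))
```

The raw search misses the keyword because the line break interrupts the character sequence. Deleting line breaks outright is only reasonable for non-segmented text. In a space-delimited language a line break usually stands in for a space and would be replaced with one instead.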
This can be problematic in a variety of fields. For instance, policies and regulations from a variety of sources currently govern the dissemination of personal information. Organizations that deal with certain types of information are required to comply with all of these regulations. The regulations can be external regulations, which come from the government, for example, or internal regulations that govern how certain types of information can be disseminated within a company.
Often, the content that is subject to these regulations and policies is operated on by information workers who have a handbook that contains a large volume of regulations or policies (both internal and external), and each worker is expected to know and comply with all of them. In enforcing these policies, some systems attempt to identify sensitive information in the documents the information workers are working on. In doing so, those systems often examine the words in a document to determine whether the document is sensitive. For instance, a keyword such as “credit card” is seen as an indication of sensitive content. However, as discussed above, such keywords are very difficult to identify in non-segmented languages.
The discussion above is merely provided for general background information and is not intended to be used as an aid in determining the scope of the claimed subject matter.