Today, various content filtering mechanisms are provided to entities to manage and/or control user access to the Internet via facilities provided by the entities. For example, a company typically implements some form of content filtering mechanism to control the use of the company's computers and/or servers to access contents (e.g., web pages and/or emails) from the Internet. Contents as used herein broadly refer to expressive work, which may include one or more of literary, graphics, audio, and video data. Access to content within certain predetermined categories using the company's computers and/or servers may not be allowed during some predetermined periods of time.
Conventionally, a content rating engine or a content classification engine may be installed in a firewall to screen contents coming into a system from an external network, such as email received and web pages retrieved from the Internet. The content rating engine may retrieve a rating of the incoming contents from a rating database, if one is available, and/or attempt to rate the contents in real-time. To rate the contents in real-time, the content rating engine may parse the contents to identify some predetermined keywords and/or tokens and then determine a rating for the contents based on the presence and/or absence of the keywords and/or tokens.
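By way of illustration only, the conventional rating flow described above may be sketched as follows. The keyword weights, the threshold, the rating labels, and the names tokenize and rate_content are assumptions made for this sketch and do not describe any particular product; the rating database is modeled as a plain dictionary keyed by URL.

```python
import re

# Hypothetical keyword weights and threshold, chosen only for illustration.
KEYWORD_WEIGHTS = {"casino": 2, "jackpot": 1, "poker": 1}
BLOCK_THRESHOLD = 2


def tokenize(text):
    """Split text into lowercase tokens, using non-word characters as delimiters."""
    return re.findall(r"\w+", text.lower())


def rate_content(url, text, rating_db):
    """Return a stored rating for the URL if present; otherwise rate the text in real time."""
    if url in rating_db:
        return rating_db[url]
    # Real-time rating: sum the weights of any predetermined keywords that appear.
    score = sum(KEYWORD_WEIGHTS.get(token, 0) for token in tokenize(text))
    return "blocked" if score >= BLOCK_THRESHOLD else "allowed"
```

In this sketch, the tokenizer can find keywords only because spaces and punctuation separate the words in the text.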
However, the above rating mechanism typically relies on delimiters between words in the contents in order to identify the keywords and/or tokens. Some major languages (e.g., Chinese, Thai, and Japanese) do not have delimiters, such as spaces, between words, and thus are referred to as non-delimited languages. Because of the lack of delimiters between words, segmenting a stream of text in such a language requires a preprocessing stage, which is language-specific and computationally intensive. For example, the following sentence may appear in a Chinese blog: 女儿的写作水平还算过得去. The correct split into words is: 女儿 (daughter), 的 (possessive particle), 写作 (writing), 水平 (level), 还 (still), 算 (consider), 过得去 (acceptable). With this split, the sentence means “The daughter's writing level is still considered acceptable.” Note that some words are two characters long, some are one character long, and one is three characters long. Moreover, the whole context is necessary to split the sentence correctly. For example, one could also have split it as follows: 女儿 (daughter), 的 (possessive particle), 写 (write), 作 (make), 水 (water), 平 (Ping, a person's name), 还 (still), 算 (consider), 过 (past tense particle), 得 (must), 去 (go). With this split, the sentence means “The daughter's write make water, Ping had already considered must go,” which makes no sense. But for a computer system to detect automatically that this is nonsense, a word list is not sufficient. The computer system also needs a model of language usage. Developing and maintaining such a model is a knowledge-intensive task, and it would need to be repeated for each non-delimited language supported. Moreover, maintaining and using the model may be resource-intensive and may not be suitable for real-time applications. Thus, many conventional word-based real-time content rating mechanisms perform poorly on contents written in these non-delimited languages.
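The segmentation ambiguity described above may be illustrated with a short sketch. The sentence and word list below are assumptions chosen to be consistent with the glosses given in the example, and the function name segmentations is hypothetical. The sketch enumerates every split that a word list alone permits; it does not attempt to choose among them, which is the step that would require a model of language usage.

```python
# Assumed example sentence and word list, consistent with the glosses in the text above.
SENTENCE = "女儿的写作水平还算过得去"
WORD_LIST = {
    "女儿", "的", "写作", "写", "作", "水平", "水", "平",
    "还", "算", "过得去", "过", "得", "去",
}


def segmentations(text, words):
    """Return every way to split text into consecutive entries of the word list."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in words:
            # Extend each segmentation of the remainder with this leading word.
            for rest in segmentations(text[end:], words):
                results.append([head] + rest)
    return results


candidates = segmentations(SENTENCE, WORD_LIST)

# A whitespace-based tokenizer sees the whole sentence as a single token.
whitespace_tokens = SENTENCE.split()
```

Under these assumptions, the word list admits eight candidate splits, including both the intended one and the nonsensical character-by-character one, while whitespace-based tokenization returns the sentence as one unbroken token. The word list by itself gives no basis for preferring one candidate over another.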