Field of the Invention
Embodiments of the present invention relate to classifying content, and more specifically to searching for one or more predetermined N-grams in a string of bytes representing content written in a non-delimited language.
Description of the Related Art
Today, many entities (e.g., private companies, government agencies, schools, etc.) rely on various content filtering mechanisms to manage and/or control user access to the Internet via facilities provided by the entities. For example, a company typically implements some form of content filtering mechanism to control the use of the company's computers and/or servers to access contents (e.g., web pages and/or emails) from the Internet. Contents as used herein broadly refer to expressive works, which may include one or more of literary, graphic, audio, and video data. Access to contents within certain predetermined categories using the company's computers and/or servers may not be allowed during some predetermined periods of time.
Conventionally, a content rating engine or a content classification engine may be installed in a firewall to screen contents coming into a system from an external network, such as emails received and web pages retrieved from the Internet. The content rating engine may retrieve a rating of the incoming contents from a rating database, if one is available, and/or attempt to rate the contents in real-time. To rate the contents in real-time, the content rating engine may parse the contents to identify some predetermined keywords and/or tokens and then determine a rating for the contents based on the presence and/or absence of the keywords and/or tokens.
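For illustration only, the conventional rating flow described above may be sketched as follows. The keyword table, the category names, and the function name rate_content are hypothetical and are not taken from any particular rating engine; the rating database is assumed to be a simple mapping from URL to category.

```python
# Hypothetical keyword-to-category table (an assumption for this sketch).
KEYWORD_CATEGORIES = {
    "casino": "gambling",
    "poker": "gambling",
    "lottery": "gambling",
}

PUNCTUATION = ".,!?;:\"'()"


def rate_content(url, text, rating_db):
    """Return a category for the content, or None if nothing matched."""
    # Prefer a stored rating when the rating database has one.
    if url in rating_db:
        return rating_db[url]
    # Otherwise rate in real time from the keywords that are present.
    hits = {}
    for token in text.lower().split():
        category = KEYWORD_CATEGORIES.get(token.strip(PUNCTUATION))
        if category is not None:
            hits[category] = hits.get(category, 0) + 1
    if not hits:
        return None
    # Report the category with the most keyword hits.
    return max(hits, key=hits.get)
```

In this sketch the real-time path depends on splitting the text at spaces, which is the assumption that fails for non-delimited languages.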
For European languages (e.g., English, French, etc.), the spaces between words are often used as delimiters for recognizing word boundaries. Therefore, words in European languages can be readily tokenized and searched using the spaces between the words. As a result, tokenization generally proceeds efficiently for European languages.
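As a minimal sketch of such space-based tokenization, assuming plain text and a small fixed set of punctuation characters to trim (the helper names tokenize and contains_keyword are hypothetical):

```python
PUNCTUATION = ".,!?;:\"'()"


def tokenize(text):
    """Split space-delimited text into lowercase word tokens."""
    tokens = []
    for raw in text.split():
        word = raw.strip(PUNCTUATION).lower()
        if word:  # skip items that were punctuation only
            tokens.append(word)
    return tokens


def contains_keyword(text, keywords):
    """Return True if any token of the text is in the keyword set."""
    return any(token in keywords for token in tokenize(text))
```

For example, tokenize("The government began rationing.") yields the four tokens "the", "government", "began", and "rationing", each of which can be compared against a keyword set with a single lookup.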
However, the above approach typically fails for languages that lack spaces between words, such as Chinese, Japanese, Thai, etc. Such languages are also referred to as non-delimited languages herein. For example, a Chinese sentence is composed of words, each of which contains a variable number of characters, with no spaces indicating the word boundaries. Below is an example of an excerpt from a Chinese newspaper: “, ” The words are  (now) (one)  (week)  (ago)  (Iranian)  (government)  (began)  (implementing)  (energy)  (rationing) . . . . Note that the characters are not separated by spaces, and a word may include one or more characters. Other examples can be more complicated, with ambiguous sentences in which the correct split of text into words can be found only by understanding the context. As a result, spaces cannot be reliably used as delimiters for recognizing words in Chinese. Because of the above issue, keyword search in non-delimited languages is typically difficult and time-consuming. This is particularly problematic in real-time or on-the-fly content filtering because the keyword search has to be limited to avoid causing a noticeable delay in online content access.
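The difficulty can be illustrated with a short sketch. The sample sentence below is a hypothetical example chosen for this sketch (it is not the newspaper excerpt), and find_ngrams is a hypothetical helper that performs a naive sliding-window comparison over characters rather than over the underlying bytes; it is not presented as the search method of any embodiment.

```python
def find_ngrams(text, ngrams):
    """Return (offset, ngram) pairs for every listed N-gram found in text."""
    lengths = sorted({len(gram) for gram in ngrams})
    found = []
    for start in range(len(text)):
        for n in lengths:
            candidate = text[start:start + n]
            # A shorter slice near the end of the text cannot be a match.
            if len(candidate) == n and candidate in ngrams:
                found.append((start, candidate))
    return found


# Hypothetical sample sentence, roughly "I like studying Chinese".
sample = "我喜欢学习中文"

# Splitting at spaces returns the whole sentence as a single token.
print(sample.split())

# A window must instead be tried at every character offset.
print(find_ngrams(sample, {"学习", "中文"}))
```

Because no delimiter marks where a word may begin, the sketch tests every offset against every N-gram length, so the work grows with both the length of the text and the number of distinct N-gram lengths. This cost is what makes keyword search in non-delimited languages comparatively expensive for real-time filtering.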