A modern organization typically maintains a data storage system to store and deliver records concerning various significant business aspects of the organization. Stored records may include data on customers (or patients), contracts, deliveries, supplies, employees, manufacturing, or the like. Such a data storage system usually relies on a tabular storage mechanism, such as relational databases, client/server applications built on top of relational databases (e.g., Siebel, SAP, or the like), object-oriented databases, object-relational databases, document stores and file systems that store table-formatted data (e.g., CSV files or Excel spreadsheet files), password systems, single-sign-on systems, or the like.
Tabular data stores are typically hosted by a computer connected to a local area network (LAN). This computer is usually made accessible to the Internet via a firewall, router, or other packet-switching device. Although the accessibility of a tabular data store via the network provides for more efficient utilization of the information it maintains, it also poses security problems due to the highly sensitive nature of this information. In particular, because access to the tabular data store is essential to the job function of many employees in the organization, there are many points at which this information may be stolen or accidentally distributed. Theft of information represents a significant business risk in terms of both the value of the intellectual property and the legal liabilities related to regulatory compliance.
Existing security techniques typically monitor messages sent by employees of an organization to outside recipients to prevent loss of sensitive information. In particular, existing security techniques usually separate a message into tokens and determine whether any subset of these tokens contains sensitive information. The tokenization process works well with word-based natural languages that provide visible delimiters (e.g., spaces and punctuation marks) between words. Word-based languages include languages utilizing the Roman alphabet (e.g., English, French, etc.), the Arabic alphabet (e.g., Arabic, Persian, etc.), the Cyrillic alphabet (e.g., Russian, Serbian, Bulgarian, etc.), and others. Character-based languages, however, do not provide visible delimiters between words. For example, Chinese and Japanese do not visually separate words. Rather, the reader is required to understand from the context where in a string of characters one word ends and the next word begins. In addition, character-based languages typically include thousands of characters and require support for multiple alphabets.
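The difference can be illustrated with a minimal sketch of delimiter-based tokenization. The function name `tokenize` and the sample messages are hypothetical and chosen only for illustration; the sketch does not represent any particular existing security product.

```python
import re


def tokenize(message: str) -> list[str]:
    """Split a message into tokens at whitespace and punctuation.

    Illustrative assumption: a token is any maximal run of Unicode
    word characters (letters, digits, underscore).
    """
    return re.findall(r"\w+", message)


# Word-based language: spaces and punctuation separate the tokens.
english = tokenize("John Smith, SSN 123-45-6789")
print(english)  # ['John', 'Smith', 'SSN', '123', '45', '6789']

# Character-based language: there are no delimiters, so the whole
# message comes back as a single undivided token.
japanese = tokenize("田中太郎の社員番号は12345です")
print(len(japanese))  # 1
```

In the second case the name, the field label, and the number are fused into one token, so a token-by-token comparison against stored records has nothing to match on.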
Current mechanisms for tokenizing content in character-based languages usually rely on dictionaries containing lists of known words in specific character-based languages. However, this approach is ineffective with names because a name in a character-based language such as Japanese or Chinese can be represented by an arbitrary combination of characters and therefore may not appear in any dictionary. Confidential information typically includes the name of an individual and an associated data identifier, such as a social security number, credit card number, or employee number. Hence, there is a need for an efficient mechanism to protect confidential information that includes data in a character-based language.
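The limitation of dictionary-based tokenization can be sketched as follows. The sketch uses forward maximum matching, one common dictionary-driven segmentation strategy, which is assumed here purely for illustration and is not stated in the text above. The function `max_match`, its `max_len` parameter, and the toy dictionary are likewise hypothetical.

```python
def max_match(text: str, dictionary: set[str], max_len: int = 4) -> list[str]:
    """Segment text by greedily taking the longest dictionary word.

    At each position, try the longest candidate first; if no dictionary
    word matches, fall back to emitting a single character.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy dictionary of known words (assumed); it contains no personal names.
known_words = {"社員", "番号", "社員番号"}

tokens = max_match("田中太郎の社員番号", known_words)
print(tokens)  # ['田', '中', '太', '郎', 'の', '社員番号']
```

The known term is recovered as one token, while the name, which is absent from the dictionary, is broken into isolated characters and can no longer be matched as a unit.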