When multiple users communicate by exchanging messages online, traces of messages left on end devices are susceptible to siphoning by unauthorized third parties, who may access either the communication devices or a chat server in order to carry out forensic audits and investigations.
User messages may involve rich data types, including but not limited to plaintext, pictures, video, audio, and markup.
User messages may also comprise a send timestamp, a receipt timestamp, sender personal identifying data, recipient personal identifying data, sender device tracking data, recipient device tracking data, sender online preferences, and recipient online preferences.
User messages may also comprise sensitive data that requires authorization from a data subject other than the sender and the recipient, where such sensitive data may include passwords, third-party identifying data, and trade secrets.
Traces of messages left on end devices are susceptible to both internal and external threats, including but not limited to: stolen or lost devices, online tracking companies, Trojan horse programs, and accidental forwarding of confidential messages by the user.
User messages may also include multiple languages, and may traverse a network spanning multiple countries and jurisdictions. User messages may be encrypted or may be encoded.
A common weakness of encoding techniques is the frequency-based attack, in which chat messages are sampled over a long period of time to obtain the most frequently occurring encoded values representing letters. These values are then matched against the small, finite set of the most frequently used letters in the English language. Based on this dataset, the encoding scheme can very easily be decoded by applying known speech patterns of English.
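The letter-frequency attack described above can be illustrated with a minimal Python sketch. It assumes a simple one-to-one substitution encoding, and the names `ENGLISH_ORDER`, `rank_symbols`, and `guess_mapping` are hypothetical, introduced only for illustration. The English letter ordering is approximate and varies by corpus.

```python
from collections import Counter

# Approximate English letters from most to least frequent.
# This ordering is an assumption; published orderings differ slightly.
ENGLISH_ORDER = "etaoinshrdlcumwfgypbvkjxqz"


def rank_symbols(encoded_text: str) -> list[str]:
    """Return the alphabetic symbols of the sample, most frequent first."""
    counts = Counter(ch for ch in encoded_text if ch.isalpha())
    return [symbol for symbol, _ in counts.most_common()]


def guess_mapping(encoded_text: str) -> dict[str, str]:
    """Pair the i-th most frequent encoded symbol with the i-th most
    frequent English letter. This is only a first guess that an attacker
    would refine using known language patterns."""
    return dict(zip(rank_symbols(encoded_text), ENGLISH_ORDER))


if __name__ == "__main__":
    sample = "xxxx yyy z"  # stand-in for a long sample of encoded chat
    print(guess_mapping(sample))
```

The longer the sampled conversation, the closer the observed symbol counts tend to track the language's letter frequencies, which is why long-term sampling strengthens the attack.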
The frequency-based attack can further compromise a conversation by means of another technique, which samples chat messages over a long period of time to identify frequently occurring phrases. Commonly used phrases that are trending at a given point in time tend to be very limited in number, even when sampled across all English-speaking populations. The size of the dataset can be reduced even further when the social background of the speakers engaged in a conversation is known, as there is a natural tendency to reuse a very small subset of vocabulary and phrases specific to a circle or profession, thereby increasing the success rate of a frequency attack.
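The phrase-based variant can be sketched in the same way. The example below assumes that encoded messages are split into whitespace-separated tokens and counts repeated token sequences. The function name `top_phrases` and the sample tokens are hypothetical.

```python
from collections import Counter


def top_phrases(messages: list[str], n: int = 2, k: int = 3):
    """Count every run of n consecutive encoded tokens across all
    messages and return the k most frequent runs with their counts."""
    counts = Counter()
    for message in messages:
        tokens = message.split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts.most_common(k)


if __name__ == "__main__":
    # Stand-in for encoded chat traffic sampled over time.
    sampled = ["q7 zz k1", "q7 zz m4", "b2 q7 zz"]
    print(top_phrases(sampled, n=2, k=1))
```

An attacker would then compare the most frequent encoded sequences against a short list of phrases common to the speakers' circle or profession. The smaller that candidate list, the fewer guesses are needed.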
In this application, the inventor has improved upon previous techniques by developing methods and apparatus for protecting the privacy of both the end user and the real party in interest, such as an employer or a data subject. Techniques are described that provide enhanced protection against frequency attacks that are alphabet-based, phrase-based, or both. Further, techniques are described to prevent leakage of conversations in cases of forensic discovery, for example when required by law or compelled by force.