1. Field of the Invention
The present invention relates generally to an improved data processing system and, in particular, to a computer implemented method and apparatus for protecting sensitive content. More particularly, the present invention is directed to a computer implemented method, apparatus, and computer usable program product for obfuscating sensitive content to prevent detection and collection by data collection bots.
2. Description of the Related Art
The World Wide Web, also referred to as the Web, is a distributed information retrieval system in which web pages formatted in Hypertext Markup Language (HTML) are linked to other web pages and retrieved via the Hypertext Transfer Protocol (HTTP). These web pages contain information such as, for example, text, audio, video, and graphic files, and may be accessed by using a web browser.
Some users of the Web may elect to post sensitive content. Sensitive content is often targeted for collection by malicious Web users. In many instances, sensitive content is personal identifying information, such as, for example, a name, telephone number, email address, social security number, or screen name. Sensitive content may also include images, videos, or any other information that may be presented on a web page or accessible via the Internet. The sensitive content may be posted on a social networking web site, a discussion forum, an online auction site, or any other web page.
Because the Web includes several billion web pages, manually searching web pages for sensitive content is a tedious and time-consuming task. Consequently, malicious users have employed data collection bots, which are applications that scan web pages to automatically detect and collect sensitive content. To prevent these data collection bots from detecting sensitive content, some Web users have obfuscated the sensitive content by implementing obfuscation algorithms. Applying an obfuscation algorithm to content forms obfuscated content. Obfuscation algorithms are sometimes referred to as CAPTCHA™ tests. CAPTCHA is an acronym for “Completely Automated Public Turing test to tell Computers and Humans Apart.” CAPTCHA™ tests often require a modicum of human intelligence to solve. For example, some CAPTCHA™ tests distort the text or eliminate the spaces between the letters to prevent data collection bots from discerning individual letters.
One example of a currently used obfuscation algorithm involves presenting sensitive content in a form that omits obvious identifiers often targeted by data collection bots. For example, a data collection bot may be programmed to identify email addresses by locating certain identifiers, such as the @ symbol in a string of characters followed by .com, .edu, or some other similar suffix. To defeat these data collection bots, the email addresses may be written to omit the identifiers. For example, a fictitious email address such as user@email.com may be posted as “user <at> email <dot> com”. However, data collection bots may evolve over time to detect these alternative ways of presenting email addresses and other similar types of sensitive content.
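By way of illustration only, the identifier-omitting obfuscation described above, and the manner in which a data collection bot may evolve to defeat it, can be sketched in Python. The function names and both regular expressions below are hypothetical and are not taken from any actual data collection bot. They assume a simple bot that searches for the @ symbol followed by a dotted suffix, and an evolved bot that also accepts the <at> and <dot> spellings.

```python
import re

# Hypothetical pattern for a simple bot: local part, "@", domain, dotted suffix.
NAIVE_EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

# Hypothetical pattern for an evolved bot that also accepts <at> and <dot>.
EVOLVED_EMAIL = re.compile(
    r"([\w.+-]+)\s*(?:@|<at>)\s*([\w-]+(?:\s*(?:\.|<dot>)\s*[\w-]+)+)",
    re.IGNORECASE,
)


def obfuscate(address: str) -> str:
    """Rewrite 'user@email.com' as 'user <at> email <dot> com'."""
    local, domain = address.split("@", 1)
    return f"{local} <at> {domain.replace('.', ' <dot> ')}"


def naive_bot(page: str) -> list[str]:
    """Collect only addresses that contain a literal @ symbol."""
    return NAIVE_EMAIL.findall(page)


def evolved_bot(page: str) -> list[str]:
    """Collect addresses written with either @/. or <at>/<dot>."""
    found = []
    for match in EVOLVED_EMAIL.finditer(page):
        # Normalize every '.' or '<dot>' separator in the domain to '.'.
        domain = re.sub(
            r"\s*(?:\.|<dot>)\s*", ".", match.group(2), flags=re.IGNORECASE
        )
        found.append(f"{match.group(1)}@{domain}")
    return found


posted = "Contact: " + obfuscate("user@email.com")
print(posted)               # Contact: user <at> email <dot> com
print(naive_bot(posted))    # []
print(evolved_bot(posted))  # ['user@email.com']
```

In this sketch the simple bot finds nothing in the obfuscated posting, while a single change to the search pattern restores detection, which is consistent with the observation that a fixed obfuscation may be circumvented once the data collection bot is updated.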
Another currently used obfuscation algorithm involves creating images of text-based sensitive content. Data collection bots searching for text-based sensitive content would be unable to recognize the information contained in the image. However, the evolution of data collection bots may eventually enable these data collection bots to discern text presented in an image. Because obfuscated content is often posted on the Web for long periods of time, malicious users have ample opportunity to evolve their data collection bots to circumvent existing and newly developed obfuscation algorithms. An improved computer implemented method, apparatus, and computer usable program product are necessary to overcome these problems.