Networks, such as the Internet, have become an increasingly important part of our everyday lives. Millions of people now access the Internet on a daily basis to shop for goods and services, obtain information of interest (e.g., movie listings, news, etc.), and communicate with friends, family, and co-workers (e.g., via e-mail or instant messaging).
Currently, when a person wishes to purchase a product or simply find information on the Internet, the person enters into his/her web browser a Uniform Resource Locator (URL) pertaining to a web site of interest in order to access that particular web site. The person then determines whether the product or information of interest is available at that particular web site.
When the person does not know where to go to find, for example, a desired product, the person may “search” for web sites that sell the product using a search engine. For example, suppose a person wishes to purchase a laser printer via the Internet. The person may access a web site that includes a conventional search engine. The person may enter one or more terms relating to the product, such as “laser printer,” into the search engine to attempt to locate web sites that sell that product. Searching for products or information of interest in this manner has become very popular. As such, companies often desire to have their web site(s) listed very highly in search results, thinking that a highly ranked listing will result in increased sales.
Many techniques exist that allow companies to obtain a highly ranked listing. For example, some search engines allow companies to buy certain search terms. If a search query is received with those search terms, then the company that has purchased those search terms may be ranked more highly than other companies offering the same product. In other situations, a webmaster for a company may attempt to “trick” the search engine into listing the company's web site more highly.
For example, one of the most deceptive techniques that webmasters use to trick a search engine is called “cloaking.” In this situation, a webmaster causes a different document to be displayed to users than what is presented to search engine spiders. Webmasters may attempt to hide text and/or links from users, but not from search engine spiders, in order to cause their documents to be ranked more highly than their competitors' documents. When hiding text, webmasters may make the text color the same as or similar to the color of the background. Therefore, the text is not visible to a user viewing the document, but would still be considered by search engines that rank documents based on words in the document. A related trick is to use an image that is the same or very similar in color to the text that the webmaster wants to hide. The image can be a background image or another type of image. For example, a webmaster may place a small blue bar image in the middle of the displayed document with blue text mostly on top of or underneath the image.
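The color-matching trick described above can be illustrated with a minimal sketch. The function names, the hexadecimal color inputs, and the threshold of 16 are assumptions chosen for illustration and do not come from any particular search engine; the sketch only shows how text whose color nearly matches its background might be flagged.

```python
def parse_hex_color(value):
    """Convert '#rgb' or '#rrggbb' into an (r, g, b) tuple of integers."""
    value = value.strip().lstrip("#")
    if len(value) == 3:
        # Expand shorthand such as 'fff' to 'ffffff'.
        value = "".join(ch * 2 for ch in value)
    if len(value) != 6:
        raise ValueError("unsupported color format: %r" % value)
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))


def color_distance(a, b):
    """Largest per-channel difference between two (r, g, b) tuples."""
    return max(abs(x - y) for x, y in zip(a, b))


def is_text_likely_hidden(text_color, background_color, threshold=16):
    """Return True when the two colors are the same or nearly the same.

    The threshold is a hypothetical tuning value, not an established constant.
    """
    distance = color_distance(parse_hex_color(text_color),
                              parse_hex_color(background_color))
    return distance <= threshold
```

Under these assumptions, `is_text_likely_hidden("#fefefe", "#ffffff")` flags near-white text on a white background, while black text on white is not flagged. A per-channel comparison is only one possible similarity measure; a perceptual color difference could be substituted.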
One technique for hiding links involves the use of a very small image (e.g., a 1×1 pixel Graphics Interchange Format (GIF) image) that is used as a hyperlink. The image can be made so small that it is not visible to users viewing the document, but the link may still be considered by search engines when ranking documents. In other situations, large images (e.g., 300 pixels wide and 200 pixels high) that are hyperlinks may be used that are the same color as, or a similar color to, the background.
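The small-image link trick can be sketched as follows, using Python's standard `html.parser` module. The class name, the helper function, and the `max_side` cutoff are hypothetical. The sketch assumes that image dimensions are given as integer `width` and `height` attributes, and it ignores images sized by other means.

```python
from html.parser import HTMLParser


class TinyImageLinkFinder(HTMLParser):
    """Collect hrefs of links whose image is at most max_side pixels per side."""

    def __init__(self, max_side=1):
        super().__init__()
        self.max_side = max_side          # hypothetical cutoff in pixels
        self._href_stack = []             # hrefs of currently open <a> tags
        self.suspicious_links = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "a":
            self._href_stack.append(attr.get("href"))
        elif tag == "img" and self._href_stack and self._href_stack[-1]:
            try:
                width = int(attr.get("width") or "")
                height = int(attr.get("height") or "")
            except ValueError:
                return  # dimensions missing or not plain integers
            if width <= self.max_side and height <= self.max_side:
                self.suspicious_links.append(self._href_stack[-1])

    def handle_endtag(self, tag):
        if tag == "a" and self._href_stack:
            self._href_stack.pop()


def find_tiny_image_links(html):
    """Return the hrefs of hyperlinks that wrap a very small image."""
    finder = TinyImageLinkFinder()
    finder.feed(html)
    return finder.suspicious_links
```

A large image that matches the background color would not be caught by this size check; it would need a color comparison against the rendered image instead.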
Webmasters also use Cascading Style Sheets (CSS) and JavaScript to hide text and links in a document. For example, CSS allows webmasters to mark a block of text as “hidden.” Text in a document can also be set to a font size of 1 pixel high, for example, so as not to be visible to viewers of the document. CSS also allows text to be positioned using absolute numbers/spacing. Therefore, webmasters can position text or links to the left/right or above/below the visible area. CSS allows for layers of elements to be presented to users. For example, the “Z-ordering” of an element (e.g., text) can be set such that the layer with text is obscured below a visible layer. Webmasters may also use JavaScript to dynamically modify a document so as to include one or more of the tricks described above. JavaScript can also be used to dynamically modify a document by removing original content from the document and replacing it with new content. Webmasters may store the JavaScript and CSS in external files, which search engine spiders normally do not fetch. This makes detection of these tricks more difficult.
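Several of the CSS tricks described above can be checked against an element's inline style, as in the following minimal sketch. The function names and signal labels are hypothetical, and the rules are simplifying assumptions: only pixel units are parsed, only negative left/top offsets count as off-screen, and a negative `z-index` is treated as a hint that the element sits below a visible layer. Styles and scripts kept in external files are outside the scope of this sketch.

```python
import re

_PX = re.compile(r"^(-?\d+(?:\.\d+)?)px$")


def parse_inline_style(style):
    """Split an inline style string into a {property: value} dict."""
    declarations = {}
    for part in style.split(";"):
        if ":" in part:
            name, value = part.split(":", 1)
            declarations[name.strip().lower()] = value.strip().lower()
    return declarations


def _pixels(value):
    """Return the numeric part of a 'Npx' value, or None if not in pixels."""
    match = _PX.match(value or "")
    return float(match.group(1)) if match else None


def hiding_signals(style):
    """List the hiding tricks suggested by an element's inline style."""
    css = parse_inline_style(style)
    signals = []
    if css.get("display") == "none" or css.get("visibility") == "hidden":
        signals.append("marked-hidden")
    size = _pixels(css.get("font-size"))
    if size is not None and size <= 1:
        signals.append("tiny-font")
    if css.get("position") == "absolute":
        offsets = (_pixels(css.get(side)) for side in ("left", "top"))
        if any(o is not None and o < 0 for o in offsets):
            signals.append("off-screen")
    if re.fullmatch(r"-\d+", css.get("z-index", "")):
        signals.append("negative-z-index")
    return signals
```

For instance, under these assumptions `hiding_signals("position:absolute; left:-9999px")` reports an off-screen element. Each signal is only a hint, since the same properties also have legitimate uses.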
Therefore, there exists a need for systems and methods for improving the ability to detect hidden items in a document.