Within a network perimeter, there is often content that may be provided or otherwise exposed to entities outside of the perimeter. For a variety of reasons, there may be a subset of the content that should not be exposed externally, or at least not exposed without a certain level of authorization. The subset of content may include confidential data, such as individual social security numbers or other personal identification, account information, confidential documents, or the like. The subset of content may also include content that reveals vulnerabilities of a data center inside the perimeter.
Search engines may be used to search for and retrieve content. A web crawler retrieves pages or other content from a web site, indexes the data, and makes the data or pointers to the data available to a search engine. An external adverse party may, for example, search for keywords or phrases, such as “social security number,” retrieve numerous pages, and find actual social security numbers on some of the pages. Searching may be performed broadly, in the hope of discovering sensitive information or vulnerabilities somewhere, or it may be focused. A focused search may look in a specific web site, for a specific name, or for a known keyword associated with sensitive data. For example, an adverse party may search for a code name for a confidential project at a company, hoping to find a document that is intended to remain internal, but was inadvertently exposed outside of the company's perimeter.
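The keyword search described above can be illustrated with a minimal sketch. It assumes the crawled pages are already available as a mapping from URL to page text; the names `SSN_PATTERN`, `KEYWORD`, and `find_exposed_numbers` are hypothetical, and the pattern only checks the common ddd-dd-dddd layout, not whether a number is valid.

```python
import re

# Assumed layout of a social security number: three digits, two digits, four digits.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# The keyword phrase used in the example above.
KEYWORD = "social security number"


def find_exposed_numbers(pages):
    """Return {url: [matching strings]} for pages that mention the keyword
    and also contain strings formatted like social security numbers."""
    hits = {}
    for url, text in pages.items():
        # Broad step: keep only pages that contain the keyword phrase.
        if KEYWORD not in text.lower():
            continue
        # Narrow step: look for number-shaped strings on those pages.
        matches = SSN_PATTERN.findall(text)
        if matches:
            hits[url] = matches
    return hits


if __name__ == "__main__":
    # Placeholder pages with an obviously fake number.
    sample_pages = {
        "https://example.com/a": "Social Security Number: 000-00-0000",
        "https://example.com/b": "Enter your social security number below.",
    }
    print(find_exposed_numbers(sample_pages))
```

The same scan could equally be run by a site owner over the site's own externally visible pages to find what a search would reveal.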
An adverse party may search for data centers that contain vulnerabilities, or for vulnerabilities within a data center. One way this can be done is by searching for content that indicates a specific vulnerability or a type of vulnerability. For example, if a specific version of a software application or operating system is known to have vulnerabilities, an adverse party may search for documents produced with that version. The existence of such documents may suggest that the software version is in use at the data center. Characteristics of the documents that may indicate a corresponding software component are referred to as “fingerprints” of the component or vulnerability.
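As one illustration of a fingerprint, many authoring tools record their name and version in an HTML `generator` meta tag. The sketch below assumes that form of fingerprint only; the product names in `VULNERABLE_GENERATORS` are hypothetical placeholders, and `GeneratorParser` and `find_fingerprints` are illustrative names rather than part of any described system.

```python
from html.parser import HTMLParser

# Hypothetical fingerprints: generator strings assumed to identify
# software versions with known vulnerabilities.
VULNERABLE_GENERATORS = {"ExampleCMS 2.1", "ExampleWriter 7.0"}


class GeneratorParser(HTMLParser):
    """Collects the content of every <meta name="generator"> tag."""

    def __init__(self):
        super().__init__()
        self.generators = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        # Attribute values may be None when the attribute has no value.
        if (attr_map.get("name") or "").lower() == "generator":
            content = attr_map.get("content")
            if content:
                self.generators.append(content.strip())


def find_fingerprints(html):
    """Return generator strings in the document that match a known fingerprint."""
    parser = GeneratorParser()
    parser.feed(html)
    parser.close()
    return [g for g in parser.generators if g in VULNERABLE_GENERATORS]


if __name__ == "__main__":
    page = '<html><head><meta name="generator" content="ExampleCMS 2.1"></head></html>'
    print(find_fingerprints(page))
```

Other document characteristics, such as file metadata or distinctive markup, could serve as fingerprints in the same way; the meta tag is used here only because it is simple to parse.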
Sensitive data may be deleted or restricted by the web site after it has been crawled, yet remain in the search engine's cache of content and continue to be provided to searchers. Web archive servers may likewise crawl web sites, retrieve sensitive content, and archive it, so that the data can be retrieved years after it was removed from the web site.