The explosive growth of the Internet poses an enormous challenge in detecting malicious web pages. A very large number of websites exist, and many more new sites and pages are launched on a regular basis. Analyzing all of the pages of all of these websites to detect those that are malicious (e.g., those that distribute malicious code such as computer viruses, worms, Trojan horses, etc., and those that attempt to glean personal information for malicious purposes such as identity theft) is an enormous undertaking. Yet, the growing Internet customer base demands full, expanded Internet coverage without sacrificing the efficacy of malicious site detection. Scanning every web page of the Internet is thorough, but very slow and expensive, considering the vast number of web pages and the frequency with which they change. Various algorithms exist for performing a full scan of the Internet, such as, for example, scanning in a conventional breadth-first order (i.e., first scanning all root pages, then the pages at the first level of embedding, then those at the second level, etc.). However, a conventional full scan of the Internet (e.g., in breadth-first or another conventional order) for detecting malicious pages simply leaves too many pages unanalyzed at any given time.
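The conventional breadth-first scan order described above can be sketched as follows. This is a minimal illustration, not any particular product's implementation; the helpers `get_links` (returning the URLs a page links to) and `scan_page` (analyzing a page for maliciousness) are hypothetical stand-ins supplied by the caller.

```python
from collections import deque

def breadth_first_scan(root_pages, get_links, scan_page, max_pages=1000):
    """Scan pages in conventional breadth-first order: all root pages
    first, then pages at the first level of embedding, then the second,
    and so on. get_links and scan_page are caller-supplied callbacks."""
    queue = deque((url, 0) for url in root_pages)  # (url, depth) pairs
    seen = set(root_pages)
    scanned = []
    while queue and len(scanned) < max_pages:
        url, depth = queue.popleft()   # FIFO pop yields breadth-first order
        scan_page(url)                 # e.g., analyze the page for malicious code
        scanned.append((url, depth))
        for link in get_links(url):
            if link not in seen:       # avoid re-scanning already-queued pages
                seen.add(link)
                queue.append((link, depth + 1))
    return scanned
```

Because the queue is first-in, first-out, no page at depth *n*+1 is scanned before every discovered page at depth *n*, which is exactly why such a scan is slow to reach deeply embedded pages.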
Focused crawling is a conventional technique used for mining specific data from web pages. In conventional focused crawling, web pages are analyzed out of breadth-first order, based on certain pages containing desired content (i.e., the content of interest to the data mining application). The content of the sites is used to determine the focus (i.e., the order in which to analyze sites outside of a breadth-first crawl). Typically, focused crawling efforts target desired content, located within benign parts of the Internet, that is not relevant to the maliciousness of a site. Some effort has been made to use focused crawling to identify malicious sites based on the presence of certain content-based features of individual web pages. However, this type of focused crawling still requires at least a preliminary look at the content of the pages themselves, in order to determine which pages have content indicative of maliciousness and are thus worthy of priority analysis. Conducting even a preliminary analysis of the content of each web page is expensive. Additionally, the content-based features indicative of a site being possibly malicious can change faster than manually updated analysis tools can keep pace with, as the malicious parties behind these sites regularly change their strategies to remain undetected. Thus, efforts to identify sites indicative of maliciousness based on site content are only as effective as they are current, and the factors indicative of malicious content change frequently.
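The general focused-crawling technique described above, in which pages are visited in order of a content-derived relevance score rather than in breadth-first order, can be sketched as follows. This is an illustrative sketch only; `get_links`, `score_content` (a hypothetical function scoring a page's content for relevance, which, as noted above, requires at least a preliminary look at the page itself), and `analyze` are assumed caller-supplied callbacks, not part of any specific system.

```python
import heapq

def focused_crawl(seeds, get_links, score_content, analyze, max_pages=1000):
    """Visit pages in descending order of a content-based relevance score,
    rather than in breadth-first order. heapq is a min-heap, so scores
    are negated to pop the highest-scoring page first."""
    heap = [(-score_content(url), url) for url in seeds]
    heapq.heapify(heap)
    seen = set(seeds)
    visited = []
    while heap and len(visited) < max_pages:
        _, url = heapq.heappop(heap)   # highest-scoring page so far
        analyze(url)                   # full analysis of the chosen page
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                # scoring each discovered page is the "preliminary look"
                # at content that makes this approach expensive
                heapq.heappush(heap, (-score_content(link), link))
    return visited
```

Note that `score_content` must be evaluated for every discovered page before that page can be prioritized, which is precisely the per-page preliminary content analysis whose cost is criticized above.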
It would be desirable to address these issues.