Big data is a term for data sets that are so large or complex that traditional data processing applications cannot adequately process them. Data sets considered big data typically include large volumes of structured, semi-structured, and/or unstructured data that can potentially be mined for information. Big data is playing an increasingly critical role in driving rapid business growth. Most enterprises and organizations have now recognized the significance of big data and have begun investigating suitable approaches to leveraging it for various purposes. However, before big data can be leveraged and analyzed to derive value, it has to be captured and stored.
Among the diverse sources of big data, the fast-expanding World Wide Web (referred to herein simply as the web), connected by the Internet, is an extremely important source and is of great interest to big data advocates. Many commercial and research institutions run their own web crawling systems (web crawlers) to capture data from the web. Web crawlers are an important component of web search engines, where they are used to collect the corpus of web pages that the search engine indexes. Web crawlers are also used in many other applications that process large numbers of web pages, such as web data mining and comparison shopping engines.