This disclosure relates generally to web crawling in a data processing system and more specifically to task assignment for web crawling in the data processing system.
Known prior art describes designing efficient ways to crawl websites and web applications in a distributed environment, which has been the topic of extensive research for over 20 years. Many efficient and scalable solutions are offered using distributed platforms. At the core of a distributed crawling algorithm is a partitioning algorithm, through which crawlers use a mapping function to avoid or reduce duplication of work among the crawlers while performing required tasks. A task is often visiting a page and performing some computation on the page visited, for example crawling for indexing or crawling to perform security testing. Known prior art states that the two main categories of partitioning algorithms are static assignment and dynamic assignment.
Static assignment is an approach in which each worker in a set of homogeneous workers is allocated a unique ID. The mapping function maps each task to one of the unique assigned IDs in the system. Upon encountering a task, a crawler examines the task and decides whether the task falls under the jurisdiction of the crawler or belongs to another node. When the task falls under the jurisdiction of the crawler, the node takes care of the task autonomously. When the task does not fall under the jurisdiction of the crawler, the current node informs the node responsible for the task.
Different proposals suggest different metrics and algorithms to derive the mapping function. Typical approaches use parameters including a hash of the uniform resource locator (URL), a geographical location of the server, and a URL hierarchy. Further, different approaches represent different trade-offs between the overhead associated with duplication of work and the overhead associated with communication among nodes.
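By way of illustration only, the static assignment approach described above can be sketched as follows. The sketch assumes workers are numbered 0 to n-1 and that the mapping function is either a hash of the full URL or a hash of the host name; the function names owner_by_url_hash, owner_by_host, and handle_task are hypothetical and are not taken from any particular prior art system.

```python
import hashlib
from urllib.parse import urlsplit


def _stable_hash(text: str) -> int:
    """Return a hash that is identical on every node (unlike built-in hash())."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def owner_by_url_hash(url: str, num_workers: int) -> int:
    """Hypothetical mapping function: hash of the whole URL."""
    return _stable_hash(url) % num_workers


def owner_by_host(url: str, num_workers: int) -> int:
    """Hypothetical mapping function: all pages of one host go to one worker."""
    host = urlsplit(url).hostname or ""
    return _stable_hash(host) % num_workers


def handle_task(url, my_id, num_workers, mapping=owner_by_url_hash):
    """Decide whether this node processes the task or informs the owner."""
    owner = mapping(url, num_workers)
    if owner == my_id:
        return ("process", my_id)   # task falls under this node's jurisdiction
    return ("forward", owner)       # inform the responsible node
```

Hashing the whole URL tends to spread tasks more evenly, whereas hashing the host name keeps links within one site on one node and so may reduce communication between nodes; this is one instance of the trade-off noted above.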
Dynamic assignment is another approach in which one or more centralized control units track discovered tasks and executed tasks. Upon discovering a task, each node informs the centralized control units, which add the discovered task to their respective queues. Known prior art shows that during crawling, all nodes constantly ask the centralized control units for workload.
Known prior art roughly follows an architecture in which a centralized unit called a URL server stores a list of URLs and orders a set of slave nodes to download each URL. All downloaded pages are stored in a unit called a store server. The retrieved pages are then indexed in a distributed manner. Known prior art shows that both the downloading tasks and the indexing tasks require centralized units of coordination.
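By way of illustration only, the dynamic assignment approach can be sketched with a single centralized control unit holding a queue of discovered tasks. The class CentralUnit, the function crawl, and the link_graph dictionary standing in for real page downloads are all hypothetical; the workers are simulated in a single thread in round-robin order rather than run concurrently.

```python
from collections import deque


class CentralUnit:
    """Hypothetical central unit tracking discovered and handed-out tasks."""

    def __init__(self, seeds=()):
        self._queue = deque()   # discovered but not yet assigned
        self._seen = set()      # every task ever reported
        for url in seeds:
            self.report_discovered(url)

    def report_discovered(self, url):
        """Called by a node upon discovering a task; duplicates are ignored."""
        if url in self._seen:
            return False
        self._seen.add(url)
        self._queue.append(url)
        return True

    def request_work(self):
        """Called by a free node; returns None when no work is queued."""
        return self._queue.popleft() if self._queue else None


def crawl(central, link_graph, num_workers):
    """Simulate workers pulling tasks; returns {url: worker_id}."""
    visited_by = {}
    worker = 0
    while True:
        task = central.request_work()
        if task is None:
            break
        visited_by[task] = worker
        for link in link_graph.get(task, ()):   # stands in for parsing a page
            central.report_discovered(link)
        worker = (worker + 1) % num_workers
    return visited_by
```

In this sketch every discovery and every request for work passes through the central unit, which is why no two workers visit the same page and also why the central unit carries all coordination traffic.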
There also exists a third category in which nodes work independently without any coordination and task partitioning. Known prior art algorithms in this category typically either do not guarantee the full coverage of tasks, or may suffer from work duplication.
With reference to FIG. 1, a communication pattern and work assignment of a static assignment model is illustrated. A set of nodes 100 comprises nodes 102-110 in which each node has a communication path to each other node in set of nodes 100. Each node is therefore capable of exchanging information with each other node in the set. The communication links become more complex as the number of nodes in the set of nodes increases. Network traffic typically becomes an issue as the number of nodes increases. Although nodes 102-110 are shown, the number of nodes is not limited to five and can be extended to n nodes as needed and within resources available.
With reference to FIG. 2, a communication pattern and work assignment representative of a dynamic assignment model is presented. Set of nodes 200 comprises nodes 202-210 and central unit 212. Each node in set of nodes 200 communicates with central unit 212. Traffic between nodes does not occur as in set of nodes 100 of FIG. 1. Each node in set of nodes 200 relies on central unit 212 for assignment of tasks. Although nodes 202-210 are shown, the number of nodes is not limited to five and can be extended to n nodes as needed.
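The difference in communication patterns between FIG. 1 and FIG. 2 can be made concrete by counting communication paths. The following is a minimal sketch, assuming one bidirectional path per pair of nodes in the static model and one path from each node to each central unit in the dynamic model; the function names are hypothetical.

```python
def static_model_links(num_nodes: int) -> int:
    """Paths when every node is connected to every other node (as in FIG. 1)."""
    return num_nodes * (num_nodes - 1) // 2


def dynamic_model_links(num_nodes: int, num_central_units: int = 1) -> int:
    """Paths when every node talks only to the central units (as in FIG. 2)."""
    return num_nodes * num_central_units


for n in (5, 50, 500):
    print(n, static_model_links(n), dynamic_model_links(n))
```

For the five nodes shown in each figure this gives 10 paths in the static model and 5 paths in the dynamic model. The number of paths in the static model grows quadratically with the number of nodes, whereas in the dynamic model the number grows linearly but every path terminates at a central unit.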
Dynamic assignment of tasks, as in FIG. 2, creates a natural and typically load-balanced policy because working nodes only ask for new workload when the working nodes are free; thus no working node becomes a bottleneck. Static assignment of tasks, however, may lead to one node or a set of nodes becoming a bottleneck. Theoretically, the randomization created by good mapping functions can distribute tasks equally among the workers. Practically, however, nondeterministic behavior of a system created by factors including network delays, the operating system scheduler, and server response time widens the gap between the theoretical distribution and the load actually observed. Newly emerging technologies such as cloud environments and heterogeneous computing increase the difficulty of creating a correct mapping function to achieve ideal load balancing.
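The load balancing difference described above can be illustrated with a small simulation. This is a sketch under simplifying assumptions: task costs are known numbers in arbitrary time units, the static mapping is simply the task index modulo the number of workers, communication costs are ignored, and the task costs in the example are invented for illustration.

```python
import heapq


def static_makespan(costs, num_workers):
    """Finish time when task i is pre-assigned to worker i % num_workers."""
    load = [0] * num_workers
    for i, cost in enumerate(costs):
        load[i % num_workers] += cost
    return max(load)


def dynamic_makespan(costs, num_workers):
    """Finish time when the next task goes to whichever worker is free first."""
    free_at = [(0, w) for w in range(num_workers)]   # (time free, worker id)
    heapq.heapify(free_at)
    finish = 0
    for cost in costs:
        t, w = heapq.heappop(free_at)                # earliest free worker
        heapq.heappush(free_at, (t + cost, w))
        finish = max(finish, t + cost)
    return finish


# Invented costs: two slow pages that the static mapping sends to one worker.
costs = [8, 1, 1, 1, 8, 1, 1, 1]
print(static_makespan(costs, 4))    # 16
print(dynamic_makespan(costs, 4))   # 9
```

In this example each worker receives the same number of tasks under the static mapping, yet one worker receives both slow tasks and becomes the bottleneck. Under the pull-based policy, the second slow task goes to a worker that has already finished a fast task.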
Static assignment of tasks, as in FIG. 1, enjoys a peer-to-peer architecture, which helps to avoid any single unit becoming a bottleneck and therefore provides an opportunity for a scalable solution. The dynamic assignment approach, as in FIG. 2, requires central units to track all discovered tasks and visited tasks. The central units can accordingly become bottlenecks and lead to scalability issues.