The Internet is a worldwide system of computer networks. One of the most popular parts of the Internet is the World Wide Web (“www”), which contains hypermedia content accessible to tens of millions of people worldwide. The hypermedia content is usually organized as web pages formatted in the Hypertext Markup Language (“HTML”). The HTML documents, i.e., web pages, and other media are transmitted from web servers through the networks using the Hypertext Transfer Protocol (“HTTP”). A web page typically contains embedded references to resources such as images, audio, video, documents, or other web pages. By selecting “hyperlinks” or “links” on the web page, a user can access the resources referenced by the web page being browsed.
Web pages can be either static or dynamic. A static web page contains fixed content that does not change in response to user requests. A dynamic web page contains dynamically generated content. Based on a user's request, a web server can return a dynamic web page containing content that is dynamically generated from information stored in a database. Because of the large number of web pages and web servers on the Internet, information is often poorly organized, and it can be difficult for users to locate the particular web pages that contain information of interest to them. Search engines and other web-based intelligent service systems have been developed to go through large numbers of web pages and to retrieve and organize the information. These engines and systems generally include at least one “miner process” (also referred to as a “miner”, “crawler”, “spider”, or “robot”) that fetches web pages from different web sites as a means of retrieving up-to-date information.
To maintain the quality and freshness of the information that a system is organizing, it is common to run thousands of miner processes at any given time. Often the miner processes are fetching HTTP content such as web pages, images and the like. It is common for different miners to fetch concurrently from the same web site, trying to retrieve information from the same or different web pages. Each miner connects to a web site via at least a three-step negotiation process. The actual request from the miner for the content can only be issued after the negotiation process is complete. When repeated across many requests for web pages, the negotiation process can have a significant impact on performance. The most common type of content fetched by miners is small content, such as a web page or a small image. For these requests for small content, there is usually only a single step for the request and another single step for the response after the three-step negotiation process. This means that the miner uses more network resources negotiating the connection than it does getting the content. If the miner communicates with the server using a secure connection, such as Secure Sockets Layer (“SSL”) or Hypertext Transfer Protocol Secure (“HTTPS”), even more network resources will be spent negotiating the connection.
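By way of illustration only, the overhead described above can be tallied in a short Python sketch. The sketch simply counts steps: three for the negotiation, one for the request, and one for the response, as stated in the preceding paragraph. The names FetchCost, small_fetch_cost, and total_negotiation_steps are hypothetical and are introduced here only for the example. The parameter extra_secure_steps is likewise an assumption: the number of additional steps needed for a secure connection is not given in the text and depends on the protocol and version in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FetchCost:
    """Step counts for one small fetch over a freshly negotiated connection."""

    negotiation_steps: int  # steps spent setting up the connection
    content_steps: int      # steps spent on the request and the response

    @property
    def total_steps(self) -> int:
        return self.negotiation_steps + self.content_steps

    @property
    def negotiation_share(self) -> float:
        """Fraction of all steps that carry no content."""
        return self.negotiation_steps / self.total_steps


def small_fetch_cost(extra_secure_steps: int = 0) -> FetchCost:
    """Model one small fetch: 3 negotiation steps + 1 request + 1 response.

    extra_secure_steps is an assumed placeholder for any additional
    negotiation a secure connection requires; the text gives no figure.
    """
    if extra_secure_steps < 0:
        raise ValueError("extra_secure_steps must be non-negative")
    return FetchCost(negotiation_steps=3 + extra_secure_steps, content_steps=2)


def total_negotiation_steps(num_fetches: int, extra_secure_steps: int = 0) -> int:
    """Negotiation steps spent when every fetch opens its own connection."""
    return num_fetches * small_fetch_cost(extra_secure_steps).negotiation_steps


if __name__ == "__main__":
    plain = small_fetch_cost()
    print(f"negotiation share of a plain small fetch: {plain.negotiation_share:.0%}")
    print(f"negotiation steps for 1000 such fetches: {total_negotiation_steps(1000)}")
```

Under these assumptions, a plain small fetch spends three of its five steps, i.e., 60%, on negotiation rather than on content, and any positive value of extra_secure_steps raises that share further.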
Furthermore, some web sites and servers try to block robotic requests, including miner requests, for various reasons, such as reducing the number of connections to the web server or deliberately preventing the miner from retrieving a large amount of information from the web site. Currently, only very limited approaches can be employed to work around the blocking. These work-around approaches usually make requests even slower and less efficient. Moreover, a less efficient miner approach uses an excessive number of connections, resulting in an even higher likelihood of the miner being blocked. Therefore, there is a need for an efficient approach to establishing and routing miner connections.
The foregoing examples of the related art and limitations related therewith are intended to be illustrative and not exclusive. Other limitations of the related art will become apparent upon a reading of the specification and a study of the drawings.