1. Field of the Invention
Exemplary embodiments of the present invention relate to a system and method for collecting a document from a host, and more particularly, to a system and method for collecting a document for which updates may occur.
2. Discussion of the Background
Generally, a search service business may collect contents of documents from a plurality of sites that exist on the Internet, using a web robot. The web robot may collect the contents included in the documents using a crawling technique of a random access scheme. The search service business may randomly extract a seed uniform resource locator (URL), and may collect documents using the web robot based on the extracted seed URL. In some cases, the contents of a collected document may be unrelated to the URL of that document.
When a document is collected by the above-described method, random access by the web robot may overload a host of a website. Also, due to the random collection performed by the web robot, the search service business may provide, as part of a search result, documents unrelated to a search request. Accordingly, the search service business may experience difficulty in analyzing a result of collecting documents because the URLs and the document contents may be unrelated.
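By way of illustration only, the random-access collection described above, and the host load it may produce, can be sketched as follows. The sketch is a simulation under stated assumptions: the link graph `LINKS`, the `a.example` and `b.example` hosts, and the function name `random_seed_crawl` are hypothetical and do not appear in the original text, and no network request is actually made.

```python
import random
from collections import Counter, deque
from urllib.parse import urlparse

# Hypothetical in-memory link graph standing in for the web:
# each URL maps to the URLs it links to.
LINKS = {
    "http://a.example/": ["http://a.example/1", "http://b.example/"],
    "http://a.example/1": ["http://a.example/2"],
    "http://a.example/2": [],
    "http://b.example/": ["http://a.example/2", "http://b.example/x"],
    "http://b.example/x": [],
}


def random_seed_crawl(links, rng, max_pages=10):
    """Pick a random seed URL, then follow links breadth-first.

    Returns the visit order and the number of simulated requests per host.
    """
    seed = rng.choice(sorted(links))  # randomly extracted seed URL
    queue = deque([seed])
    seen = {seed}
    visited = []
    requests_per_host = Counter()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        # Count one simulated fetch against the host serving this URL.
        requests_per_host[urlparse(url).netloc] += 1
        visited.append(url)
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return visited, requests_per_host


if __name__ == "__main__":
    order, load = random_seed_crawl(LINKS, random.Random(0))
    print(order)
    print(dict(load))
```

Because the seed is chosen at random and no per-host limit is applied, the per-host request counts in this sketch depend only on where the crawl happens to start and which links it happens to follow, which is the situation the preceding paragraph identifies as a possible source of host overload.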
In view of the foregoing, there is a need for a system and method for accurately collecting web documents without overloading a host of a website.