1. Field of the Invention
Exemplary embodiments of the present invention relate to a method, system and software executed by a processor associated with a non-transitory computer-readable storage medium to detect a trap of a web-based calendar page and to provide retrieval data. More particularly, exemplary embodiments of the present invention relate to generation of a regular expression based on characteristics of a web-based calendar page, to detection of a trap of web-based calendar pages through the regular expression, and to building of a retrieval database by deleting the detected web trap from a database and subsequent application of the generated regular expression.
2. Discussion of the Related Art
As Internet access continues to increase, users increasingly depend upon Internet search engines as a quick and simple way to obtain information. For example, a user connects to an Internet search engine over a network by inputting an identifier such as a Uniform Resource Locator (URL) into the address bar of a web browser on a terminal such as a personal computer, and then inputs search words to obtain results related to various fields of information, such as news, knowledge, games, communities, and web pages.
As such, in order to provide suitable content for users, providers of Internet search engines have developed search engines capable of collecting suitable web pages, indexing the collected web pages, and providing retrieval results to the users based on the indexed web pages. In particular, web crawlers are mainly used to index the World Wide Web in a methodical and automated manner.
In operation, a web crawler generally starts with a list of URLs to visit, called “seeds”. The crawler then identifies all of the hyperlinks in the pages at those URLs and adds them to the list of URLs to visit, which are in turn visited recursively.
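The crawling procedure described above can be illustrated with a minimal sketch. It is not taken from any particular crawler: the function names `crawl` and `toy_links`, the `max_pages` limit, and the in-memory `TOY_WEB` link table under `example.com` are all assumptions made for illustration, and no network access is performed.

```python
from collections import deque

def crawl(seeds, get_links, max_pages=100):
    """Visit URLs breadth-first, starting from the seed list.

    get_links is a caller-supplied function (hypothetical here) that
    returns the hyperlinks found on the page at a given URL.
    """
    frontier = deque(seeds)   # list of URLs still to visit
    seen = set(seeds)         # URLs already queued, to avoid revisits
    visited = []              # URLs in the order they were visited
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)  # renew the list of URLs
    return visited

# Hypothetical three-page site standing in for the Web.
TOY_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b"],
    "http://example.com/b": ["http://example.com/"],
}

def toy_links(url):
    return TOY_WEB.get(url, [])

pages = crawl(["http://example.com/"], toy_links)
```

On a finite set of pages such as this one, the list of URLs to visit eventually empties and the crawl stops on its own; the `max_pages` limit is never reached.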
However, a conventional web-based calendar page may include hyperlinks to web pages for previous and subsequent months, or to web pages for previous and subsequent years, weeks, and days, so that hyperlinks to web pages for unnecessary dates can be generated. If web-based calendar pages are collected by an existing web crawler in this manner, unnecessary or meaningless web pages can be collected continuously due to a web trap formed by the infinite hyperlink loop, consuming storage space for the collected results and degrading the performance of the web crawler. Moreover, an increase in the number of unnecessary or meaningless web pages entails an increase in load on the search engine.
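The web trap described above can be made concrete with a small simulation. The URL scheme `http://example.com/calendar?year=Y&month=M` and the names `calendar_links` and `count_reachable` are hypothetical, chosen only to show why such a crawl does not terminate; this is not the detection method of the invention.

```python
import re

# Hypothetical calendar URL scheme, assumed for illustration only.
MONTH_URL = re.compile(r"^http://example\.com/calendar\?year=(\d+)&month=(\d+)$")

def calendar_links(url):
    """Return the 'previous month' and 'next month' links of a calendar page."""
    m = MONTH_URL.match(url)
    if m is None:
        return []
    year, month = int(m.group(1)), int(m.group(2))
    index = year * 12 + (month - 1)  # months counted from year 0
    links = []
    for neighbor in (index - 1, index + 1):
        y, mo = divmod(neighbor, 12)
        links.append(f"http://example.com/calendar?year={y}&month={mo + 1}")
    return links

def count_reachable(seed, budget):
    """Crawl from seed until budget pages are fetched.

    Returns (pages fetched, URLs still waiting in the frontier).
    """
    seen = {seed}
    frontier = [seed]
    fetched = 0
    while frontier and fetched < budget:
        url = frontier.pop(0)
        fetched += 1
        for link in calendar_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return fetched, len(frontier)

seed = "http://example.com/calendar?year=2024&month=1"
small = count_reachable(seed, 50)
large = count_reachable(seed, 500)
```

In this sketch every fetched calendar page contributes one month that has not been seen before, so the frontier is still non-empty whenever the budget runs out, however large the budget is. Without an external limit or a way to recognize such URLs, the crawler keeps collecting calendar pages indefinitely.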
Therefore, there is a need for an approach that addresses the problems described above.