This invention relates to a method and system for resolving Uniform Resource Locators (URLs) from script code located in websites for the purpose of website crawling.
The World Wide Web available on the Internet provides a variety of specially formatted documents called web pages. Web pages are traditionally formatted in a language called HTML (HyperText Markup Language). Many web pages include links to other web pages, which may reside in the same website or in a different website, and which allow users to jump from one page to another simply by clicking on the links. Each link uses a URL to identify its target page. URLs are the global addresses of web pages and other resources on the World Wide Web.
As web technology evolves, websites become more and more complex. The tendency in website development is to move from using purely static HTML to using HTML together with script code to provide enhanced functionality. As a result, it is now common to use script code to construct web page links, i.e., to create URLs dynamically. Often the process of dynamically constructing URLs involves many variables and some rather complex script code. This makes such URLs very difficult to resolve, i.e., to extract and obtain, during website crawling.
Website crawling, or spidering, is the process of automatically scanning the contents of websites by following links and fetching web pages. Web crawling agents, or “spiders”, are software programs that perform the crawling over websites. Typically, existing web crawling agents are used to find specific information of interest on the Web.
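By way of illustration only, the follow-links-and-fetch process described above can be sketched as a breadth-first traversal. The sketch below is a minimal example under stated assumptions: the `SITE` dictionary is a hypothetical in-memory stand-in for a website, the `crawl` function and its `fetch_links` parameter are illustrative names, and the example.com addresses are placeholders. It is not a description of any particular existing crawling agent.

```python
from collections import deque

# Hypothetical in-memory "website": each URL maps to the links found on that page.
SITE = {
    "http://www.example.com/": [
        "http://www.example.com/a.html",
        "http://www.example.com/b.html",
    ],
    "http://www.example.com/a.html": ["http://www.example.com/b.html"],
    "http://www.example.com/b.html": [
        "http://www.example.com/",
        "http://www.example.com/c.html",
    ],
    "http://www.example.com/c.html": [],
}


def crawl(start_url, fetch_links):
    """Visit pages breadth-first, following each link at most once.

    fetch_links is an assumed callable that fetches a page and returns
    the URLs of the links found on it.
    """
    visited = []              # pages in the order they were fetched
    seen = {start_url}        # URLs already queued, to avoid revisiting
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


CRAWL_ORDER = crawl("http://www.example.com/", lambda url: SITE.get(url, []))
```

In this sketch the crawl can only reach pages whose URLs `fetch_links` is able to report, so any link that goes unresolved leaves the corresponding page unvisited.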
Before the introduction of script code into web pages, crawling agents could parse HTML code for standard URLs. Since all URLs had to be coded to the HTML specification, this task was relatively easy. However, as sites evolved, they increasingly relied upon script code to provide more advanced functionality that standard HTML did not allow. The format of the URLs in the script code varies widely from implementation to implementation. Unlike static HTML, script code is not required to follow any standard for encoding URLs. Accordingly, script code presents problems for crawling agents that need to parse URLs: there is no longer a common syntax or format for the URLs, and thus they are difficult to find consistently.
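As a minimal sketch of why static HTML is comparatively easy to handle, the following example collects the `href` attribute of each anchor tag using Python's standard `html.parser` module. The `LinkExtractor` class, the `extract_static_links` function, the sample markup, and the example.com and example.org addresses are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative references against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_static_links(html: str, base_url: str) -> List[str]:
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page containing only static links.
SAMPLE_HTML = """
<html><body>
  <a href="/products/index.html">Products</a>
  <a href="http://www.example.org/about.html">About</a>
</body></html>
"""

STATIC_LINKS = extract_static_links(SAMPLE_HTML, "http://www.example.com/home.html")
```

Because every static link appears in the same well-defined place in the markup, a single rule of this kind suffices. No comparable rule exists for URLs that are assembled by script code.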
An existing approach to this problem is to use customizable pattern-matching algorithms that statically read through the script code on a page or in a script file and, based on pattern matching, try to “guess” what in that script code might be a URL. Pattern matching provides some utility, but it has two basic problems: 1) the algorithms invariably miss URLs in the script code, and 2) the algorithms do not always extract the entire URL correctly.
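The two problems noted above can be illustrated with a small sketch. The regular expression `URL_LITERAL`, the `guess_urls` function, and the sample script are hypothetical and stand in for a pattern-matching extractor in general; they are not taken from any particular existing product. The pattern treats any quoted string that begins with `http://`, `https://`, or `/` as a candidate URL.

```python
import re

# Assumed pattern: a quoted string literal starting with a scheme or a slash.
URL_LITERAL = re.compile(r"""["']((?:https?://|/)[^"'\s]+)["']""")

# Hypothetical script code that builds one URL dynamically from variables
# and uses one URL written out in full.
SAMPLE_SCRIPT = """
var host = "http://www.example.com";
var section = "catalog";
var id = 42;
function go() {
    window.location = host + "/" + section + "/item.asp?id=" + id;
}
function help() {
    window.open("http://www.example.com/help.html");
}
"""


def guess_urls(script):
    """Return every string literal in the script that looks like a URL."""
    return [match.group(1) for match in URL_LITERAL.finditer(script)]


GUESSES = guess_urls(SAMPLE_SCRIPT)
# The URL the script actually navigates to in go(); it never appears
# as a single literal, so static matching cannot recover it.
DYNAMIC_URL = "http://www.example.com/catalog/item.asp?id=42"
```

In this sketch, the literal URL passed to `window.open` is found, but the dynamically constructed URL is missed entirely, and the matcher instead reports the fragments `http://www.example.com` and `/item.asp?id=`, neither of which is the complete URL.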
It is therefore desirable to provide a new mechanism that can more accurately resolve URLs from script code embedded in web pages.