A web crawler (also referred to as a web spider, a web robot or a web scutter) is a program or automated script which attempts to browse the World Wide Web in a methodical, automated manner. Web crawlers are often used for indexing the pages, for purposes such as updating a database of a search engine. Other purposes may include automating maintenance tasks on a website, such as checking links or validating Hypertext Markup Language (HTML) code.
A web crawler typically starts with a hard-coded or otherwise obtained list of resource identifiers such as Uniform Resource Identifiers (URIs) or Uniform Resource Locators (URLs), this initial list being called the seeds. Upon visiting each URI, the crawler identifies all hyperlinks in the page indicated by the URI and adds them to the list of URIs to visit. The added URIs are then visited according to a set of policies, and the process continues recursively. Within each visited page, the text or HTML content of the page is discovered and optionally further processed, for example parsed and indexed.
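The crawl loop described above can be illustrated by the following minimal sketch. It is an illustration under stated assumptions and not a definitive implementation: the names LinkExtractor, crawl, PAGES and the example.test addresses are hypothetical, the "web" is an in-memory dictionary rather than a network, and the visiting policy is simply breadth-first with a page limit.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag found in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Visit pages breadth-first, starting from the seed URIs.

    `fetch` is any callable that returns the HTML of a URI, or None
    when the URI cannot be retrieved (an assumption of this sketch).
    """
    frontier = deque(seeds)   # URIs still to visit
    seen = set(seeds)         # URIs already queued, to avoid revisits
    visited = []              # URIs actually fetched, in visiting order
    while frontier and len(visited) < max_pages:
        uri = frontier.popleft()
        html = fetch(uri)
        if html is None:
            continue
        visited.append(uri)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(uri, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return visited


# Hypothetical in-memory "web" standing in for real HTTP requests.
PAGES = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a>',
    "http://example.test/b": '<a href="/">home</a>',
}

if __name__ == "__main__":
    print(crawl(["http://example.test/"], PAGES.get))
```

In this sketch the indexing step is omitted; a real crawler would process the fetched content at the point where the URI is appended to the visited list, and would apply politeness and revisit policies that are not shown here.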
However, there are scenarios that restrict or disallow the access and activity of a web crawler. One such scenario is the existence of dynamic web pages, to which no link exists. This situation may occur, for example, when a user presses a “Submit” button after filling in a form, or in Web 2.0 applications which create links by executing scripts or other programming units such as JavaScript, or other situations in which URIs are created on-the-fly.
Further, the content of such pages, and also of other pages accessible by a regular link, e.g., a hyperlink, may not always be pure text or HTML, but can rather contain non-HTML content, such as JavaScript, Flex, or Silverlight code embedded in the HTML, or content produced by any other technology that creates non-HTML content. Typical web crawlers are thus unable to parse, identify and make use of the content of such web pages.
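The limitation can be illustrated by a short sketch in which a purely static parser, of the kind a simple crawler might use, is applied to a page whose second link exists only after a script runs. The names AnchorCollector, static_links and PAGE, as well as the page content itself, are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Records the href of each <a> tag present in the static markup."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


# Hypothetical page: one ordinary hyperlink, and one link that only
# comes into existence when a browser executes the embedded script.
PAGE = """
<html><body>
<a href="/static.html">Static page</a>
<div id="menu"></div>
<script>
  var a = document.createElement("a");
  a.href = "/members/42";
  document.getElementById("menu").appendChild(a);
</script>
</body></html>
"""


def static_links(html):
    """Return the links visible without executing any script."""
    collector = AnchorCollector()
    collector.feed(html)
    return collector.hrefs


if __name__ == "__main__":
    print(static_links(PAGE))
```

Because the parser never executes the script, only "/static.html" is found; the script-created link "/members/42" never enters the list of URIs to visit.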
Such dynamically constructed web pages are common, for example, in portal applications, which usually rely on dynamic content rendering. Such content might be unreachable for typical web crawlers because navigation from one portal page to another is not realized through hyperlinks but rather results from the execution of an application's internal logic. For example, such an application can be used for enabling department members to view user information of all other department members. The application can be required to expose the user information to internal search engines so that these pages can later be searched. However, the links to such pages, as well as the contents of each such page, are constructed dynamically and thus cannot be reached and indexed by a web crawler.
There is thus a need for a method and apparatus for enabling a web crawler to reach dynamic web pages, and to index the contents of such web pages.