1. Field of the Invention
The present invention is related to the area of Internet search technologies and resource gathering using web crawling techniques, and in particular to a method and apparatus for automatically gathering dynamic content and resources on the world wide web by simulating user interaction and managing session information.
2. Description of Related Art
In the early days of the Internet, most web sites served static pages and content. These pages are typically written in HTML (Hypertext Markup Language), and their contents do not change unless modified by the site administrator or provider. Internet search providers use standard web crawling techniques to collect static data from these websites and to summarize and index the data in order to provide search facilities. The trend today is toward dynamically created web pages using scripting technologies on the server side (e.g. Active Server Pages, CGI, etc.). Database content is made available through web gateways. Web gateways process information requests and return the requested page or document to the user. Standard web crawling techniques are not sufficient to gather dynamic content.
Some websites generate dynamic content and require user input/interaction to access the data. These sites are typically shopping or password protected sites providing personalization features based on specific user input. In order to keep track of user preferences, personal data, and passwords, these sites issue “cookies” to store status information. A “cookie” is data that is stored on a user's machine and is read by the server that sets it. The server reads the cookie when the user returns to a site and the site is then personalized with a greeting such as “Welcome Back John Doe”. This user will not be able to navigate the site unless that cookie is read from their machine.
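The cookie round trip described above can be illustrated with a minimal sketch. It assumes the server sends the cookie in a Set-Cookie response header and expects it back in a Cookie request header; the helper names and the sample cookie "session-id=abc123" are hypothetical and are not taken from any particular site.

```python
from http.cookies import SimpleCookie


def parse_set_cookie(header_value: str) -> dict[str, str]:
    """Parse one Set-Cookie header value into a name -> value mapping.

    Attributes such as Path are recognized by SimpleCookie and are not
    returned as cookies themselves.
    """
    jar = SimpleCookie()
    jar.load(header_value)
    return {name: morsel.value for name, morsel in jar.items()}


def build_cookie_header(cookies: dict[str, str]) -> str:
    """Build the value of the Cookie request header sent back to the server."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


# Hypothetical header value, as a server might send it on a first visit.
cookies = parse_set_cookie("session-id=abc123; Path=/")
print(build_cookie_header(cookies))
```

A crawler that stores the parsed mapping and replays it on later requests presents itself to the server as a returning visitor.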
The main problem is that these dynamic web sites provide valuable content and information, which cannot be gathered and indexed automatically using existing technologies. However, it would be very valuable if this data were available and indexed for other meta search engines to search. For example, consider a database of books found at the website of “AMAZON.COM” (copyright) (http://www.amazon.com). This database contains data on millions of books, which may include the name of the book, the author, as well as an abstract or summary of the book. But more importantly, the database also contains reviews about these books, written by people who actually read the book. This site makes extensive use of personalization features and cookies, which we can describe as an interactive behavior containing session information. When a user or client visits the “AMAZON.COM” site, the “AMAZON.COM” server tries to set a “cookie”, which has to be accepted by the client. Many web browsers have built-in functionality that handles this and asks the user whether to accept or reject the cookie request. The standard web crawler is not able to systematically crawl the site and replicate the database because of the need for user interaction. There is no mechanism to simulate the user's behavior, or interaction, during a typical search session.
There are many more databases of books, such as “BarnesAndNoble.com”, and “FatBrain.com.” Essentially, the basic book data they keep is similar; however, any additional information they provide may vary and could provide useful insights to one seeking information on a particular book. Thus, it would be of great benefit for a web browser or crawler to be able to navigate these sites, among others, and automatically retrieve and process the content and information available.
In another example, a domain specific search engine like “jCentral” from IBM (http://www.ibm.com/developer/ibm), which is focused on the programming language “Java”, might be interested in providing a search feature for books about “Java.” It would therefore benefit software developers if “jCentral” could create an index of the data on “Java” which is stored on “AMAZON.COM”, and provide a domain specific search for interested “Java” developers. In order for “jCentral” to be able to perform such a search on a website such as “AMAZON.COM”, it is necessary for “jCentral” to be able to navigate and interact with the dynamic website. However, standard web crawling techniques cannot automatically simulate the user interaction required to navigate the sites and retrieve the desired information and content from the website.
Bearing in mind the problems and deficiencies of the prior art, it is therefore an object of the present invention to provide an apparatus and method to automatically simulate user interaction with a dynamic website.
It is another object of the present invention to provide an apparatus and method for a webcrawler to automatically simulate interactive behavior of a user in order to search and query dynamic websites.
A further object of the invention is to provide an apparatus and method for a webcrawler to automatically simulate interactive behavior of a user in order to gather and extract information from a dynamic website.
Still other objects and advantages of the invention will in part be obvious and will in part be apparent from the specification.
The above and other objects and advantages, which will be apparent to one of skill in the art, are achieved in the present invention which is directed to, in a first aspect, an automated method of gathering dynamic content and resources on the world wide web by simulating user interaction and managing session information. The method comprises the steps of identifying at least one uniform resource locator (“URL”), a document type definition (“DTD”) for the URL and at least one search topic to be searched on the URL. A query formed from the URL, the DTD and the at least one search topic is submitted to the URL, and the results are returned. In the preferred embodiment, after retrieving at least one result of the query, it is determined whether there is another search topic with which to search the URL. If so, another query of the URL is performed with the additional search topic, and the results are returned. In the preferred embodiment, these steps are repeated until all search topics have been searched on the site.
In the preferred embodiment, after the step of identifying at least one search topic to be searched, a query template is formed using the URL, DTD and search topic to complete a search query string. The search query string is adapted to be submitted to the URL to perform a hypertext transfer protocol request.
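As an illustration of completing a query template, the following minimal sketch fills a search topic into a set of form fields and encodes the result as a query string for an HTTP GET request. It assumes that the information derived from the DTD can be reduced to a mapping of fixed field names and values plus the name of the field that carries the topic; that representation, the function name and the example URL are hypothetical.

```python
from urllib.parse import urlencode


def build_query_url(base_url, topic_field, topic, fixed_fields=None):
    """Complete a query template and return the full request URL.

    base_url     -- URL of the site's search gateway (hypothetical here)
    topic_field  -- name of the form field that carries the search topic
    topic        -- the search topic to insert into the template
    fixed_fields -- other field names and values the site expects
    """
    params = dict(fixed_fields or {})
    params[topic_field] = topic
    # urlencode percent-encodes values and joins the pairs with "&".
    return f"{base_url}?{urlencode(params)}"


print(build_query_url(
    "http://books.example/search",
    "keywords",
    "Java programming",
    {"index": "books"},
))
# prints http://books.example/search?index=books&keywords=Java+programming
```

Keeping the template separate from the topic lets the same template be reused for each search topic on the site.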
After the step of retrieving at least one search result, it is also preferred to determine if additional search results are available, and if so, to perform a page navigation to retrieve the additional search results. This page navigation may be repeated until all search results have been retrieved.
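The page navigation loop can be sketched as follows. This is only an illustration under stated assumptions: the caller supplies a hypothetical fetch_page function that returns the results found on one page together with the URL of the next page, or None when no further results are available. The max_pages limit is an added safeguard against endless navigation and is not part of the described method.

```python
from typing import Callable, Optional

# One fetched page: the results on it and the URL of the next page, if any.
Page = tuple[list[str], Optional[str]]


def collect_all_results(
    first_url: str,
    fetch_page: Callable[[str], Page],
    max_pages: int = 100,
) -> list[str]:
    """Follow next-page links until no additional results are available."""
    results: list[str] = []
    url: Optional[str] = first_url
    pages_seen = 0
    while url is not None and pages_seen < max_pages:
        page_results, url = fetch_page(url)
        results.extend(page_results)
        pages_seen += 1
    return results


# Stand-in for a real site: two result pages linked by a "next" URL.
fake_site = {"p1": (["a", "b"], "p2"), "p2": (["c"], None)}
print(collect_all_results("p1", fake_site.__getitem__))  # prints ['a', 'b', 'c']
```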
In another aspect, the present invention is directed to an article of manufacture comprising a computer usable medium having computer readable program code means for automatically gathering dynamic content and resources on the world wide web by simulating user interaction and managing session information. The computer readable program code means in the article of manufacture comprises computer readable program code means to identify a URL for a website to be queried, computer readable program code means to identify a document type definition (DTD) for the URL, computer readable program code means to identify at least one search topic to be searched on the URL, computer readable program code means to query the URL with the DTD and at least one search topic, and computer readable program code means to retrieve the results of the query.
In the preferred embodiment, the article further comprises computer readable program code means to determine if the URL is to be searched with additional search topics and computer readable program code means to perform additional queries of the URL until all topics have been searched, and computer readable program code means to retrieve all search results.
It is also preferred that the article of manufacture comprise computer readable program code means to form a query template using the URL, DTD and search topic to complete a search query string, which is adapted to be submitted to the URL to perform a hypertext transfer protocol request.
In the preferred embodiment the article further comprises computer readable program code means for determining if additional search results are available and computer readable program code means for performing a page navigation to retrieve all search results.
In another aspect, the present invention is directed to a computer program product comprising a computer usable medium having computer readable program code means embodied in the medium for automatically gathering dynamic content and resources on the world wide web by simulating user interaction and managing session information. The computer program product includes computer readable program code means for causing a computer to identify a URL for a website to be queried, identify a document type definition (DTD) for the URL, identify at least one search topic to be searched on the URL, and conduct a search using the URL, DTD and search topic. The present invention also includes computer readable program code means for causing a computer to retrieve the results of the query and perform a page navigation in order to retrieve all the search results. In the preferred embodiment, the present invention also includes computer readable program code means to determine if the URL is to be searched with a second search topic and to perform additional queries until all search topics have been searched.
In the preferred embodiment, the computer program product further comprises computer readable program code means for causing a computer to form a query template using the URL, DTD and search topic to complete a search query string to be submitted to the URL to perform a hypertext transfer protocol request.
In another aspect, the present invention is directed to a computer program product for automatically gathering dynamic content and resources on the world wide web comprising a computer usable medium having computer readable program code means embodied in the medium for causing a computer to simulate user interaction and manage session information with a website. In the preferred embodiment, the computer program product includes computer readable program code means for causing a computer to determine at least one website with a URL to be searched and a document type definition for the website and to create a query search string for the website using the uniform resource locator and document type definition. In the preferred embodiment, the computer program product includes computer readable program code means for causing a computer to determine at least one search topic to be searched on the website, to insert the topic into the query string, to query the website with the query string, and to receive the results of the query.
In the preferred embodiment, the computer program product includes computer readable program code means for causing a computer to determine if there are additional search topics to be searched, and to repeat the foregoing process for each additional search topic until all search topics are searched.
In another aspect, the present invention is directed to an automated method of gathering content and information from a dynamic website comprising the steps of: identifying a uniform resource locator (“URL”) for a website to be searched, determining if the URL is a dynamic website, obtaining session data for the URL, formatting a search query string using the session data and a document type definition for the URL, formatting the search query string with a first topic to be searched to form a first search query string, performing a hypertext transfer protocol request of the website with the first search query string and processing a first set of search results for the first search query string.
In the preferred embodiment, the method further comprises determining if there are additional topics to be searched and repeating the foregoing steps for each topic until all topics are searched and all results processed.
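The repetition over topics with session data attached can be sketched as below. The sketch assumes that the session data is a simple name-to-value mapping sent back as a Cookie header, and that a caller-supplied send_request function performs the HTTP request and returns the processed results; both the function names and the example URL are hypothetical.

```python
from urllib.parse import urlencode


def crawl_topics(base_url, topic_field, topics, session, send_request):
    """Query the site once per topic, replaying the session data each time.

    send_request(url, cookie_header) is supplied by the caller and is
    expected to perform the HTTP request and return a list of results.
    """
    cookie_header = "; ".join(f"{k}={v}" for k, v in session.items())
    results_by_topic = {}
    for topic in topics:
        query_url = f"{base_url}?{urlencode({topic_field: topic})}"
        results_by_topic[topic] = send_request(query_url, cookie_header)
    return results_by_topic


def fake_send(url, cookie_header):
    """Stand-in for a real HTTP client; echoes what would be sent."""
    return [f"{url} [{cookie_header}]"]


print(crawl_topics("http://books.example/search", "q", ["java", "xml"],
                   {"sid": "1"}, fake_send))
```

Passing the request function in as a parameter keeps the loop independent of any particular HTTP library and makes it possible to exercise the logic without network access.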
It is also preferred that the step of determining if said URL is a dynamic website further comprise performing a hypertext transfer protocol GET method of the website, downloading content of the website including a header, and scanning the header for the session data, which may be represented by a cookie.
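Scanning the response header for session data can be sketched as follows. The sketch assumes that the header of the GET response is available as a list of (name, value) pairs, which is the shape returned by http.client.HTTPResponse.getheaders() in the Python standard library, and that the session data appears in Set-Cookie headers. The function names and sample headers are hypothetical. A missing cookie does not prove that a site is static; the check only mirrors the heuristic described above.

```python
from http.cookies import SimpleCookie


def extract_session_data(headers):
    """Collect cookie names and values from all Set-Cookie headers."""
    session = {}
    for name, value in headers:
        # Header names are case-insensitive in HTTP.
        if name.lower() == "set-cookie":
            jar = SimpleCookie()
            jar.load(value)
            for key, morsel in jar.items():
                session[key] = morsel.value
    return session


def looks_dynamic(headers):
    """Treat a site as dynamic if its response tries to set a cookie."""
    return bool(extract_session_data(headers))


# Hypothetical headers from a GET of a site's entry page.
sample_headers = [
    ("Content-Type", "text/html"),
    ("Set-Cookie", "session-id=abc123; Path=/"),
]
print(looks_dynamic(sample_headers), extract_session_data(sample_headers))
```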
In another aspect, the present invention is directed to an article of manufacture comprising a computer usable medium having computer readable program code means for automatically gathering content and information from a dynamic website comprising computer readable program code means to identify a URL for a website to be queried, to determine if the URL is a dynamic website, to obtain session data for the URL, to format a search query string using the session data and a document type definition for said URL, to format the search query string with a first topic to be searched to form a first search query string, to perform a hypertext transfer protocol request of the website with the first search query string and to process a first set of search results for the first search query string.
In the preferred embodiment, the computer readable program code means to determine if the URL is a dynamic website comprises computer readable program code means for performing a hypertext transfer protocol GET method of the website, downloading the content and header of the website, and scanning the header for the session data, which may be represented by a cookie.
In another aspect, the present invention is directed to a computer program product comprising a computer usable medium having computer readable program code means embodied in the medium for gathering content and information from a dynamic website. The computer readable program code means includes means for causing a computer to identify a uniform resource locator (“URL”) for a website to be searched, to determine if the URL is a dynamic website, to obtain the session data for the URL, to format a search query string using said session data and a document type definition for said URL, to format the search query string with a first topic to be searched to form a first search query string, to perform a hypertext transfer protocol request of the website with the first search query string, and to process the results of the search. In the preferred embodiment, the computer readable program code means for causing a computer to determine if the URL is a dynamic website comprises computer readable program code means for causing a computer to perform a hypertext transfer protocol GET method of the website, download the content and header of the website, and scan the header for the session data, which may be represented by a cookie.