The present invention relates to information provided by web servers and more particularly to the archiving of such information.
With the proliferation of content on the Internet or the World Wide Web, it has become ever more difficult to keep track of content that relates to a particular topic. For example, the content of existing web sites typically changes on a regular basis. Furthermore, new content is generally added on a regular basis. Finding such content may be time consuming. Moreover, once new content is found, because of the transitory nature of the contents of web sites, this content may change and be lost.
One system which tracks changes in web pages is described in PCT publication WO 97/15890 entitled IDENTIFYING CHANGES IN ON-LINE DATA REPOSITORIES, the disclosure of which is incorporated herein by reference as if set forth fully herein. In this PCT publication, a system is described which tracks changes in web pages identified by a user's "hot list." While the system described in the PCT publication may provide for tracking specific web pages, it does not provide for locating web pages related to a topic; rather, the web pages are specified in advance by a user. Thus, the system of the PCT publication does not take into account the fluid nature of the Internet in both the content of web pages and the existence of web pages.
Accordingly, improvements are needed in monitoring content provided by web servers.
Embodiments of the present invention include methods, systems and computer program products which provide for archiving information from a plurality of web servers by specifying at least one topic to be searched, searching the plurality of web servers so as to locate information associated with the at least one topic to be searched and retrieving the located information from at least one of the plurality of web servers. The retrieved information is archived so as to allow subsequent retrieval of the archived information independent of the plurality of web servers. This process is then periodically repeated so as to provide a history of information associated with the at least one topic.
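By way of illustration only, the search, retrieve, archive and repeat cycle described above might be sketched as follows. The class name TopicArchive, its methods, and the injected search and fetch callables are hypothetical and are not taken from this disclosure; an in-memory dictionary stands in for whatever archive storage an embodiment would use.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical callables: `search` maps a topic to a list of URLs and
# `fetch` maps a URL to its current content. Both are assumptions.
Search = Callable[[str], List[str]]
Fetch = Callable[[str], str]


class TopicArchive:
    """Keeps timestamped copies per topic, independent of the source servers."""

    def __init__(self) -> None:
        # topic -> list of (timestamp, url, content)
        self._snapshots: Dict[str, List[Tuple[float, str, str]]] = {}

    def run_cycle(self, topic: str, search: Search, fetch: Fetch,
                  now: Optional[float] = None) -> int:
        """Run one search/retrieve/archive pass; return the number of items stored."""
        stamp = time.time() if now is None else now
        stored = 0
        for url in search(topic):
            content = fetch(url)
            self._snapshots.setdefault(topic, []).append((stamp, url, content))
            stored += 1
        return stored

    def history(self, topic: str, url: str) -> List[Tuple[float, str]]:
        """Return (timestamp, content) pairs archived for one URL under a topic."""
        return [(t, c) for t, u, c in self._snapshots.get(topic, []) if u == url]
```

Calling run_cycle at a fixed interval, for example from a scheduler, would accumulate the history of information for the topic; history then reads only from the archive and never contacts the original servers.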
In further embodiments of the present invention, the plurality of web servers are searched and the located information retrieved by identifying information accessible through one of the plurality of web servers associated with the at least one topic and retrieving the identified information. The identified information is then analyzed to determine if additional information associated with the topic is specified by the identified information and this additional information associated with the topic is retrieved. The additional information specified by the identified information may be stored at the one of the plurality of web servers. In such a case, the additional information is retrieved from the one of the plurality of web servers. Alternatively, the additional information specified by the identified information may be stored at a different one of the plurality of web servers. In such a case, the additional information associated with the topic is retrieved from the different one of the plurality of web servers.
In still further embodiments of the present invention, the identified information is analyzed by detecting hyperlinks in the identified information, wherein the hyperlinks specify additional information associated with the topic.
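One possible way of detecting hyperlinks in retrieved information, and of distinguishing links to the same web server from links to a different web server, is sketched below using the Python standard library. The names LinkCollector, extract_links and split_by_server are hypothetical. The sketch collects every anchor link and does not itself decide whether a link is associated with the topic; that determination is assumed to be made elsewhere.

```python
from html.parser import HTMLParser
from typing import List, Tuple
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page they came from.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url: str, html: str) -> List[str]:
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


def split_by_server(base_url: str, links: List[str]) -> Tuple[List[str], List[str]]:
    """Separate links on the same server as base_url from links elsewhere."""
    host = urlparse(base_url).netloc
    same = [link for link in links if urlparse(link).netloc == host]
    other = [link for link in links if urlparse(link).netloc != host]
    return same, other
```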
In other embodiments of the present invention, the search and retrieval of information is provided by analyzing the retrieved information to determine if additional information related to the at least one topic may be retrieved. The additional information is then retrieved and archived. Preferably, the archived additional information is associated with the at least one topic.
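The analysis of retrieved information to locate and archive additional related information might, as one hedged example, take the form of a bounded breadth-first traversal. The function crawl_topic and its parameters fetch, links_in, is_relevant and max_pages are hypothetical names introduced only for this sketch; the page limit is an assumption added so the traversal terminates.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List


def crawl_topic(seeds: Iterable[str],
                fetch: Callable[[str], str],
                links_in: Callable[[str, str], List[str]],
                is_relevant: Callable[[str], bool],
                max_pages: int = 50) -> Dict[str, str]:
    """Archive relevant pages reachable from the seeds, following their links."""
    seed_list = list(seeds)
    queue = deque(seed_list)
    seen = set(seed_list)
    archived: Dict[str, str] = {}
    while queue and len(archived) < max_pages:
        url = queue.popleft()
        content = fetch(url)
        if not is_relevant(content):
            # Pages unrelated to the topic are neither archived nor followed.
            continue
        archived[url] = content
        for link in links_in(url, content):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return archived
```

In this sketch only relevant pages are followed, so additional information is associated with the topic by construction; an embodiment could equally follow all links and filter afterwards.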
In yet other embodiments of the present invention, the topic to be searched is specified by specifying a plurality of keywords and a relationship between the plurality of keywords. Alternatively, the topic to be searched may be specified by identifying a document associated with at least one topic. The identified document may then be analyzed so as to identify characteristics of the document associated with the at least one topic associated with the document. A search may then be developed based on the identified characteristics of the document so as to search for information associated with the at least one topic.
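The two ways of specifying a topic described above can be illustrated with a minimal sketch. The names TopicQuery and query_from_document are hypothetical, the relationship between keywords is assumed to be a simple "AND" or "OR", and term frequency is used here merely as one stand-in for the unspecified document characteristics; the disclosure does not prescribe any of these choices.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import Tuple

# Small illustrative stop-word list; an assumption, not from the disclosure.
STOPWORDS = frozenset({"the", "a", "an", "of", "and", "or", "to", "in", "is", "for"})


def _words(text: str):
    return re.findall(r"[a-z0-9]+", text.lower())


@dataclass(frozen=True)
class TopicQuery:
    keywords: Tuple[str, ...]
    relationship: str  # assumed to be "AND" or "OR"

    def __post_init__(self):
        if self.relationship not in ("AND", "OR"):
            raise ValueError("relationship must be 'AND' or 'OR'")

    def matches(self, text: str) -> bool:
        present = set(_words(text))
        hits = [k.lower() in present for k in self.keywords]
        return all(hits) if self.relationship == "AND" else any(hits)


def query_from_document(text: str, max_keywords: int = 3) -> TopicQuery:
    """Derive a query from a sample document using its most frequent terms."""
    counts = Counter(w for w in _words(text) if w not in STOPWORDS)
    top = tuple(word for word, _ in counts.most_common(max_keywords))
    return TopicQuery(top, "OR")
```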
In other embodiments of the present invention, the archived information is stored at a location local to a user. Alternatively, the retrieved information may be stored at a web server.
In still further embodiments of the present invention, a system for generating an information archive is provided. The system includes a database and a search servlet configured so as to periodically search and retrieve information stored in a plurality of information sources associated with a user specified topic. An archive servlet is configured so as to store the information retrieved by the search servlet in the database and to associate the stored information with the user specified topic. Also, an archive user interface program is configured so as to access the database to retrieve information stored in the database and associated with the user specified topic independent of the information stored in the plurality of information sources.
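The division of responsibilities among the database, the search component and the archive component might be sketched as follows. The disclosure describes servlets; this sketch uses plain Python classes with an in-memory SQLite database purely for illustration, and the names ArchiveStore and SearchWorker, the table layout, and the injected search and fetch callables are all assumptions.

```python
import sqlite3
from typing import Callable, List, Tuple


class ArchiveStore:
    """Plays the role of the archive component: stores items keyed by topic."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS archive ("
            "topic TEXT NOT NULL, url TEXT NOT NULL, "
            "retrieved_at TEXT NOT NULL, content TEXT NOT NULL)"
        )

    def store(self, topic: str, url: str, retrieved_at: str, content: str) -> None:
        self.conn.execute(
            "INSERT INTO archive VALUES (?, ?, ?, ?)",
            (topic, url, retrieved_at, content),
        )
        self.conn.commit()

    def by_topic(self, topic: str) -> List[Tuple[str, str, str]]:
        """What an archive user interface would call; reads only the database."""
        rows = self.conn.execute(
            "SELECT url, retrieved_at, content FROM archive "
            "WHERE topic = ? ORDER BY retrieved_at, url",
            (topic,),
        )
        return rows.fetchall()


class SearchWorker:
    """Plays the role of the search component: finds and retrieves items."""

    def __init__(self, store: ArchiveStore,
                 search: Callable[[str], List[str]],
                 fetch: Callable[[str], str]) -> None:
        self.store = store
        self.search = search
        self.fetch = fetch

    def run_once(self, topic: str, retrieved_at: str) -> None:
        for url in self.search(topic):
            self.store.store(topic, url, retrieved_at, self.fetch(url))
```

Because by_topic queries only the database, archived items remain retrievable even after the original information source has changed or disappeared.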
In further embodiments, a search user interface program is configured so as to specify the topic for the search servlet to periodically search. The search user interface program and the archive user interface program may be configured to allow access to the search servlet and the archive servlet by a web browser.
In other embodiments, the search servlet is further configured to analyze the retrieved information and retrieve additional information identified in the retrieved information, and the archive servlet is further configured to store the additional information.
While the invention has been described above primarily with respect to the method aspects of the invention, systems and/or computer program products are also provided.