1. Field of the Invention
The present invention generally relates to a method and system for periodically searching through files accessible through a network, and in particular, to a method and system in which the periodic searches of files accessible on a network are scheduled based on data from files previously accessed.
2. Description of the Related Art
A network server maintains various files accessible across a network. In the case of the Internet, the files may comprise hypertext mark-up language (HTML) data, Common Gateway Interface (CGI) scripts, image files (e.g., .jpg and .gif), and Channel Definition Format (CDF) files. Collectively, the files linked through HTML files produce a website, wherein the server acts as the website host.
CDFs are small files which include data used by a website's “push” feature to specify how often, and which parts of, the site will be “pushed” (e.g., e-mailed) directly to a registered subscriber. Based on the data in the CDF, the website will e-mail various information to the subscriber.
A typical CDF file is an Extensible Markup Language (XML) file. A CDF file contains various elements referred to as tags. Some of these tags are CHANNEL, ITEM, USERSCHEDULE, SCHEDULE, LASTMOD, and LEVEL.
The CHANNEL tag has an HREF attribute that specifies the Uniform Resource Locator (URL) on the website that corresponds to that CHANNEL. For example:
<CHANNEL HREF=“http://www.mysite.com/Channel/homepage.htm”>
The SCHEDULE tag indicates when a channel should be updated. For example:
<SCHEDULE STARTDATE=“1999-09-23” STOPDATE=“1997-11-23”>
<INTERVALTIME DAY=“1”/>
<EARLIESTTIME HOUR=“2”/>
<LATESTTIME HOUR=“6”/>
</SCHEDULE>
indicates that the channel should be updated every day between the start date and the stop date, between the hours of 2 and 6.
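By way of illustration only, the following Python sketch shows how the update interval and time window might be read from such a SCHEDULE element. It is a minimal sketch under stated assumptions: the fragment is rewritten with straight quotes so that a standard XML parser accepts it, the child tags are assumed to be spelled EARLIESTTIME and LATESTTIME and to carry an HOUR attribute, the function name parse_schedule is hypothetical, and the date attributes are carried through as plain strings without validation.

```python
import xml.etree.ElementTree as ET

# Sample fragment modeled on the SCHEDULE example above.
# Assumptions: straight quotes, EARLIESTTIME/LATESTTIME with an HOUR attribute.
SCHEDULE_XML = """
<SCHEDULE STARTDATE="1999-09-23" STOPDATE="1997-11-23">
  <INTERVALTIME DAY="1"/>
  <EARLIESTTIME HOUR="2"/>
  <LATESTTIME HOUR="6"/>
</SCHEDULE>
"""


def parse_schedule(xml_text):
    """Return the schedule fields of a SCHEDULE element as a dict."""
    root = ET.fromstring(xml_text)

    def int_attr(tag, attr, default=0):
        # Read an integer attribute from a child element, if present.
        node = root.find(tag)
        if node is None:
            return default
        return int(node.get(attr, default))

    return {
        "start": root.get("STARTDATE"),  # kept as a string, not validated
        "stop": root.get("STOPDATE"),    # kept as a string, not validated
        "interval_days": int_attr("INTERVALTIME", "DAY"),
        "earliest_hour": int_attr("EARLIESTTIME", "HOUR"),
        "latest_hour": int_attr("LATESTTIME", "HOUR"),
    }


schedule = parse_schedule(SCHEDULE_XML)
print(schedule)
```

A consumer of this dictionary could, for example, plan one update every interval_days days at some hour between earliest_hour and latest_hour.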
Occasionally, a channel may have a subchannel. A subchannel refers to a sub-site on the website. A subchannel may appear as:
<ITEM HREF=“foobar.htm” LASTMOD=“1999-01-01 TO0101” LEVEL=“2”>
<USAGE VALUE=“ScreenSaver”></USAGE>
</ITEM>
A subchannel references a URL together with information about when the page at that URL was last modified, from which it can be determined whether the information at that URL is still relevant.
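For illustration, the following Python sketch compares the LASTMOD value of an ITEM element against the time of a previous visit. It is a sketch only: the LASTMOD value is assumed to be a timestamp of the form YYYY-MM-DDTHH:MM (the value used here is a hypothetical stand-in), straight quotes are used so that the fragment parses as XML, and the function name item_changed_since is hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical ITEM fragment; the LASTMOD timestamp format is an assumption.
ITEM_XML = """
<ITEM HREF="foobar.htm" LASTMOD="1999-01-01T01:01" LEVEL="2">
  <USAGE VALUE="ScreenSaver"></USAGE>
</ITEM>
"""


def item_changed_since(xml_text, last_visit):
    """Return (href, changed), where changed is True if the item was
    modified after last_visit."""
    item = ET.fromstring(xml_text)
    last_modified = datetime.strptime(item.get("LASTMOD"), "%Y-%m-%dT%H:%M")
    return item.get("HREF"), last_modified > last_visit


# A visit before the modification time reports a change; a later visit does not.
href, changed = item_changed_since(ITEM_XML, datetime(1998, 12, 1))
_, changed_later = item_changed_since(ITEM_XML, datetime(1999, 2, 1))
print(href, changed, changed_later)
```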
A conventional search engine accesses websites on the network. The search engine downloads data from the website and archives selected downloaded data. The archived data is linked to the website from which it was downloaded.
One can use the search engine to search for a particular website containing desirable information by entering a query into the search engine. The search engine will search its archived data and return websites in its archived database which relate to the query.
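The archive-and-query behavior described above can be sketched, in highly simplified form, as a word index over archived page text. The URLs, page text, and the function names build_index and search below are hypothetical, and requiring every query word to appear on a page is an assumption of this sketch rather than a description of any particular search engine.

```python
from collections import defaultdict

# Hypothetical archive: URL -> text archived from that page.
ARCHIVE = {
    "http://www.example.com/a.htm": "weather forecast for seattle",
    "http://www.example.com/b.htm": "seattle restaurant guide",
    "http://www.example.com/c.htm": "weather records",
}


def build_index(archive):
    """Map each word to the set of URLs whose archived text contains it."""
    index = defaultdict(set)
    for url, text in archive.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the sorted URLs whose archived text contains every query word."""
    terms = query.lower().split()
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(hits)


index = build_index(ARCHIVE)
print(search(index, "seattle weather"))
```

Because the index is built from archived text, a query can only return what was on a page at the time of the last visit.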
The dynamic nature of the Internet results in websites being updated regularly. Consequently, data which was on the website when the search engine initially visited the website may no longer be there. Alternatively, the data may be outdated. Further, the website may no longer exist or its URL may have changed. As a result, data archived by the search engine could become invalid. In order for the search engine to be a useful tool, the search engine must periodically update its archived data.
A conventional search engine uses a web crawler (also referred to as a “robot”, “spider”, or “ant”) to visit (i.e., access) a server on a network. The web crawler “crawls” from a homepage (i.e., the first or main webpage) of a website to the various subpages linked from the homepage. As the web crawler visits the various homepages and their subpages, data on the pages is selectively archived by the search engine.
A typical web crawler visits websites at regular intervals, for example, every 30 days. If a web crawler accesses a website which has not been updated since the web crawler's last visit, the web crawler presumes that the previously archived data is still valid. This presumption may be erroneous.
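The crawl from a homepage to its linked subpages can be sketched as a breadth-first traversal. In the following Python sketch the website is represented by a hypothetical in-memory link table (SITE) rather than by pages fetched over a network, and the function name crawl is likewise hypothetical; breadth-first order is one reasonable choice, not necessarily the order used by any given crawler.

```python
from collections import deque

# Hypothetical website: page -> pages it links to.
SITE = {
    "/index.htm": ["/news.htm", "/about.htm"],
    "/news.htm": ["/index.htm", "/news/item1.htm"],
    "/about.htm": [],
    "/news/item1.htm": ["/news.htm"],
}


def crawl(homepage, links):
    """Visit every page reachable from homepage once, breadth first,
    and return the pages in visiting order."""
    visited = []
    seen = {homepage}
    queue = deque([homepage])
    while queue:
        page = queue.popleft()
        visited.append(page)  # a real crawler would archive page data here
        for target in links.get(page, []):
            if target not in seen:  # avoid revisiting pages linked in cycles
                seen.add(target)
                queue.append(target)
    return visited


order = crawl("/index.htm", SITE)
print(order)
```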
That is, one disadvantage of current web crawler technology is that the web crawler does not know when a website is scheduled to be updated. Depending on how often a website is updated, the web crawler's archived data could be very outdated by the time the web crawler returns. On the other hand, frequent web crawler visits to websites that are not frequently updated consume valuable computer resources.
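The trade-off described above can be made concrete with a small day-by-day simulation. The following Python sketch is illustrative only: the function name simulate, the day-granularity model, and the assumption that the archive is current on day 0 are all hypothetical simplifications. It counts the days on which the archive lags the website and the crawler visits that find nothing new.

```python
def simulate(update_every, crawl_every, horizon):
    """Simulate a site updated every update_every days and crawled every
    crawl_every days, over horizon days.

    Returns (stale_days, wasted_visits): the number of days on which the
    archive lags the site, and the number of visits that found no change.
    """
    last_update = 0      # day of the most recent site update
    archived_update = 0  # day of the update currently reflected in the archive
    stale_days = 0
    wasted_visits = 0
    for day in range(1, horizon + 1):
        if day % update_every == 0:
            last_update = day
        if day % crawl_every == 0:
            if last_update == archived_update:
                wasted_visits += 1  # nothing changed since the last visit
            archived_update = last_update
        if archived_update != last_update:
            stale_days += 1
    return stale_days, wasted_visits


# A site updated daily but crawled every 30 days, and the reverse case.
print(simulate(update_every=1, crawl_every=30, horizon=60))
print(simulate(update_every=30, crawl_every=1, horizon=60))
```

Under these assumptions, the first case leaves the archive outdated on most days, while the second spends most visits on a site that has not changed.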