This invention relates generally to a method and computer-readable medium for analyzing data and, more specifically, to a method and computer-readable medium for analyzing data from a plurality of network sites.
Internet crawlers query web sites in order to obtain index information and provide Internet search data. In the past, no tool has existed that adequately analyzes the data resulting from web crawlers querying web sites. In this regard, it is desirable for an Internet analysis tool to provide statistics about data found on Internet sites. Desirable statistics include such diverse information as the percentage of educational sites, the average amount of graphics per site, and the average number of hyperlinks per site.
An acceptable Internet analysis tool must be able to query a large volume of web sites, scan the hypertext markup language (HTML) files downloaded from the sites and provide results of analysis criteria based on the contents of the HTML files. The tool should be able to process large volumes of data without operator intervention. The present invention is directed to providing such a tool.
In accordance with the present invention, a method and computer-readable medium for analyzing network data, in particular Internet data, are provided. The method and computer-readable medium for analyzing network data comprise: obtaining the identity of one or more sites (web sites in the case of the Internet) to query; obtaining one or more query criteria; accessing the one or more sites; and analyzing the query criteria in the site data.
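The four steps recited above can be illustrated with a minimal Python sketch. It is not the claimed implementation: the function name `analyze_sites`, the injected `fetch` callable, and the `SAMPLE_PAGES` data are hypothetical, and the analysis shown is a simple case-insensitive substring count chosen only for illustration.

```python
from typing import Callable, Dict, Iterable, List


def analyze_sites(
    sites: Iterable[str],
    criteria: List[str],
    fetch: Callable[[str], str],
) -> Dict[str, Dict[str, int]]:
    """Count each query criterion in the data returned for each site."""
    results: Dict[str, Dict[str, int]] = {}
    for site in sites:
        # Access the site; `fetch` is supplied by the caller.
        data = fetch(site).lower()
        # Analyze the query criteria in the site data.
        results[site] = {c: data.count(c.lower()) for c in criteria}
    return results


# Hypothetical stand-in for a network fetch, so the sketch runs offline.
SAMPLE_PAGES = {
    "http://example.edu": "<html><body><img src='a.gif'><a href='/x'>x</a></body></html>",
    "http://example.com": "<html><body><a href='/y'>y</a><a href='/z'>z</a></body></html>",
}

report = analyze_sites(
    sites=list(SAMPLE_PAGES),
    criteria=["<img", "<a "],
    fetch=lambda url: SAMPLE_PAGES[url],
)
```

Passing the fetch step in as a parameter keeps the sketch independent of any particular network library; a real tool would substitute an HTTP download there.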
In accordance with another aspect of the present invention, the results of an Internet analysis are displayed.
In accordance with a further aspect of the present invention, the results of an Internet analysis are stored.
In accordance with yet another aspect of the present invention, the query criteria are determined by the user. Preferably, the user-determined query criteria are saved for subsequent analyses.
In accordance with yet a further aspect of the present invention, a default set of query criteria is provided. Preferably, the default query criteria are user modifiable, and the user can save modified query criteria either as the new default query criteria or as a different set of query criteria, leaving the existing default criteria unchanged.
In accordance with still further aspects of the present invention, a user selects the sites (e.g., the Internet web sites) to be analyzed.
In accordance with an alternative aspect of the present invention, the sites to be analyzed are randomly selected. Preferably, the number of sites to be randomly selected is determined by the user.
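As a hedged illustration of random site selection, the sketch below draws a user-specified number of sites from a candidate pool. The source does not say where candidate sites come from, so the `candidates` list, the function name `pick_random_sites`, and the optional `seed` parameter are assumptions.

```python
import random
from typing import List, Optional, Sequence


def pick_random_sites(
    candidates: Sequence[str],
    count: int,
    seed: Optional[int] = None,
) -> List[str]:
    """Return `count` distinct sites chosen at random from `candidates`."""
    rng = random.Random(seed)  # a seed makes a run repeatable (assumption)
    count = max(0, min(count, len(candidates)))  # never ask for more than exist
    return rng.sample(list(candidates), count)


# Hypothetical candidate pool.
pool = [f"http://site{i}.example" for i in range(10)]
chosen = pick_random_sites(pool, count=3, seed=1)
```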
In accordance with further alternative aspects of the present invention, an existing site list is used to identify the sites to be analyzed. Preferably, the user can modify and save the site list.
In accordance with further aspects of the present invention, analyzing the query criteria can be accomplished by counting occurrences of the query criteria in the site data. Alternatively, analysis can be accomplished by determining the size of the data specified by the query criteria.
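The two analysis modes can be contrasted in a short Python sketch. The source does not define how "the data specified by the query criteria" is identified, so the example assumes, purely for illustration, that a criterion selects downloaded resources by file-name suffix; `count_occurrences`, `total_size`, and the sample data are hypothetical names.

```python
from typing import Dict


def count_occurrences(site_data: str, criterion: str) -> int:
    """Counting mode: how many times the criterion appears in the site data."""
    return site_data.lower().count(criterion.lower())


def total_size(resources: Dict[str, bytes], suffix: str) -> int:
    """Size mode: total bytes of the resources whose names end with `suffix`."""
    return sum(
        len(body)
        for name, body in resources.items()
        if name.lower().endswith(suffix.lower())
    )


# Hypothetical downloaded data for one site.
page = "<html><img src='logo.gif'><img src='photo.GIF'><p>text</p></html>"
downloaded = {
    "logo.gif": b"\x00" * 120,
    "photo.GIF": b"\x00" * 300,
    "index.html": page.encode("ascii"),
}

image_tags = count_occurrences(page, "<img")
gif_bytes = total_size(downloaded, ".gif")
```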
In accordance with another aspect of the present invention, Internet trends are tracked by performing the same analysis at different times. Trend tracking can be done manually or automatically.
In accordance with yet another aspect of the present invention, the time increment for automatic trend tracking is determined by the user; for example, the analysis may be repeated on a monthly basis.
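One way to picture trend tracking is to keep a dated snapshot of each analysis run and compare consecutive snapshots. The following Python sketch assumes that representation; the `Snapshot` type, the `trend` function, the statistic name, and the sample figures are all hypothetical and do not come from the source.

```python
from datetime import date
from typing import Dict, List, Tuple

# A snapshot pairs the run date with the statistics computed on that date.
Snapshot = Tuple[date, Dict[str, float]]


def trend(snapshots: List[Snapshot], statistic: str) -> List[Tuple[date, float]]:
    """Return the change in `statistic` between consecutive analysis runs."""
    ordered = sorted(snapshots, key=lambda snap: snap[0])
    return [
        (later[0], later[1][statistic] - earlier[1][statistic])
        for earlier, later in zip(ordered, ordered[1:])
    ]


# Hypothetical monthly runs of the same analysis.
runs: List[Snapshot] = [
    (date(2000, 1, 1), {"links_per_site": 4.0}),
    (date(2000, 2, 1), {"links_per_site": 5.5}),
    (date(2000, 3, 1), {"links_per_site": 5.0}),
]

changes = trend(runs, "links_per_site")
```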
In accordance with yet still another aspect of the present invention, occurrences of a text string are counted if found anywhere within the HTML file. Alternatively, occurrences are only counted if found in a specified HTML tag. For example, the analysis may count files containing <script> tags that have the “language” attribute where the attribute value is “javascript”. The preceding example provides the user with summary information regarding the number of files found during an analysis that include JavaScript. Alternatively, the count may be about the tag itself, for example how often bold text is included in HTML files.
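The tag-scoped counting described above can be sketched with Python's standard `html.parser` module. This is an illustrative sketch under stated assumptions, not the patented method: the class `TagCounter`, the helper `count_tag`, and the sample files are hypothetical, and attribute values are compared case-insensitively by choice.

```python
from html.parser import HTMLParser
from typing import Optional


class TagCounter(HTMLParser):
    """Count start tags, optionally requiring an attribute (and value)."""

    def __init__(self, tag: str, attr: Optional[str] = None,
                 value: Optional[str] = None) -> None:
        super().__init__()
        self.tag = tag.lower()
        self.attr = attr.lower() if attr else None
        self.value = value.lower() if value else None
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag != self.tag:
            return
        if self.attr is None:
            self.count += 1
            return
        for name, val in attrs:
            if name == self.attr and (
                self.value is None or (val or "").lower() == self.value
            ):
                self.count += 1
                break


def count_tag(html: str, tag: str, attr: Optional[str] = None,
              value: Optional[str] = None) -> int:
    parser = TagCounter(tag, attr, value)
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical downloaded files.
files = {
    "a.html": '<html><script language="JavaScript">var x = 1;</script><b>hi</b></html>',
    "b.html": "<html><b>one</b> and <b>two</b></html>",
}

# Number of files that contain at least one JavaScript <script> tag.
files_with_javascript = sum(
    1 for html in files.values()
    if count_tag(html, "script", "language", "javascript") > 0
)
# Number of bold tags across all files.
bold_tags = sum(count_tag(html, "b") for html in files.values())
```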
In accordance with a further aspect of the present invention, analysis is only performed on the sites specified in the site list. Alternatively, links found in the site can be followed and analysis can be performed on the linked sites as well as the sites referenced directly in the site list.
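The difference between the two modes can be shown with a small Python sketch that either returns the site list unchanged or extends it with the links found on the listed sites. The source does not specify how far links are followed, so the sketch assumes a single level; `LinkExtractor`, `sites_to_analyze`, the `follow_links` flag, and the sample pages are hypothetical.

```python
from html.parser import HTMLParser
from typing import Callable, List
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


def sites_to_analyze(site_list: List[str], fetch: Callable[[str], str],
                     follow_links: bool = False) -> List[str]:
    """Return the listed sites, plus directly linked sites if requested."""
    ordered = list(dict.fromkeys(site_list))  # drop duplicates, keep order
    if not follow_links:
        return ordered
    seen = set(ordered)
    for site in list(ordered):  # one level only (assumption)
        extractor = LinkExtractor(site)
        extractor.feed(fetch(site))
        for link in extractor.links:
            if link not in seen:
                seen.add(link)
                ordered.append(link)
    return ordered


# Hypothetical page content standing in for a network fetch.
PAGES = {
    "http://a.example/": '<a href="/about">About</a> <a href="http://b.example/">B</a>',
}

listed_only = sites_to_analyze(list(PAGES), PAGES.__getitem__)
with_links = sites_to_analyze(list(PAGES), PAGES.__getitem__, follow_links=True)
```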