This invention relates generally to web-server traffic data analysis and more particularly to a system and method for analyzing web-server log files.
The World Wide Web (hereinafter "web") is rapidly becoming one of the most important publishing media today. The reason is simple: web servers interconnected via the Internet provide access to a potentially worldwide audience with a minimal investment in time and resources in building a web site. The web server makes available for retrieval and posting a wide range of media in a variety of formats, including audio, video and traditional text and graphics. The ease of creating a web site also makes reaching this worldwide audience a reality for all types of users, from corporations, to startup companies, to organizations and individuals.
Unlike other forms of media, a web site is interactive and the web server can passively gather access information about each user by observing and logging the traffic data packets exchanged between the web server and the user. Important facts about the users can be determined directly or inferentially by analyzing the traffic data and the context of the "hit." Moreover, traffic data collected over a period of time can yield statistical information, such as the number of users visiting the site each day, what countries, states or cities the users connect from, and the most active day or hour of the week. Such statistical information is useful in tailoring marketing or managerial strategies to better match the apparent needs of the audience. Each hit is also encoded with the date and time of the access. Because the statistical information of interest is virtually all related to time periods, accurately tracking the time of each hit is critical.
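As a minimal sketch of the kind of time-based statistics described above, the following Python fragment counts hits per day and finds the most active hour of the day. The record layout (a timestamp paired with a source address) and the sample data are assumptions made purely for illustration; actual log formats differ.

```python
from collections import Counter
from datetime import datetime

# Hypothetical hit records: (timestamp, source address). Illustrative only.
HITS = [
    ("1997-03-03T09:15:00", "10.0.0.1"),
    ("1997-03-03T09:47:00", "10.0.0.2"),
    ("1997-03-03T14:02:00", "10.0.0.1"),
    ("1997-03-04T09:30:00", "10.0.0.3"),
]

def hits_per_day(hits):
    """Count hits for each calendar date."""
    return Counter(datetime.fromisoformat(ts).date().isoformat() for ts, _ in hits)

def busiest_hour(hits):
    """Return the hour of day (0-23) that received the most hits."""
    by_hour = Counter(datetime.fromisoformat(ts).hour for ts, _ in hits)
    return by_hour.most_common(1)[0][0]

daily = hits_per_day(HITS)
peak = busiest_hour(HITS)
```

Both statistics depend entirely on the timestamp of each hit, which is why accurate time tracking matters.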
To optimize use of this statistical information, web server traffic analysis must be timely. However, it is not unusual for a web server to process thousands of users daily. The resulting access information recorded by the web server amounts to megabytes of traffic data. Some web servers generate gigabytes of daily traffic data. Analyzing the traffic data for even a single day to identify trends or generate statistics is computationally intensive and time-consuming. Moreover, the processing time needed to analyze the traffic data for several days, weeks or months increases linearly as the time frame of interest increases.
The problem of performing efficient and timely traffic analysis is not unique to web servers. Rather, traffic data analysis is possible whenever traffic data is observable and can be recorded in a uniform manner, such as in a distributed database, client-server system or other remote access environment.
Some web sites are so busy, i.e., handle so much traffic, that they require multiple servers to handle all of the traffic. Other sites may need multiple servers because of their large size. Operators of critical sites, i.e., ones that cannot afford to be down because of a problem with a server, may also choose to deploy their site on multiple servers. Such multiple servers are sometimes referred to as a server farm. Server farms provide high-bandwidth, reliable access to web sites.
There are several topologies that may be used in a server farm, but the most important ones divide the farm into clusters of servers. The web site is mirrored on each server within the cluster. Special hardware receives all of the traffic to the web site and distributes each hit to one of the servers. Some systems provide accurate load balancing in that the hits are rotated in sequence among the servers. Others assign each hit from a new source to a server and direct further access to the site from that source to the assigned server. This is accomplished by assigning a predetermined time period, for example 30 minutes, during which all future access from the same source is considered to be part of a single session from that source. As described further below, the latter approach permits some log-file analysis, which is not possible using the load-balancing technique.
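The two distribution schemes described above can be sketched as follows. This is an illustrative model only: the class names, the server labels and the choice to measure the session period from the moment of assignment are assumptions, not details taken from any particular load-balancing hardware.

```python
from itertools import cycle

SESSION_TIMEOUT = 30 * 60  # seconds; the 30-minute example period from the text

class RoundRobinDispatcher:
    """Rotates every hit through the servers in sequence."""
    def __init__(self, servers):
        self._next = cycle(servers)

    def assign(self, source, timestamp):
        return next(self._next)

class SessionDispatcher:
    """Keeps a source on its assigned server until its session period ends."""
    def __init__(self, servers, timeout=SESSION_TIMEOUT):
        self._next = cycle(servers)
        self._timeout = timeout
        self._sessions = {}  # source -> (server, time the session was assigned)

    def assign(self, source, timestamp):
        entry = self._sessions.get(source)
        if entry is not None and timestamp - entry[1] < self._timeout:
            return entry[0]  # still inside the session period
        server = next(self._next)  # new source or expired session
        self._sessions[source] = (server, timestamp)
        return server

# One source making hits at 0 s, 60 s, 120 s and 1800 s.
rr = RoundRobinDispatcher(["A", "B", "C"])
rr_servers = [rr.assign("u1", t) for t in (0, 60, 120)]
sd = SessionDispatcher(["A", "B", "C"])
sd_servers = [sd.assign("u1", t) for t in (0, 60, 120, 1800)]
```

Under rotation, the three hits from the same source land on three different servers, so no single log file holds that source's complete visit. Under session assignment, the hits stay on one server until the period elapses.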
Server farms, although providing load balancing and redundancy, present problems in analyzing the log files generated by the servers. Prior art systems for analyzing web-server log files can handle multiple log files, but only files that are consecutively generated, i.e., the data packets within each log file are in chronological order, and each log file corresponds to a distinct time period and contains only data packets from within that period. Log files on servers in a server farm, however, are concurrently generated: each log file covers or overlaps the same time period. On server farms that rotate the hits among the servers, prior art log file analysis programs do not generate useful information. Brute force solutions are possible, such as sorting all of the log files and creating a new single file, or copying all of the hits from each log file to a large database, which can sort and analyze the data. These solutions have severe drawbacks: they are computationally intensive, they require creation of large new files, and they are done only after log files are complete, i.e., not on the fly while the log file is still being populated.
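To make the difficulty concrete, the following sketch shows two concurrently generated logs and the brute force approach of copying and sorting them. The tuple layout and sample values are hypothetical, and the sketch illustrates the drawbacks noted above rather than a recommended technique.

```python
# Hypothetical per-server logs: each is in chronological order on its own,
# but both cover the same time period. Entries are (seconds, source, server).
LOG_A = [(0, "u1", "A"), (20, "u2", "A"), (40, "u1", "A")]
LOG_B = [(10, "u2", "B"), (30, "u1", "B"), (50, "u2", "B")]

def periods_overlap(log_x, log_y):
    """True if the time spans covered by two non-empty logs intersect."""
    return log_x[0][0] <= log_y[-1][0] and log_y[0][0] <= log_x[-1][0]

def brute_force_merge(*logs):
    """Copy every hit into one new list and sort it by timestamp.

    Requires complete logs and holds a full second copy of the data.
    """
    combined = [hit for log in logs for hit in log]
    combined.sort(key=lambda hit: hit[0])
    return combined

concurrent = periods_overlap(LOG_A, LOG_B)
merged = brute_force_merge(LOG_A, LOG_B)
merged_times = [hit[0] for hit in merged]
```

The merged list is in chronological order, but only because every hit was duplicated and re-sorted after both logs were finished.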
Server farms that assign hits from a new source to a single server can run prior art log analysis programs on each server and sum the results. This, however, is not completely accurate and is disadvantageous because it requires generation of separate reports that must each be consulted or further manipulated to obtain information that applies to the entire server farm.
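One plausible source of this inaccuracy, offered here as an assumption rather than something stated above, is that a source returning after its session period has elapsed may be assigned to a different server. Counting unique sources per server and summing the counts then counts that source twice. The logs below are hypothetical.

```python
# Hypothetical per-server logs of (seconds, source). Source "u1" returns
# after its session period has elapsed and is assigned to server B.
SERVER_A_LOG = [(0, "u1"), (100, "u1"), (200, "u2")]
SERVER_B_LOG = [(50, "u3"), (4000, "u1")]

def unique_sources(log):
    """Return the set of distinct sources appearing in a log."""
    return {source for _, source in log}

# Per-server reports, summed.
summed = len(unique_sources(SERVER_A_LOG)) + len(unique_sources(SERVER_B_LOG))

# Count over the farm as a whole.
farm_wide = len(unique_sources(SERVER_A_LOG + SERVER_B_LOG))
```

Here the summed per-server reports give 4 unique sources while the farm as a whole saw only 3.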
There is consequently a need for a system and method for analyzing web-server log files that are concurrently generated, such as those generated by a server farm.
There is a further need for such a system and method that can analyze the log files substantially in real time.
There is still a further need for such a system that can analyze the log files without generating new large files and without the need for substantial additional computing power.
There is also a need for such a system that can analyze log files whether they are concurrently or consecutively generated.