Use of the Internet as a means for accessing information, shopping, entertainment, research, or performing functions such as paying bills or even registering to vote has become ubiquitous. However, relatively few Internet business models have succeeded. Those that have often rely on advertising revenue for a significant portion (if not all) of the business' income. Because of the number of Internet businesses, competition for this advertising revenue is significant. At the same time, advertisers tend to spend their money with those businesses that can demonstrate public appeal. Hence, operators of Internet businesses need means by which they can demonstrate this appeal in an effort to win advertising revenue.
At the same time, advertisers also need means of accurately tracking traffic to those Internet businesses which they are paying to host their ads. Just as television advertising is priced based on the number of individuals likely to watch a particular program, Internet advertising is often priced based on the number of individuals that view a particular web page. Thus, measuring the number of visitors to web sites is critical to the success of both Internet businesses that sell advertising and advertisers that pay such businesses to run their ads.
In addition to its use in connection with advertising, the tracking of traffic to and from various Internet destinations is also of importance to enterprise network owners. For example, a company may wish to monitor its employees' use of company-owned computer systems to browse Internet web sites. Such information may be tracked to determine compliance with corporate policies, to monitor potential security breaches and, more generally, to ensure that the company's computer systems are not being misused.
Further, measuring the number of visitors to particular portions of a web site may assist the owner or designer of that site when it comes to planning modifications or upgrades to the site, determining what sort of content to host at the site, and/or providing navigation aids to/from other portions of the site. Likewise, content providers can benefit from such measurements inasmuch as the measurements may help them determine what content is popular among visitors to web sites and, therefore, what sort of content to produce in the future. In short, the tracking of Internet traffic is of great importance across a wide variety of industries.
Systems for measuring Internet traffic typically revolve around the use of log files. Log files are text files that contain records of file requests made to Internet hosts (e.g., servers and the like). Log files, however, tend to be very large and difficult to read. FIG. 1 is an example of a log file 10, or, rather, a small portion of a log file for a particular day. Although a wealth of information is included in such a file, it is not easy to extract meaningful information from it, and doing so requires a great deal of experience and familiarity with the traffic being analyzed.
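By way of illustration only, the following Python sketch shows how a single log record of the kind described above might be split into named fields. The six-field, whitespace-delimited layout, the field names, and the sample line are assumptions made for this example; they are not taken from FIG. 1, and actual log formats vary from one proxy or server to another.

```python
from typing import NamedTuple


class LogRecord(NamedTuple):
    """One request record; this field layout is hypothetical."""
    timestamp: str
    client: str
    method: str
    url: str
    status: int
    size: int


def parse_line(line: str) -> LogRecord:
    """Split one whitespace-delimited log line into a LogRecord."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    timestamp, client, method, url, status, size = fields
    return LogRecord(timestamp, client, method, url, int(status), int(size))


# Hypothetical sample line, invented for this sketch.
sample = "2004-03-01T09:15:02 10.0.0.20 GET http://Local_News.com/index 200 5120"
record = parse_line(sample)
```

Even with a layout this simple, an analyst must know what each column means before any record can be interpreted, which is part of the difficulty noted above.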
To understand why log files can be so complex and difficult to interpret, consider the network arrangement shown in FIG. 2. In this rather basic arrangement, a user at a computer system 20 is seeking to access information published at a web site hosted by server 22. Suppose, for sake of example, this is a news web site, designated Local_News.com. The user can use a conventional web browser at computer system 20 to access the Local_News.com site, and requests for web pages are passed from computer system 20, through a proxy 24 and Internet 26, to server 22 which hosts the site. In response, the requested content is returned to the web browser.
In this example, suppose the user's computer system 20 is part of a network 28 (e.g., a company's enterprise network) and proxy 24 manages all Internet requests from computers associated with that network. A proxy is a computer system that sits between a client application, such as a web browser running on a user's personal computer, and a remote computer system, such as a server where content is stored. The proxy has several functions, among them: intercepting requests to the server to see if the proxy can fulfill the requests itself (thereby improving performance), and filtering requests, for example to enforce a company's policy that employees not access certain web sites.
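The two proxy functions just mentioned, fulfilling repeated requests locally and filtering requests against a policy, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions and not a description of proxy 24 itself: the class name `SketchProxy`, the `fetch` callback that stands in for a real network request, and the host name `blocked.example` are all hypothetical.

```python
from typing import Callable, Dict, Iterable
from urllib.parse import urlsplit


class SketchProxy:
    """Toy proxy: filters by host name, then serves from a local cache."""

    def __init__(self, fetch: Callable[[str], str], blocked_hosts: Iterable[str]):
        self._fetch = fetch                      # stand-in for a real network fetch
        self._blocked = {h.lower() for h in blocked_hosts}
        self._cache: Dict[str, str] = {}

    def handle(self, url: str) -> str:
        host = (urlsplit(url).hostname or "").lower()
        if host in self._blocked:
            # Filtering: refuse requests that violate the policy.
            return "403 blocked by policy"
        if url not in self._cache:
            # Cache miss: forward the request to the remote server once.
            self._cache[url] = self._fetch(url)
        # Cache hit (or freshly stored copy): answer without going upstream.
        return self._cache[url]


calls = []


def fake_fetch(url: str) -> str:
    """Records each upstream request so cache hits can be observed."""
    calls.append(url)
    return f"content of {url}"


proxy = SketchProxy(fake_fetch, {"blocked.example"})
first = proxy.handle("http://Local_News.com/index")
second = proxy.handle("http://Local_News.com/index")   # served from the cache
denied = proxy.handle("http://blocked.example/page")   # rejected by the filter
```

In this sketch the second request for the same URL never reaches `fake_fetch`, and the blocked request is refused before any upstream fetch is attempted. A real proxy would also have to handle cache expiry and response headers, which are omitted here.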
In this case, the proxy 24 also logs accesses to Internet resources (such as server 22) made by computers associated with network 28 and periodically sends the log files to a log server 30, where the log files are stored for later review by an analyst 32. As should be apparent, one reason the associated log files will be very complex is that they will include information for all accesses made by all computers associated with network 28. This may be dozens or even hundreds of individual computer systems.
Moreover, even simple accesses, such as the access by computer 20 to the Local_News.com server 22, involve multiple transactions. The user of computer system 20 may be interested in viewing the main web page associated with the site (e.g., Local_News.com/index); however, that web page is, in fact, not really a single page. Instead, it (like most web pages) is really a series of computer-readable instructions that tells the user's web browser how to render certain information on the display of the user's computer system 20 and where to find the information objects (images, videos, etc.) to place in designated portions of that layout. Thus, when even a single web page is requested, that request may actually involve many individual transactions from many different content sources, such as an advertisement server 34 (to retrieve advertisements displayed in the context of the requested web page) and media server 36 (to retrieve video and/or images that are to be rendered within the context of the requested web page).
All of these various transactions pass through proxy 24 and are recorded as part of the log file. Thus, even a simple web page request may result in many separate entries within the log file. Multiply such requests by the dozens or hundreds of requests being made by all of the computers associated with network 28, and one can quickly see why log files are such complex documents and why analyzing log files is difficult and time-consuming.
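The multiplication of entries described above can be made concrete with a short Python sketch. The entries below are invented for illustration: the client addresses and the host names `ads.example` and `media.example` (standing in for advertisement server 34 and media server 36) are assumptions, as is the reduction of each log entry to a (client, URL) pair.

```python
from collections import Counter
from typing import Iterable, Tuple
from urllib.parse import urlsplit

Entry = Tuple[str, str]  # (client address, requested URL); hypothetical shape

# Hypothetical entries a proxy might record when one user loads one page
# and a second user requests the same page.
entries = [
    ("10.0.0.20", "http://Local_News.com/index"),
    ("10.0.0.20", "http://ads.example/banner.gif"),
    ("10.0.0.20", "http://media.example/photo1.jpg"),
    ("10.0.0.20", "http://media.example/clip.mpg"),
    ("10.0.0.21", "http://Local_News.com/index"),
]


def entries_per_client(log: Iterable[Entry]) -> Counter:
    """Count how many log entries each client produced."""
    return Counter(client for client, _ in log)


def hosts_contacted(log: Iterable[Entry], client: str) -> Counter:
    """Count entries per remote host for one client."""
    return Counter(urlsplit(url).hostname for c, url in log if c == client)


per_client = entries_per_client(entries)
hosts = hosts_contacted(entries, "10.0.0.20")
```

Here a single page view by the first client accounts for four of the five entries and touches three different hosts, which suggests how quickly a log covering dozens or hundreds of computers can grow.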
Hence, there is a need for a method and system for condensing log files into easier-to-understand documents for analysis.