Web servers typically produce logs of activity that provide a record of the requests received and the responses sent by the web server. The W3C maintains a standard format for web server log files (see, e.g., “http://www.w3.org/TR/WD-logfile”), but other proprietary formats exist. The majority of analysis tools support the standard log file format, but the information recorded about each server transaction is fixed. The server typically appends more recent entries to the end of the log file, and the server may periodically start a new log file (e.g., when the current log reaches a certain size or a period of time passes). The server typically records information about the request, including client IP address, request date/time, page requested, HTTP response code, bytes served, user agent, and referrer. The server can combine these fields into a single file, or separate them into distinct logs, such as an access log, error log, or referrer log. These files are usually not accessible to general Internet users, only to the webmaster or other administrator. Following is an example of a typical web server log.
#Version: 1.0
#Date: 12-Jan-1996 00:00:00
#Fields: time cs-method cs-uri
00:34:23 GET /foo/bar.html
12:21:16 GET /foo/bar.html
12:45:52 GET /foo/bar.html
12:57:34 GET /foo/bar.html
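As an illustration only, the following Python sketch shows one way a log in the format above might be read. It assumes that lines beginning with "#" are directives, that the "#Fields" directive names the whitespace-separated columns of the entries that follow, and that field values contain no embedded spaces. The function name parse_extended_log is hypothetical and is not part of any standard tool.

```python
from typing import Dict, Iterable, List


def parse_extended_log(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse log lines using the most recent #Fields directive (a sketch)."""
    fields: List[str] = []
    entries: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Directive lines look like "#Name: value".
            name, _, value = line[1:].partition(":")
            if name.strip() == "Fields":
                fields = value.split()
            continue
        values = line.split()
        # Assumption: skip entries that do not match the declared fields.
        if fields and len(values) == len(fields):
            entries.append(dict(zip(fields, values)))
    return entries


SAMPLE_LOG = """\
#Version: 1.0
#Date: 12-Jan-1996 00:00:00
#Fields: time cs-method cs-uri
00:34:23 GET /foo/bar.html
12:21:16 GET /foo/bar.html
12:45:52 GET /foo/bar.html
12:57:34 GET /foo/bar.html
"""

entries = parse_extended_log(SAMPLE_LOG.splitlines())
```

Each resulting entry maps a field name (such as "cs-uri") to its value, which is the form that later analysis steps typically consume.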
Webmasters may use statistical analysis of web server logs to examine traffic patterns by time of day, day of week, referrer, or user agent. Analysis of the web server logs can aid efficient web site administration, the provisioning of adequate hosting resources, and the fine-tuning of sales efforts. Web analytics is the measurement, collection, analysis, and reporting of internet data for purposes of understanding and optimizing web site usage. On-site web analytics measure a visitor's journey once on a web site. This includes drivers and conversions (for example, which landing pages encourage people to make a purchase), as well as the performance of the web site in a commercial context. This data is typically compared against organizational performance indicators, and used to improve the audience response to a web site or marketing campaign.
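A traffic pattern by time of day can be sketched, for example, as a simple count of requests per hour. The Python fragment below is a minimal illustration; it assumes each log entry has already been parsed into a dictionary with a "time" field in HH:MM:SS form, and the name requests_per_hour and the sample entries are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def requests_per_hour(entries: Iterable[Dict[str, str]]) -> Counter:
    """Count requests by the hour (0-23) taken from each entry's time field."""
    counts: Counter = Counter()
    for entry in entries:
        hour = int(entry["time"].split(":")[0])
        counts[hour] += 1
    return counts


# Hypothetical, already-parsed log entries used only for illustration.
sample_entries = [
    {"time": "00:34:23", "cs-uri": "/foo/bar.html"},
    {"time": "12:21:16", "cs-uri": "/foo/bar.html"},
    {"time": "12:45:52", "cs-uri": "/foo/bar.html"},
    {"time": "12:57:34", "cs-uri": "/foo/bar.html"},
]

hourly = requests_per_hour(sample_entries)
busiest_hour, busiest_count = hourly.most_common(1)[0]
```

The same grouping could be applied to other fields, such as referrer or user agent, when the log records them.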
Many different vendors provide on-site web analytics software and services. There are two main technological approaches to collecting the data. The first method, log file analysis, reads the log files in which the web server records all its transactions. The second method, page tagging, uses JavaScript on each page to notify a third-party server when a web browser renders a page. Both collect data that can be processed to produce web traffic reports.
Web log analysis software (also called a web log analyzer) is a simple kind of web analytics software that parses a log file from a web server and, based on the values contained in the log file, derives indicators about who visits a web server, when, and how. Usually reports are generated from the log files immediately, but the log files can alternatively be parsed into a database and reports generated on demand. In the early 1990s, web site statistics consisted primarily of counting the number of client requests (or hits) made to the web server. This was a reasonable method initially, since each web site often consisted of a single HTML file. However, with the introduction of images in HTML and of web sites that spanned multiple HTML files, this count became less useful.
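To illustrate why a raw hit count became less useful, the following Python sketch separates total hits from requests that appear to be for pages. Classifying a request as a page by its file extension is an assumption made here for illustration, as are the names PAGE_EXTENSIONS and count_hits_and_page_views and the sample URIs.

```python
import posixpath
from collections import Counter
from typing import Iterable, Tuple

# Assumption: only these extensions (or no extension) count as pages.
PAGE_EXTENSIONS = {".html", ".htm", ""}


def count_hits_and_page_views(uris: Iterable[str]) -> Tuple[int, Counter]:
    """Return the total number of hits and a per-page count of page requests."""
    hits = 0
    page_views: Counter = Counter()
    for uri in uris:
        hits += 1
        path = uri.split("?", 1)[0]  # ignore any query string
        extension = posixpath.splitext(path)[1].lower()
        if extension in PAGE_EXTENSIONS:
            page_views[path] += 1
    return hits, page_views


# Hypothetical requests: two page loads that also fetch images, plus a revisit.
sample_uris = [
    "/index.html",
    "/images/logo.gif",
    "/images/banner.gif",
    "/about.html",
    "/index.html?ref=mail",
]

hits, page_views = count_hits_and_page_views(sample_uris)
```

In this sample, the five hits correspond to only three page requests, so the hit count alone overstates how many pages were viewed.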
The extensive use of web caches also presented a problem for log file analysis. If a person revisits a page, the second request will often be served from the browser's cache, and so the web server will receive no request. This means that the person's path through the site is lost. Caching can be defeated by configuring the web server, but this can result in degraded performance for the visitor to the web site. Web analytics vendors combated this by adding client-side logic that caused the client to report usage information to a log server, which in turn produced more logs to analyze.
Web log analysis still exhibits a number of problems. First, there are delays inherent in the process of logging. There is a delay from the time a request is received to the time it is written to the log (e.g., because of delayed disk cache flushing by the operating system or hardware), delays in getting the logs to the place where they will be analyzed, and delays in processing the logs and providing the data in a format suitable for analysis, such as rows in a database. Each of these delays means that a content provider cannot obtain up-to-the-minute information about how the provider's site is being used. For some types of content, such as live media events, this can mean no meaningful analysis of the event's success until the event is over. Some decisions, such as load balancing, may improve with more immediate information about site usage, which is typically obtained in other ways (such as by monitoring performance counters) that provide only coarse-level data (e.g., without visitor or request information).
In many cases, it is useful to record more information than is supported by the standard log format. Sites sensitive to personal data issues may wish to omit the recording of certain data. Thus, a second problem is that the web site log files may not actually contain the data most relevant to the content provider, and the web server may not support providing any more than a handful of predefined fields of data. For other types of information, the content provider may have to write custom extensions for the web server or simply be unable to obtain the data. Sometimes the content provider can obtain the additional data at the client, but then log analysis software performs an extra step of attempting to correlate client and server logs to provide a complete picture of what happened for a single client. This data correlation also adds delays to web traffic analysis.
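The correlation step described above can be sketched as a join between server-side and client-side records. The Python fragment below is a minimal illustration only; it assumes both logs carry a shared identifier, here a hypothetical "session_id" field, and the function name correlate_logs and all sample records are likewise hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def correlate_logs(
    server_entries: List[Dict[str, str]],
    client_entries: List[Dict[str, str]],
    key: str = "session_id",  # assumed shared identifier
) -> List[Dict[str, object]]:
    """Attach to each server entry the client entries that share its key."""
    client_by_key: Dict[str, List[Dict[str, str]]] = defaultdict(list)
    for client_entry in client_entries:
        client_by_key[client_entry[key]].append(client_entry)
    return [
        {"server": entry, "client": client_by_key.get(entry[key], [])}
        for entry in server_entries
    ]


# Hypothetical records used only for illustration.
server_log = [
    {"session_id": "a1", "cs-uri": "/foo/bar.html"},
    {"session_id": "b2", "cs-uri": "/foo/baz.html"},
]
client_log = [
    {"session_id": "a1", "event": "video-started"},
    {"session_id": "a1", "event": "video-finished"},
]

combined = correlate_logs(server_log, client_log)
```

Because the join can run only after both logs have been collected in one place, it adds to the delays already described.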