Web site owners and Web site builders are interested in various statistics, such as who is browsing a Web site, what content users are requesting or downloading from a Web site, and when users are requesting or downloading content from a Web site. This type of information can be useful for determining the content, designs, or marketing campaigns that attract site visitors, retain them, and induce online purchasing decisions. Typically, Web site activity information is stored in log files on a Web server as the activity occurs.
In general, a log is a record of computer activity used for statistical purposes as well as troubleshooting and recovery. Many log files store information such as incoming command dialog, error and status messages, and transaction detail. Web server logs, which a Web server creates automatically, are a rich source of information about user activity. The basic information stored in a log file is centered around a user request for resources from the Web server. Resources can be Web pages, image files, or other media served by Web servers. The Web server logs information such as when a request is made, the requester's Internet Protocol (IP) address and domain (e.g., .gov, .edu, .com), the resource requested, and the server's success in fulfilling the request. Based upon the information in the logs, Web analytics professionals analyze data such as requests (commonly known as hits), page views, and sessions.
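The fields listed above can be illustrated with a minimal Python sketch. It assumes the server writes entries in the widely used Common Log Format, which the text does not specify. The pattern, the function name `parse_log_line`, and the sample line are hypothetical and serve only as illustration.

```python
import re
from datetime import datetime

# Assumed layout (Common Log Format):
# host ident user [time] "method resource protocol" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<resource>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_log_line(line):
    """Return a dict of request fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # When the request was made, including the UTC offset.
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    # The server's success in fulfilling the request (HTTP status code).
    record["status"] = int(record["status"])
    return record

# Hypothetical sample entry, for illustration only.
sample = '192.0.2.10 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
record = parse_log_line(sample)
```

Each parsed record corresponds to one request, or hit. Grouping records into page views or sessions would require further rules that are not sketched here.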
Web server log file analysis has proven to be an inexact science for a number of reasons. One of the main reasons is data loss in the logs, both at the point of recording and during the transfer from storage to an analytical tool. Most log analysis applications do not deal with the grave issues surrounding data loss. Data loss may be caused, for example, by a Web server going off-line or otherwise temporarily being unable to write log records. Another frequent cause of data loss is the electronic transfer of log files from servers to other computers where they will be analyzed. In many cases the transfer may appear to have been successful although some data was lost in the process. A less frequent but rather large-scale problem is the addition of cloned Web servers to a network serving a Web site.
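As a rough illustration of how an outage at the point of recording might show up in a log, the following Python sketch flags unusually long intervals between consecutive request timestamps. It is a naive heuristic under stated assumptions, not a method described in this document. The function name `find_logging_gaps` and the 30-minute threshold are hypothetical, and a period of genuinely low traffic would be flagged in the same way as an outage.

```python
from datetime import datetime, timedelta

def find_logging_gaps(timestamps, max_gap=timedelta(minutes=30)):
    """Return (start, end) pairs where consecutive records are more than max_gap apart."""
    ordered = sorted(timestamps)
    gaps = []
    # Compare each record time with the one that follows it.
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier > max_gap:
            gaps.append((earlier, later))
    return gaps

# Hypothetical request times, for illustration only.
times = [
    datetime(2000, 10, 10, 13, 0),
    datetime(2000, 10, 10, 13, 5),
    datetime(2000, 10, 10, 15, 0),  # nearly two hours after the previous record
    datetime(2000, 10, 10, 15, 2),
]
gaps = find_logging_gaps(times)
```

With the sample times above, the only interval reported is the one between 13:05 and 15:00.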
In many cases, the users of Web log analysis tools are not the owners of the Web servers, but instead are the content owners (i.e., the Web hosting model) or provide Web analysis services to content owners (i.e., the Web analytics service model). In this situation, the Web analytics professional must rely upon the Web hosting company to have reliable servers and to ensure a reliable log transfer process from the hosting servers to the analytical software. The analytics professional must know when the hosting company's servers have downtime and when network configuration changes, such as the addition of clones for load balancing, are made. Often, this information is not reliably provided to the analytics professional. Additionally, log files may be corrupted during transfer, such as by FTP (File Transfer Protocol), from the Web server to the analytical tool.
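One common way to check whether a file survived a transfer unchanged is to compare a cryptographic digest computed on the source server with one computed on the received copy. The Python sketch below assumes that the hosting company supplies a SHA-256 digest for each log file, which the text does not state. The function names and the temporary demonstration file are hypothetical.

```python
import hashlib
import os
import tempfile

def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def transfer_is_intact(local_path, expected_digest):
    """Return True if the received file matches the digest supplied by the source."""
    return sha256_of_file(local_path) == expected_digest.lower()

# Hypothetical demonstration: a temporary file stands in for a transferred log.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example log contents\n")
    demo_path = tmp.name

source_digest = hashlib.sha256(b"example log contents\n").hexdigest()
intact = transfer_is_intact(demo_path, source_digest)

# A digest of different content simulates a copy that lost data in transit.
other_digest = hashlib.sha256(b"example log").hexdigest()
mismatch_detected = not transfer_is_intact(demo_path, other_digest)

os.remove(demo_path)
```

A matching digest only shows that the copy equals the source file. It says nothing about records the server never wrote in the first place.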
As a result, log analyses may be based upon only partial data sets, with the degree and timing of data loss being random. When data loss is suspected, many Web analytics professionals currently rely upon guessing when it occurred and how large its effect was. To address the issue, they usually either merely note that data loss happened or attempt to supply subjective estimates of summary statistics. For example, they may guess at numbers or simply reuse the data from the last reporting period. Because in most cases the analytics professionals are actively involved in trying to remedy whatever caused the data loss so as to avoid it in the future, they often do not spend a great deal of time applying systematic methods to account for data loss. Poor quality data is then passed on to the end users for whom Web site activity statistical reports are generated, such as Web site designers or marketing personnel.
Therefore, it would be advantageous to have an improved systematic method and apparatus for identifying when data loss occurs, identifying how much data loss has occurred, and providing remedial action for the data loss to generate a more accurate analysis.