Field of the Invention
The present invention relates in general to the field of information processing. In one aspect, the present invention relates to a system and method for collecting and analyzing Internet website traffic.
Description of the Related Art
Most website servers can be configured to store information in a log file for every website page request they receive. Statistics concerning each request for a page from the website are recorded in the log file in a linear format, in which each request is logged separately from every other request and the requests are logged in approximately chronological order. The statistics typically include the date, time of day, browser location, type of request, uniform resource identifier (“URI”), referring link, cookie or session identification, and the like. The log file is created automatically as hypertext markup language (“HTML”) documents are requested by browsers accessing the website server. The resulting record can be analyzed to process and summarize the collected statistics and to produce a website traffic report.
The steps for retrieving HTML documents from a website server that includes a logging function are as follows. First, a web browser sends a request to the website server for an HTML document. Next, the website server receives the request from the browser. The website server then returns the requested HTML document to the web browser. Finally, the website server logs the transaction to the log file.
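The request, response, and logging sequence described above can be sketched as follows. This is a minimal illustration in Python with no real network traffic; the names `DOCUMENTS`, `log_file`, and `handle_request`, the in-memory list standing in for the log file, and the space-separated record layout are all assumptions made for the example and are not taken from any particular web server.

```python
from datetime import datetime

# Hypothetical document store; a real server would read files from disk.
DOCUMENTS = {"/agn/logon.jsp": "<html>logon</html>"}

# In-memory stand-in for the on-disk log file (an assumption for brevity).
log_file = []


def handle_request(ip, method, uri, cookie, now):
    """Return the requested document, then log the transaction."""
    body = DOCUMENTS.get(uri, "")
    # One record per request, appended in arrival (chronological) order.
    log_file.append(f"{now:%Y-%m-%d %H:%M:%S} {ip} {method} {uri} {cookie}")
    return body


body = handle_request(
    "192.168.24.245", "GET", "/agn/logon.jsp", "sessionid=b828",
    datetime(2001, 2, 27, 0, 30, 17),
)
print(log_file[0])
# 2001-02-27 00:30:17 192.168.24.245 GET /agn/logon.jsp sessionid=b828
```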
As a result of the foregoing, a log file for a website server may contain statistical information for a variety of different users and sessions. For example, a log file containing ten web server requests from four different client web browsers might include the following data:
date          time      ip              method  uri-stem                  cookie
2001 Feb. 27  00:23:00  192.168.11.226  GET     /agn/LoadingPage.html     sessionid=a562
2001 Feb. 27  00:23:00  192.168.11.226  GET     /agn/lib/DOMLevel2.js     sessionid=a562
2001 Feb. 27  00:30:17  192.168.24.245  GET     /agn/logon.jsp            sessionid=b828
2001 Feb. 27  01:06:59  192.168.11.226  GET     /agn/LoadingPage.html     sessionid=a562
2001 Feb. 27  02:10:17  10.0.48.179     GET     /agn/logon.jsp            sessionid=c437
2001 Feb. 27  02:17:19  10.0.48.179     GET     /agn/LoadingPage.html     sessionid=c437
2001 Feb. 27  02:27:27  10.0.48.180     GET     /agn/images/down.gif      sessionid=d140
2001 Feb. 27  02:36:42  10.0.48.179     GET     /agn/JavaScript/grid.js   sessionid=c437
2001 Feb. 27  03:25:50  10.0.48.180     GET     /reports/ak013/order.gif  sessionid=d140
2001 Feb. 27  03:56:30  192.168.11.226  GET     /agn/images/logo.gif      sessionid=a562
A simple analysis of this example log file will examine each line in the log file sequentially, keeping only summary information as the processing moves from one line to the next. For example, an analysis of this type might calculate the following pieces of summary information:
There were 3 client requests to the web server in the first hour (between 00:00:00 and 01:00:00).
There was 1 client request to the web server in the second hour (between 01:00:00 and 02:00:00).
There were 4 client requests to the web server in the third hour (between 02:00:00 and 03:00:00).
There were 2 client requests to the web server in the fourth hour (between 03:00:00 and 04:00:00).
There were visits from 4 distinct IP addresses (web client machines).
URIs beginning with “/agn” were visited 9 times.
URIs beginning with “/reports” were visited once.
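The sequential, summary-only analysis described above can be sketched as follows. This is a minimal illustration in Python rather than the method of any particular log analyzer; the whitespace-separated record layout, the ISO form of the date, and the names `LOG_LINES` and `summarize` are assumptions made for the example.

```python
from collections import Counter

# The ten example requests; the whitespace-separated layout
# (date time ip method uri-stem cookie) is an assumption.
LOG_LINES = """\
2001-02-27 00:23:00 192.168.11.226 GET /agn/LoadingPage.html sessionid=a562
2001-02-27 00:23:00 192.168.11.226 GET /agn/lib/DOMLevel2.js sessionid=a562
2001-02-27 00:30:17 192.168.24.245 GET /agn/logon.jsp sessionid=b828
2001-02-27 01:06:59 192.168.11.226 GET /agn/LoadingPage.html sessionid=a562
2001-02-27 02:10:17 10.0.48.179 GET /agn/logon.jsp sessionid=c437
2001-02-27 02:17:19 10.0.48.179 GET /agn/LoadingPage.html sessionid=c437
2001-02-27 02:27:27 10.0.48.180 GET /agn/images/down.gif sessionid=d140
2001-02-27 02:36:42 10.0.48.179 GET /agn/JavaScript/grid.js sessionid=c437
2001-02-27 03:25:50 10.0.48.180 GET /reports/ak013/order.gif sessionid=d140
2001-02-27 03:56:30 192.168.11.226 GET /agn/images/logo.gif sessionid=a562
""".splitlines()


def summarize(lines):
    """Scan the log once, keeping only running summary counters."""
    requests_per_hour = Counter()
    requests_per_prefix = Counter()
    distinct_ips = set()
    for line in lines:
        date, time, ip, method, uri, cookie = line.split()
        requests_per_hour[int(time[:2])] += 1
        distinct_ips.add(ip)
        # First path segment, e.g. "/agn" from "/agn/logon.jsp".
        requests_per_prefix["/" + uri.split("/")[1]] += 1
    return requests_per_hour, len(distinct_ips), requests_per_prefix


hours, ip_count, prefixes = summarize(LOG_LINES)
print(dict(hours))     # {0: 3, 1: 1, 2: 4, 3: 2}
print(ip_count)        # 4
print(dict(prefixes))  # {'/agn': 9, '/reports': 1}
```

Because only the counters and the set of IP addresses survive from one line to the next, memory use in this sketch grows with the number of distinct values rather than with the size of the log file.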
In a more sophisticated analysis of the log file, more detailed information may be collected at the session level. Such an analysis will use some method (such as a cookie, IP address, or other identifier) to determine which requests belong to the same user session. It will then examine all records relating to the same session together to gather a complete and detailed picture of the actions performed by each individual user.
To do this analysis at the session level, it is often helpful to group log file records for the same session together, then process each group of records session-by-session. For example, the log file records described above would be grouped as follows:
date          time      ip              method  uri-stem                  cookie
Group 1
2001 Feb. 27  00:23:00  192.168.11.226  GET     /agn/LoadingPage.html     sessionid=a562
2001 Feb. 27  00:23:00  192.168.11.226  GET     /agn/lib/DOMLevel2.js     sessionid=a562
2001 Feb. 27  01:06:59  192.168.11.226  GET     /agn/LoadingPage.html     sessionid=a562
2001 Feb. 27  03:56:30  192.168.11.226  GET     /agn/images/logo.gif      sessionid=a562
Group 2
2001 Feb. 27  00:30:17  192.168.24.245  GET     /agn/logon.jsp            sessionid=b828
Group 3
2001 Feb. 27  02:10:17  10.0.48.179     GET     /agn/logon.jsp            sessionid=c437
2001 Feb. 27  02:17:19  10.0.48.179     GET     /agn/LoadingPage.html     sessionid=c437
2001 Feb. 27  02:36:42  10.0.48.179     GET     /agn/JavaScript/grid.js   sessionid=c437
Group 4
2001 Feb. 27  02:27:27  10.0.48.180     GET     /agn/images/down.gif      sessionid=d140
2001 Feb. 27  03:25:50  10.0.48.180     GET     /reports/ak013/order.gif  sessionid=d140
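Grouping of this kind can be sketched as follows for a log small enough to hold in memory. This is a minimal illustration in Python; the whitespace-separated record layout, the ISO form of the date, the use of the cookie field as the session identifier, and the names `LOG_LINES`, `session_key`, and `group_by_session` are assumptions made for the example.

```python
from collections import defaultdict

# The ten example requests; the whitespace-separated layout
# (date time ip method uri-stem cookie) is an assumption.
LOG_LINES = """\
2001-02-27 00:23:00 192.168.11.226 GET /agn/LoadingPage.html sessionid=a562
2001-02-27 00:23:00 192.168.11.226 GET /agn/lib/DOMLevel2.js sessionid=a562
2001-02-27 00:30:17 192.168.24.245 GET /agn/logon.jsp sessionid=b828
2001-02-27 01:06:59 192.168.11.226 GET /agn/LoadingPage.html sessionid=a562
2001-02-27 02:10:17 10.0.48.179 GET /agn/logon.jsp sessionid=c437
2001-02-27 02:17:19 10.0.48.179 GET /agn/LoadingPage.html sessionid=c437
2001-02-27 02:27:27 10.0.48.180 GET /agn/images/down.gif sessionid=d140
2001-02-27 02:36:42 10.0.48.179 GET /agn/JavaScript/grid.js sessionid=c437
2001-02-27 03:25:50 10.0.48.180 GET /reports/ak013/order.gif sessionid=d140
2001-02-27 03:56:30 192.168.11.226 GET /agn/images/logo.gif sessionid=a562
""".splitlines()


def session_key(line):
    """Return the session identifier from the cookie field (last column)."""
    return line.split()[-1].split("=", 1)[1]


def group_by_session(lines):
    """Collect records per session, preserving log order within each group."""
    groups = defaultdict(list)
    for line in lines:
        groups[session_key(line)].append(line)
    return groups


sessions = group_by_session(LOG_LINES)
for sid, records in sessions.items():
    print(sid, len(records))
# a562 4
# b828 1
# c437 3
# d140 2
```

This sketch holds every record in memory at once, so it is suitable only when the log file fits in available memory.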
For log files that are larger than the amount of available random access memory (“RAM”), grouping log file entries by session can consume substantial computational resources. Conventional grouping techniques involve reading the log file request-by-request and sorting the requests into a new file, set of files, database, or index on the file system that is structured so that requests in the same session can be located quickly. For example, the log file could be imported into a database table in which each line of the log file is stored as a single record and one field of the record identifies the session to which the request belongs. With this arrangement, standard database techniques can be used to sort the table by the session field and then read the records out of the database in session-field order. However, this technique requires creating an extra copy of the log file, and it incurs a significant speed penalty from the time required to extract data from the log file for storage in the database.
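The conventional database-import approach can be sketched as follows. This is a minimal illustration using Python's standard `sqlite3` module with an in-memory database; the table name `requests`, its column names, the whitespace-separated record layout, and the ISO form of the date are assumptions made for the example. Note that every log line is copied into the table before any session can be read back, which is the extra copy and extraction cost referred to above.

```python
import sqlite3

# The ten example requests; the whitespace-separated layout
# (date time ip method uri-stem cookie) is an assumption.
LOG_LINES = """\
2001-02-27 00:23:00 192.168.11.226 GET /agn/LoadingPage.html sessionid=a562
2001-02-27 00:23:00 192.168.11.226 GET /agn/lib/DOMLevel2.js sessionid=a562
2001-02-27 00:30:17 192.168.24.245 GET /agn/logon.jsp sessionid=b828
2001-02-27 01:06:59 192.168.11.226 GET /agn/LoadingPage.html sessionid=a562
2001-02-27 02:10:17 10.0.48.179 GET /agn/logon.jsp sessionid=c437
2001-02-27 02:17:19 10.0.48.179 GET /agn/LoadingPage.html sessionid=c437
2001-02-27 02:27:27 10.0.48.180 GET /agn/images/down.gif sessionid=d140
2001-02-27 02:36:42 10.0.48.179 GET /agn/JavaScript/grid.js sessionid=c437
2001-02-27 03:25:50 10.0.48.180 GET /reports/ak013/order.gif sessionid=d140
2001-02-27 03:56:30 192.168.11.226 GET /agn/images/logo.gif sessionid=a562
""".splitlines()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE requests (req_date TEXT, req_time TEXT, ip TEXT, "
    "method TEXT, uri_stem TEXT, session_id TEXT)"
)

# Import step: one database record per log line (the extra copy).
rows = []
for line in LOG_LINES:
    date, time, ip, method, uri, cookie = line.split()
    rows.append((date, time, ip, method, uri, cookie.split("=", 1)[1]))
conn.executemany("INSERT INTO requests VALUES (?, ?, ?, ?, ?, ?)", rows)

# Read the records back out in session-field order.
ordered = conn.execute(
    "SELECT session_id, req_time, uri_stem FROM requests "
    "ORDER BY session_id, req_date, req_time"
).fetchall()
for record in ordered:
    print(record)
conn.close()
```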
Therefore, a need exists for methods and/or apparatuses for improving the processing of log file records to quickly and efficiently transfer data to a session history database. Further limitations and disadvantages of conventional systems will become apparent to one of skill in the art after reviewing the remainder of the present application with reference to the drawings and detailed description which follow.