Extracting program events from running programs and placing a record of these events in a database for subsequent processing can be particularly difficult when the events occur at a very high rate.
The running program can be of any type. An example is a server (such as a Web server) that processes requests from clients. Events can be the arrival of such requests, as well as the completion of the servicing of requests. Events can also be anything else, such as a failure being encountered or a change in the availability of system resources.
For example, consider a large Internet service that uses a farm of Web servers to expose its content to end users. Such Web farms can have hundreds or thousands of individual Web servers. Every time a user views a particular page, an event is triggered. Such an Internet service would like to record these events in order to analyze them, a practice known as clickstream analysis.
Moreover, if additional attributes are recorded along with each event, the quality of the analysis can increase. Analyzing clickstreams can yield extremely valuable information for determining user demographics and preferences and for tracking usage metrics for products and marketing campaigns by various attributes (type, country, etc.). Executives can track growth trends for the Web site as a whole, while individual business units can drill down and track their specific programs and products on predefined user segments. For such analysis to be effective, additional information must be recorded with each click (e.g., information about the user, how long the processing took, etc.).
Several approaches have been proposed to solve this challenge. One approach is to harvest and process the logs generated by the server of interest (e.g., the Web server). Another approach is to instrument the responses returned to end users in a way that causes the Web browsers of those end users to automatically report events (e.g., tagging Web pages with active code). A third approach is to extract the events directly from the running server.
In the log-processing approach, logging is turned on in the server (such as a Web server, application server, database server, or any other kind of server) and the resulting logs are then collected. These logs are then parsed and interpreted, and deposited either in a database or in some other form of repository. The process of taking these logs and placing them into a repository is often called ETL (Extract-Transform-Load).
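As an illustration of the log-processing approach, the following is a minimal sketch of one ETL pass. It assumes the server writes the Common Log Format and that the repository is a SQLite database; neither is specified above, and the names `parse_log_line`, `load_events`, and the `events` table are hypothetical.

```python
import re
import sqlite3
from datetime import datetime

# Assumption: the server emits Common Log Format lines, for example
# 127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /a.gif HTTP/1.0" 200 2326
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_log_line(line):
    """Extract and transform: turn one log line into an event record, or None."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None  # unparseable line; a real pipeline would count or log these
    when = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    size = match.group("size")
    return {
        "host": match.group("host"),
        "user": None if match.group("user") == "-" else match.group("user"),
        "ts": when.isoformat(),
        "method": match.group("method"),
        "path": match.group("path"),
        "status": int(match.group("status")),
        "bytes": 0 if size == "-" else int(size),
    }


def load_events(lines, conn):
    """Load: insert all parseable lines into the (hypothetical) events table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        "host TEXT, user TEXT, ts TEXT, method TEXT, "
        "path TEXT, status INTEGER, bytes INTEGER)"
    )
    rows = [r for r in map(parse_log_line, lines) if r is not None]
    conn.executemany(
        "INSERT INTO events VALUES "
        "(:host, :user, :ts, :method, :path, :status, :bytes)",
        rows,
    )
    conn.commit()
    return len(rows)


SAMPLE = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /a.gif HTTP/1.0" 200 2326'
connection = sqlite3.connect(":memory:")
loaded = load_events([SAMPLE, "not a log line"], connection)
```

In this sketch the events only become queryable after the log lines have been collected, parsed, and inserted, which is the source of the delay this approach introduces.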
One drawback of the log-processing approach is that the data in the database may not be current enough for data analytics. For example, it may take a significant period of time for the logs to be obtained and processed. During this time, the data in the logs is unavailable for analysis, and the value of the data declines as it becomes less fresh.
Conventional Web analytics companies often use a “Web beacon” technique (formerly known as the “Web bug” approach) to capture traffic data. This approach requires modifying the production code of a Web property to insert into the Web pages of interest a small 1×1 pixel image or some JavaScript code that carries information about the particular page view. The URL of the pixel (or the JavaScript) points to the servers of the Web analytics company, where information about the initial request is logged. The logged data is then analyzed through online interfaces that generate Web analytics reports.
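The beacon mechanism can be sketched as follows. This is an illustrative example only: the collector address `https://analytics.example.com/pixel.gif`, the function name `beacon_tag`, and the query parameter names are hypothetical and do not describe any particular vendor.

```python
import html
from urllib.parse import urlencode

# Hypothetical collector endpoint operated by the Web analytics company.
COLLECTOR_URL = "https://analytics.example.com/pixel.gif"


def beacon_tag(page, user_id):
    """Build the 1x1 image tag that is embedded in a page of interest.

    The page-view details travel as query parameters on the pixel URL, so
    the browser reports them when it fetches the image.
    """
    query = urlencode({"page": page, "uid": user_id})
    # Escape the URL so that '&' is valid inside an HTML attribute.
    src = html.escape(f"{COLLECTOR_URL}?{query}", quote=True)
    return f'<img src="{src}" width="1" height="1" alt="">'


tag = beacon_tag("/products/42", "u-1001")
```

Every page view that carries such a tag causes the browser to issue a second HTTP request to the collector, in addition to the request for the page itself.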
Although the above model is currently used at many small and medium-sized Web sites, it presents significant limitations for use in large-scale environments that have stringent requirements for freshness, availability, and visibility into user behavior. Conventional Web analytics companies often struggle at top Web properties: loading and analyzing the clickstream data can become unacceptably slow, the amount of history is often small, and customers have to compromise on either data detail or time horizon. The end result is that large-scale Web analytics becomes very expensive due to the nonlinear increase in the cost of these systems, reaching many millions of dollars per year for a large site.
One problem here is the inefficiency of the event collection process. For the event to be recorded, some information is embedded in the result sent to the end user; the end user's browser then automatically acts on that information and sends information to yet another service (in some sense, another event). Typically, a browser automatically fetches the Web beacon and generates an HTTP request to the Web analytics service provider, which then records it. This costs time, processing power, and network bandwidth.
Another fundamental limitation of Web beacons is that they cannot capture requests for non-HTML content, such as images, streaming media, PDFs, etc. With media content becoming increasingly important for Web properties, this limitation has a serious impact on the value of the analytics solution.
The direct event extraction approach can consist of placing a special piece of code in the server that witnesses the various events, and then extracting the event directly from there to the target repository.
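One way to picture such a piece of code is as a thin wrapper around the server's request handler. The sketch below assumes a Python WSGI server purely for illustration; the class name `EventTap`, the `sink` callable, and the event fields are hypothetical, and the target repository is stood in for by a plain list.

```python
import time


class EventTap:
    """WSGI middleware that witnesses request arrival and completion events."""

    def __init__(self, app, sink):
        self.app = app
        self.sink = sink  # any callable that accepts one event dict

    def __call__(self, environ, start_response):
        # This method is a generator, so the code below runs when the
        # server starts iterating over the response.
        path = environ.get("PATH_INFO", "")
        started = time.monotonic()
        self.sink({"type": "arrival", "path": path})
        captured = {}

        def recording_start_response(status, headers, exc_info=None):
            captured["status"] = status  # remember the status for the event
            return start_response(status, headers, exc_info)

        body = self.app(environ, recording_start_response)
        try:
            yield from body  # stream the response through unchanged
        finally:
            if hasattr(body, "close"):
                body.close()
            self.sink({
                "type": "completion",
                "path": path,
                "status": captured.get("status"),
                "elapsed_s": time.monotonic() - started,
            })


def demo_app(environ, start_response):
    """Stand-in for the real server application."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


events = []  # stand-in for the target repository
tapped = EventTap(demo_app, events.append)
response = b"".join(tapped({"PATH_INFO": "/index.html"}, lambda s, h, e=None: None))
```

Because the wrapper sits inside the server, it sees every request regardless of content type, and it can attach attributes such as processing time without involving the end user's browser.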
The main challenge in direct event extraction is that the database on the receiving end of these events must be able to sustain the high rates at which events are generated. For example, a service with 3,000 Web servers can receive 3,000,000 clicks per second at peak time (an average of 1,000 clicks per second per server), which means that at least 3,000,000 events must be extracted and inserted into a database every second. Even if only 1 KB of data is collected for each click, the aggregate data bandwidth will be roughly 3 gigabytes per second. In this example, if each event were provided directly to the database, the database would have to be capable of performing an impractical 3,000,000 transactions per second.
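The figures in this example can be checked with a few lines of arithmetic. The sketch below uses the numbers given above; whether 1 KB means 1,000 or 1,024 bytes is not stated, so both readings are shown.

```python
EVENTS_PER_SECOND = 3_000_000  # peak clicks per second across the farm
WEB_SERVERS = 3_000            # servers in the farm


def aggregate_bytes_per_second(events_per_second, bytes_per_event):
    """Total event data produced per second across all servers."""
    return events_per_second * bytes_per_event


# Average load carried by each individual server.
events_per_server = EVENTS_PER_SECOND // WEB_SERVERS

# Aggregate bandwidth under both readings of "1 KB".
decimal_rate = aggregate_bytes_per_second(EVENTS_PER_SECOND, 1_000)
binary_rate = aggregate_bytes_per_second(EVENTS_PER_SECOND, 1_024)

# One database transaction per event means one transaction per click.
transactions_per_second = EVENTS_PER_SECOND

print(f"{events_per_server} events/s per server")
print(f"{decimal_rate / 1e9:.3f} GB/s with 1 KB = 1,000 bytes")
print(f"{binary_rate / 1e9:.3f} GB/s with 1 KB = 1,024 bytes")
```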