A website analytics system can collect telemetry data for a website. The telemetry data may identify which webpages or other content have been requested or downloaded and by which client computers. For example, a website analytics system can generate various statistics about a website, such as the number of times a particular webpage of the website was served, the number of unique visitors the website received each day over the past month, where the visitors were geographically located, or what browser each client computer was using. Website administrators may use the statistics to gauge how much advertisers should pay per impression, determine which webpages are most popular, or determine what types of client devices are requesting data or webpages. Website administrators may also use the statistics to determine how to allocate resources, such as hardware or developer time. For example, if the server computers for a website are taking more than a particular amount of time to serve a particular number of webpage impressions, then a website administrator may instantiate additional server computers to reduce the amount of time the website takes to serve the same number of webpage impressions.
A website analytics system may generate and receive large amounts of telemetry data for a website. Due, at least in part, to the volume of information generated, the website analytics system may take a long time to process the telemetry data to generate one or more statistics. Accordingly, a website analytics system may dedicate a first set of computers or processes to receive and store telemetry data, and a second set of computers or processes to generate one or more statistics from the telemetry data “offline”. Generating a statistic offline means generating the statistic after the telemetry data used to generate the statistic is received, aggregated or combined with other telemetry data, and stored in persistent storage for subsequently requested reports. In contrast, generating a statistic in “real-time” means generating the statistic while the telemetry data affecting the statistic is being received, or shortly thereafter.
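The distinction between offline and real-time generation can be illustrated with a minimal sketch. The record shape (a client identifier paired with a requested page) and the names offline_page_counts and RealTimePageCounts are hypothetical and chosen only for illustration; an actual analytics system would read from persistent storage and a network stream rather than in-memory lists.

```python
from collections import Counter
from typing import Iterable, Tuple

# Hypothetical telemetry record: (client identifier, requested page).
Record = Tuple[str, str]


def offline_page_counts(stored_records: Iterable[Record]) -> Counter:
    """Offline: scan records that were already stored, then compute the statistic."""
    return Counter(page for _client, page in stored_records)


class RealTimePageCounts:
    """Real-time: update the statistic as each record is received."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def on_record(self, record: Record) -> None:
        # Called once per incoming record, so the counts are always current.
        _client, page = record
        self.counts[page] += 1
```

Both approaches arrive at the same counts for the same records; they differ in when the result becomes available relative to when the telemetry data is received.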
Some statistics may be more useful if generated and acted upon in real-time rather than offline. For purposes of illustrating a clear example, assume a website comprises a set of server computers in a cloud computing infrastructure, where the set of server computers are configured to support up to 100 different active client computers concurrently streaming video. Assume further that the website administrator may allocate or deallocate cloud resources to the website at will. If more than 100 client computers are actively trying to stream video from the website, then the website may be overwhelmed and the client computers may have intermittent pausing or longer than expected downloading times. The website administrator may not realize the current number of server computers allocated to streaming video is insufficient until users begin to complain or a latent offline statistic or report is generated. In contrast, if a website administrator could determine in real-time how many client devices are currently streaming video, then the website administrator could allocate or deallocate resources, such as cloud server computers, as needed. However, determining a statistic, such as how many client devices are active or currently downloading video, may be difficult.
One way to determine how many client devices are actively communicating with a webserver is to maintain a database table of active devices comprising device identifiers that identify which devices have requested data within a particular amount of time. For example, each time a client computer requests data from a server computer, a tracker process may store an identifier of the client computer and a current timestamp in a record of the database table. An identifier may be the IP address or MAC address of the client computer or a username of a user using the client computer. If a request is received from an identifier already in a record in the database table, then the tracker process may update the timestamp in the record to correspond to the current time. One or more reaper processes may periodically review the list of identifiers and remove from the table any identifier whose timestamp does not correspond to a time that is within a particular range of the current time. One or more aggregation processes may periodically count the number of identifiers that are currently in the database table to determine how many client computers are connected to the webserver. The one or more aggregation processes may store the counts in a database for future reports generated offline.
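The tracker, reaper, and aggregation steps described above can be sketched as follows. This is a minimal single-process illustration that assumes an in-memory dictionary in place of the database table; the names ActiveDeviceTable, record_request, reap, and count_active, and the 30-second activity window, are hypothetical and are not taken from any particular system.

```python
import time
from typing import Dict, Optional


class ActiveDeviceTable:
    """In-memory stand-in for the database table of active devices."""

    def __init__(self, window_seconds: float = 30.0) -> None:
        # Hypothetical activity window: a device counts as active if it
        # made a request within this many seconds of the current time.
        self.window_seconds = window_seconds
        self._last_seen: Dict[str, float] = {}

    def record_request(self, device_id: str, now: Optional[float] = None) -> None:
        """Tracker step: insert the identifier or refresh its timestamp."""
        self._last_seen[device_id] = time.time() if now is None else now

    def reap(self, now: Optional[float] = None) -> int:
        """Reaper step: remove stale records and return how many were removed."""
        now = time.time() if now is None else now
        cutoff = now - self.window_seconds
        stale = [d for d, ts in self._last_seen.items() if ts < cutoff]
        for device_id in stale:
            del self._last_seen[device_id]
        return len(stale)

    def count_active(self) -> int:
        """Aggregation step: count the records currently in the table."""
        return len(self._last_seen)
```

The count is only as fresh as the most recent reaper pass, which is one reason the periodic approach may lag behind the true number of active devices.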
A system that implements the method discussed above can require a massive amount of computational and engineering resources. For example, to make sure one process, such as a tracker process, does not try to add or update an entry while a second process, such as a reaper process or aggregation process, scans or deletes an entry from the database, the database may impose locks on the database table. Locking may quickly slow down performance of the database to the point where the aggregation processes are no longer accurately counting how many clients are actively communicating with the webserver. Accordingly, a website may maintain the database table as one or more shards distributed across one or more database server computers, which may allow a process to lock a shard or a row in the shard without locking the portions of the table in other shards. The website may maintain a large number of web server computers coupled to the database servers to serve content and make database requests to update the database table when content is requested by each client computer. The website may maintain one or more computers and/or processes to maintain coherency and/or execute the reaper processes and/or the aggregation processes.
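The effect of sharding on locking can be illustrated with a minimal sketch, assuming threads within one process in place of separate database server computers. Each shard has its own lock, so a tracker updating one shard does not block a reaper or aggregator working on another. The class name ShardedActiveDeviceTable, the CRC32-based shard selection, and the default shard count are hypothetical choices made only for illustration.

```python
import threading
import zlib
from typing import Dict, List


class ShardedActiveDeviceTable:
    """Active-device table split into shards, each guarded by its own lock."""

    def __init__(self, num_shards: int = 4, window_seconds: float = 30.0) -> None:
        self.window_seconds = window_seconds
        self._shards: List[Dict[str, float]] = [{} for _ in range(num_shards)]
        self._locks = [threading.Lock() for _ in range(num_shards)]

    def _shard_index(self, device_id: str) -> int:
        # Stable hash so a given identifier always maps to the same shard.
        return zlib.crc32(device_id.encode("utf-8")) % len(self._shards)

    def record_request(self, device_id: str, now: float) -> None:
        """Tracker step: lock only the shard that owns this identifier."""
        i = self._shard_index(device_id)
        with self._locks[i]:
            self._shards[i][device_id] = now

    def reap(self, now: float) -> int:
        """Reaper step: visit shards one at a time, locking each in turn."""
        cutoff = now - self.window_seconds
        removed = 0
        for shard, lock in zip(self._shards, self._locks):
            with lock:
                stale = [d for d, ts in shard.items() if ts < cutoff]
                for device_id in stale:
                    del shard[device_id]
                removed += len(stale)
        return removed

    def count_active(self) -> int:
        """Aggregation step: sum per-shard counts, locking each shard briefly."""
        total = 0
        for shard, lock in zip(self._shards, self._locks):
            with lock:
                total += len(shard)
        return total
```

Because the shards are locked one at a time, the total returned by count_active is not a single consistent snapshot of the whole table; keeping such counts coherent across shards is part of the additional engineering cost noted above.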
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.
While each of the drawing figures illustrates a particular embodiment for purposes of illustrating a clear example, other embodiments may omit, add to, reorder, and/or modify any of the elements shown in the drawing figures. For purposes of illustrating clear examples, one or more figures may be described with reference to one or more other figures, but using the particular arrangement illustrated in the one or more other figures is not required in other embodiments.