The present invention relates to on-demand web analytics, and in particular to analyzing web activity in relation to user segments.
The field of web analytics involves the collection of large amounts of data characterizing internet users' web usage behavior. Examples of data that can be collected include the web page from which a user came, the pages viewed on a web site, whether a purchase was made from the web site, the products viewed on the web site, etc. Such data can be collected for any number of different users over any number of different periods of time. The field of web analytics is dedicated to examining the vast amounts of data collected and determining whether there are any useful patterns or similarities in the data across different users or groups of users. Discovery of such patterns or similarities may allow a web site owner to customize the web site to be more conducive to achieving certain goals, such as increasing sales.
One task that may be performed by a web analytics system is determining what web activities users satisfying certain criteria have also engaged in. For example, an administrator of an e-commerce web site, who may also be referred to as a client of the web analytics system, may wish to know which web pages have been viewed by users who made a purchase. Such information could be useful to the administrator in determining which pages are most likely to result in a sale. As another example, the administrator may wish to know which pages were viewed in a subsequent month by the users who made a purchase in a given month.
Web analytics systems capture vast amounts of data related to users' behavior on web sites. In some cases, analytics systems capture data related to every single page a user views, as well as any item on the page that the user clicks. Although the captured data is vast, it is relatively meaningless until processed to reveal patterns that are of value to a web site administrator. Often, the administrator herself may not know at the time of collection what data will be meaningful. The administrator may wish to pose numerous “what if” type questions to the web analytics system in order to reveal useful data. For example, an administrator may ask “of users who purchased a product in January, how many viewed pages in February?” The results of this query may reveal no interesting patterns. The administrator may then ask “of users who purchased a product in February, how many viewed pages in January?” The second query may reveal that users who make purchases on the web site generally only do so after multiple visits.
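The two “what if” questions above can be illustrated with a minimal sketch. The `Activity` record, the helper names, and the sample log are hypothetical and chosen only for illustration; they do not represent any particular analytics system.

```python
from collections import namedtuple

# Hypothetical record layout: who did what, and in which month.
Activity = namedtuple("Activity", ["user_id", "month", "kind"])

def purchasers_in(log, month):
    """Return the segment: IDs of users who made a purchase in `month`."""
    return {a.user_id for a in log if a.kind == "purchase" and a.month == month}

def viewers_in(log, segment, month):
    """Return the members of `segment` who viewed at least one page in `month`."""
    return {a.user_id for a in log
            if a.kind == "view" and a.month == month and a.user_id in segment}

# Illustrative data only.
log = [
    Activity("u1", "Jan", "view"),
    Activity("u1", "Feb", "purchase"),
    Activity("u2", "Jan", "purchase"),
    Activity("u3", "Jan", "view"),
    Activity("u3", "Feb", "view"),
    Activity("u3", "Feb", "purchase"),
]

# "Of users who purchased in January, how many viewed pages in February?"
jan_then_feb = viewers_in(log, purchasers_in(log, "Jan"), "Feb")
# "Of users who purchased in February, how many viewed pages in January?"
feb_then_jan = viewers_in(log, purchasers_in(log, "Feb"), "Jan")

print(len(jan_then_feb), len(feb_then_jan))  # prints "0 2"
```

In this sample log the first query returns nothing, while the second shows that both February purchasers had already visited in January.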
Providing the ability for a web site administrator to query web usage data using criteria that are not known in advance poses several challenges to designers of web analytics systems. One of these challenges is that the vast amount of collected data may not necessarily be structured in the form that is most conducive to answering the particular query presented. Continuing with the previous example, the web site administrator may have a list of all users that made a purchase in a given month. These users may be identified by a user ID. The group of users who satisfy the specified criteria may be referred to as a segment. The administrator may wish to know what web pages were viewed in the previous month by the users in the segment.
A naïve approach to this problem would entail the use of a nested loop join. An outer loop would iterate over each stored web activity. For each activity, an inner loop would compare the user ID associated with the activity to each user ID of users that made a purchase. If a match occurs, the web activity is added to the results set. Once a match has been found, or each user ID in the set of users who made a purchase has been checked, the process repeats, moving on to the next stored web activity. The nested loop join approach is extraordinarily inefficient and thus time-consuming. In a best case scenario, where each and every stored web activity is associated with a user who made a purchase, on average each web activity would be compared to ½ the total number of users who made a purchase. In a more realistic scenario, some, if not most, of the stored web activities will be associated with user IDs that did not make a purchase. In such a case, each activity associated with a non-purchasing user ID would be compared to the full set of user IDs that made a purchase, only to determine that the activity should not be added to the results set.
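The cost of the nested loop join can be seen in a minimal sketch. The function name, the `(user_id, page)` tuple layout, and the comparison counter are assumptions made for illustration, as is the choice to stop the inner loop on the first match.

```python
def nested_loop_join(activities, purchaser_ids):
    """Filter activities to those whose user ID appears in purchaser_ids.

    `activities` is a list of (user_id, page) tuples (assumed layout).
    Returns the matching activities and the number of ID comparisons made.
    """
    results = []
    comparisons = 0
    for user_id, page in activities:        # outer loop: every stored activity
        for purchaser in purchaser_ids:     # inner loop: every purchasing user
            comparisons += 1
            if user_id == purchaser:
                results.append((user_id, page))
                break                       # assumed: stop at the first match
    return results, comparisons

# Illustrative data only.
activities = [("u1", "/home"), ("u2", "/cart"), ("u3", "/about"), ("u1", "/checkout")]
purchaser_ids = ["u1", "u2"]

matches, comparisons = nested_loop_join(activities, purchaser_ids)
print(matches)      # [('u1', '/home'), ('u2', '/cart'), ('u1', '/checkout')]
print(comparisons)  # 6
```

The activity for the non-purchasing user `u3` is compared against every purchaser ID before being rejected, which is the behavior that makes this approach scale with the product of the two set sizes.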
An alternative approach that is slightly more efficient would be to use a lookup structure to perform a hash join. Each user ID that made a purchase could be loaded into a lookup structure, such as a hash table. The user ID for each web activity can then be hashed to determine whether the user ID is contained in the hash table. The hash join approach has its own disadvantages. One problem occurs when a large number of users have made a purchase. The hash table itself may not fit into memory, requiring expensive and inefficient swapping of portions of the hash table into and out of memory. Furthermore, there is the complexity of properly sizing the hash table. Choosing too large a size results in a sparsely populated table, which could lead to the memory problems discussed above. Choosing too small a size results in large numbers of user IDs hashing to the same hash bucket, which would then require further processing of all the entries contained in that hash bucket.
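A hash join of this kind might be sketched as follows. Python's built-in `set` stands in for the hash table here, so sizing and collision handling are delegated to the runtime rather than shown; the function name and tuple layout are assumptions for illustration.

```python
def hash_join(activities, purchaser_ids):
    """Filter (user_id, page) activities using a hash-based lookup.

    The whole lookup structure is held in memory, which is the
    limitation described above when the purchaser set is very large.
    """
    lookup = set(purchaser_ids)  # build phase: load purchaser IDs into a hash set
    # Probe phase: one hash lookup per stored activity.
    return [(user_id, page) for user_id, page in activities if user_id in lookup]

# Illustrative data only.
activities = [("u1", "/home"), ("u2", "/cart"), ("u3", "/about"), ("u1", "/checkout")]
purchaser_ids = ["u1", "u2"]

print(hash_join(activities, purchaser_ids))
# [('u1', '/home'), ('u2', '/cart'), ('u1', '/checkout')]
```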
Yet another approach involves sorting both the web usage activity and the user IDs of those users who made a purchase. This may be referred to as a sort merge join. The algorithm for performing a sort merge join is complex and inefficient. Although the sort merge join may provide some efficiencies in the matching stage, the processing required to sort is excessive. When the set of stored web activities is large, as it is expected to be in web analytics applications, the sorting step alone becomes prohibitively expensive.
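For comparison, a sort merge join over the same kind of data could look like the sketch below. The function name and the `(user_id, page)` tuple layout are assumptions for illustration; the two calls to `sorted` represent the sorting cost discussed above.

```python
def sort_merge_join(activities, purchaser_ids):
    """Filter (user_id, page) activities by sorting both inputs and merging."""
    # Sort phase: this is the expensive step when the activity set is large.
    sorted_activities = sorted(activities, key=lambda a: a[0])
    sorted_ids = sorted(set(purchaser_ids))

    # Merge phase: advance two cursors in step through the sorted inputs.
    results = []
    i = j = 0
    while i < len(sorted_activities) and j < len(sorted_ids):
        user_id = sorted_activities[i][0]
        if user_id < sorted_ids[j]:
            i += 1      # activity's user is not a purchaser; skip it
        elif user_id > sorted_ids[j]:
            j += 1      # no more activities for this purchaser; advance
        else:
            results.append(sorted_activities[i])
            i += 1      # keep j: the same user may have more activities
    return results

# Illustrative data only.
activities = [("u1", "/home"), ("u2", "/cart"), ("u3", "/about"), ("u1", "/checkout")]
purchaser_ids = ["u2", "u1"]

print(sort_merge_join(activities, purchaser_ids))
# [('u1', '/home'), ('u1', '/checkout'), ('u2', '/cart')]
```

Note that the results come back ordered by user ID rather than in the original activity order, a side effect of the sort step.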
Embodiments of the present disclosure provide systems and methods for efficient filtering of usage activity based on segments.