The World Wide Web (“the Web”) provides a forum for obtaining information and engaging in commercial transactions. In order to provide information and/or solicit a commercial transaction via the Web, a company or other Web publisher establishes a Website. In order to establish a Website, the publisher typically connects its own server computer system to the Internet, or secures the use of a server computer system already connected to the Internet. This server executes a Web server program to deliver Web pages and associated data to users via the Internet in response to their requests. Users make such requests using client computer systems, which are often connected to the Internet via an Internet Service Provider (“ISP”).
As a diagnostic and monitoring measure, some Web server programs maintain a log of the requests that they receive and the actions that they take in response. In some situations, such logs may additionally contain a variety of other information about any type of interaction that a user or computer has with the Web server or the Website provided by the server. Although such logs can contain useful information for analyzing users' interactions with a Website or a Web server, that information can be difficult to extract from Web server log files. Such log files are typically very large, often measured in megabytes or gigabytes; they are full of extraneous information; their content is expressed in a terse form that is difficult to understand; and they are formatted in a manner that makes their content difficult to discern visually. Information about groups of Web pages commonly visited in sequence by users, such as to display information about particular products or to purchase particular products, can be especially difficult to extract from Web server log files, because the events in such a group may be interspersed among other events and spread over time, and so almost never appear near each other in a log file. An additional complication is that Websites are typically served by multiple servers, and hence the record of a single user's session of interactions with a Website could appear in multiple separate log files over the course of the session. This further complicates the reconstruction of information and the extraction of patterns of interest.
In performing reporting over web logs or over other types of usage or interaction logs, it can be useful to determine aggregated information such as total number of users, total number of transactions, etc. Often, these numbers are computed and broken down by category and by region or by some other pre-defined partition. It would also be of great use in a variety of situations to be able to determine series or sequences of events of interest that frequently occur and to know the total number of occurrences for such sequences. As used herein, a series of events is a list of consecutive events in the order of their occurrence, and a sequence of events is a list of events in the order of their occurrence but not necessarily in consecutive order. Thus, every series is a sequence, but not every sequence is a series.
As an illustrative example of series and sequences, if a user views Web page A followed by page B followed by page C followed by page D, then the ordered expressions {A,B,C,D}, {C,D}, and {C} all accurately describe series of Web viewing interaction events by that user, but the expressions {A,C,D}, {A,B,E,C,D}, {E}, and {D,C} do not accurately describe series. Similarly, the ordered expressions <A,B,C,D>, <C,D>, <C>, and <A,C,D> all accurately describe sequences of Web viewing interaction events by that user, but the expressions <A,B,E,C,D>, <E>, and <D,C> do not accurately describe sequences. As is shown, since order is relevant in a series or sequence, series {C,D} is different from series {D,C} and sequence <C,D> is different from sequence <D,C>. It should also be noted that sequences can occur multiple times in a group of related events. For example, if a user views Web page A followed by page B followed by page C followed by page B, then the sequence <A,B,B> occurs only once but the sequence <A,B> occurs twice (i.e., Web page A followed by the first viewing of Web page B, and Web page A followed by the second viewing of Web page B).
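The distinction between a series and a sequence in the example above can be sketched in a few lines of Python. This is only an illustration of the definitions, not a description of any particular facility; the function names `is_series` and `count_sequence_occurrences` are hypothetical, and non-empty patterns are assumed.

```python
from typing import Hashable, Sequence


def is_series(events: Sequence[Hashable], pattern: Sequence[Hashable]) -> bool:
    """Return True if `pattern` appears in `events` as consecutive events, in order."""
    n, m = len(events), len(pattern)
    # Slide a window of the pattern's length over the events and compare.
    return any(list(events[i:i + m]) == list(pattern) for i in range(n - m + 1))


def count_sequence_occurrences(
    events: Sequence[Hashable], pattern: Sequence[Hashable]
) -> int:
    """Count the ways `pattern` appears in `events` in order, not necessarily consecutively."""
    # ways[j] holds the number of ways to match the first j pattern events so far.
    ways = [1] + [0] * len(pattern)
    for event in events:
        # Walk the pattern backwards so one event is not used twice in a single match.
        for j in range(len(pattern), 0, -1):
            if pattern[j - 1] == event:
                ways[j] += ways[j - 1]
    return ways[len(pattern)]


viewed = ["A", "B", "C", "D"]
print(is_series(viewed, ["C", "D"]))                        # a series
print(is_series(viewed, ["A", "C", "D"]))                   # not a series
print(count_sequence_occurrences(viewed, ["A", "C", "D"]))  # but it is a sequence

revisited = ["A", "B", "C", "B"]
print(count_sequence_occurrences(revisited, ["A", "B"]))       # occurs twice
print(count_sequence_occurrences(revisited, ["A", "B", "B"]))  # occurs once
```

Counting occurrences rather than merely testing for presence reflects the observation that a sequence can occur multiple times within a group of related events.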
However, while information about series or sequences of events of interest that frequently occur would be of great use to customers, standard reports about usage and interaction events do not typically provide such information. In particular, information about series of events is not typically computed because the number of possible series grows very large and becomes unwieldy, and sequences of events are even more difficult to determine than series. As an example, with only a very small number of 10 possible interaction events, there are 10 possible sequences of length 1, 100 possible sequences of length 2, 1000 possible sequences of length 3, and the number continues to grow exponentially as 10^X possible sequences of length X (e.g., 1,000,000 possible sequences of length 6). Thus, to determine which of the possible sequences of a small length, such as 6 or fewer events, are present in a group of interaction events, it may be necessary to scan the group 1,111,110 times (10 + 100 + 1000 + 10,000 + 100,000 + 1,000,000), once for each possible sequence.
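The arithmetic in this example can be checked with a short Python sketch. The function names are hypothetical, and the sketch assumes, as the example does, that any event may appear at any position in a sequence.

```python
def possible_sequences(num_events: int, length: int) -> int:
    """Number of distinct ordered sequences of exactly `length` events."""
    # Each of the `length` positions can hold any of the `num_events` events.
    return num_events ** length


def possible_sequences_up_to(num_events: int, max_length: int) -> int:
    """Number of distinct sequences of length 1 through `max_length`."""
    return sum(possible_sequences(num_events, k) for k in range(1, max_length + 1))


# With 10 possible interaction events:
for length in range(1, 7):
    print(length, possible_sequences(10, length))

# Total candidate sequences of length 6 or fewer, i.e. the number of scans
# needed if each candidate is checked with its own pass over the events.
print(possible_sequences_up_to(10, 6))
```

The total grows by a factor of roughly the number of distinct events for each additional unit of length, which is why checking every candidate sequence separately quickly becomes impractical.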
Accordingly, an automated facility for identifying sequences of events of interest that frequently occur, such as for groups of Web pages of a Website, would have significant utility.