The present invention is generally related to data mining and, in particular, to a method and system for efficiently mining web log records (WLRs).
Commercial web sites typically generate large volumes of web log records (WLRs) on a daily basis. Collecting and mining WLRs from e-commerce web sites has become increasingly important for targeted marketing, promotions, and traffic analysis. Because an active web site may generate hundreds of millions of WLRs daily, any web-related data mining application must deal with huge data volumes and high data flow rates.
These WLRs can be collected and mined to extract customer behavior patterns, which may then in turn be used for a variety of business purposes. These business purposes can include, for example, making product recommendations, designing marketing campaigns, or re-designing a web site. In order to support fine-grained analysis, such as determining individual users' access profiles, these data mining applications must handle huge, sparse data cubes that are defined over very large-sized dimensions. For example, there may be hundreds of thousands of visitors to a particular site of interest, and tens of thousands of pages associated with the web site of interest.
Numerous commercial tools are available for analyzing WLRs and records from other data sources and generating reports for business managers. Two examples of such commercial tools are the WebTrends product (see http://www.webtrends.com) and the NetGenesis product (see http://www.netgenesis.com). Unfortunately, these prior art tools have several disadvantages. First, these prior art tools typically provide only a fixed set of pre-configured reports. Second, these prior art tools have limited on-line analytical capabilities. Third, these prior art tools do not support more sophisticated data mining operations, such as customer profiling or association rules.
The inventors have proposed the use of on-line analytical processing (OLAP) tools to support complex, multi-dimensional and multi-level on-line analysis of large volumes of data stored in data warehouses. For example, in a paper entitled "A Distributed OLAP Infrastructure for E-Commerce", written by Q. Chen, U. Dayal, M. Hsu, Proc. Fourth IFCIS Conference on Cooperative Information Systems (CoopIS'99), United Kingdom 1999, a scalable framework is described that is developed on top of an Oracle-8 based data warehouse and a commercially available multi-dimensional OLAP server, Oracle Express. This scalable framework is used to develop applications for analyzing customer calling patterns from telecommunication networks and shopping transactions from e-commerce sites.
It is desirable to implement a Web access analysis engine on this framework to support the collection and mining of WLRs at the high data volumes that are typical of large commercial Web sites. Unfortunately, there are several challenges (e.g., performance and functionality problems) that must be addressed before such a web access analysis engine can be implemented.
One challenge is how to handle the processing of very large, very sparse data cubes. While a data warehouse/OLAP framework is capable of dealing with huge data volumes, the OLAP framework does not guarantee that the summarization and analysis operations can scale to keep up with the input data rates. Specifically, Web access analysis introduces a number of fine-grained dimensions that result in very large, very sparse data cubes. These very large, very sparse data cubes pose serious scalability and performance challenges to data aggregation and analysis, and more fundamentally, to the use of OLAP for such applications.
While OLAP servers generally store sparse data cubes quite efficiently, they generally do not roll up these sparse data cubes very efficiently. For example, while most MOLAP and ROLAP engines provide efficient mechanisms for caching and storing sparse data cubes, the engines lack efficient mechanisms for rolling up such cubes. As illustrated in the example set forth herein below, the time required for prior art OLAP engines to roll up a large sparse data cube can be prohibitively long. For example, that processing time can far exceed the minimum time between the receipt of a first data set and the receipt of a new data set. As can be appreciated, if the time needed to process and summarize the first data set exceeds the time between the receipt of the first data set and the receipt of the new data set, the system can never keep up with the new data.
For example, in one application, a newspaper Web site received 1.5 million hits a week against pages that contained articles on various subjects. The newspaper wanted to profile the behavior of visitors from each originating site at different times of the day, including their interest in particular subjects and which referring sites they were clicking through. The data is modeled by using four dimensions: IP address of the originating site (48,128 values), referring site (10,432 values), subject URI (18,085 values), and hour of day (24 values). The resulting cube contains over 200 trillion cells, far more than the 1.5 million hits received in a week, which clearly indicates that the cube is extremely sparse. Each of the dimensions participates in a 2-level or 3-level hierarchy. To roll up such a cube along these dimension hierarchies by using the regular roll-up operation supported by the OLAP server requires an estimated 10,000 hours (i.e., more than one year) on a single Unix server. As can be appreciated, the processing time required is unacceptable for the application.
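The figures in the example above follow directly from the dimension cardinalities. The short Python sketch below reproduces that arithmetic. It assumes, purely for illustration, that the cube covers one week of traffic, so that at most one cell is populated per hit; the variable names are illustrative and are not taken from the application described.

```python
from math import prod

# Dimension cardinalities quoted in the example above.
dimension_sizes = {
    "ip_address": 48_128,
    "referring_site": 10_432,
    "subject_uri": 18_085,
    "hour_of_day": 24,
}

# Total number of cells in the full (dense) cube.
total_cells = prod(dimension_sizes.values())

# Assumption: the cube holds one week of traffic, so at most one cell
# can be populated per hit.
weekly_hits = 1_500_000
max_density = weekly_hits / total_cells

# Estimated roll-up time compared with the weekly arrival interval.
rollup_hours = 10_000
hours_per_week = 7 * 24
rollup_weeks = rollup_hours / hours_per_week

print(f"total cells: {total_cells:,}")
print(f"maximum fraction of populated cells: {max_density:.2e}")
print(f"roll-up time expressed in weeks of incoming data: {rollup_weeks:.1f}")
```

Under these assumptions the product comes to roughly 218 trillion cells, fewer than one cell in a hundred million can be populated, and a 10,000-hour roll-up spans about 59.5 weeks of newly arriving data.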
Accordingly, mechanisms are desired that can efficiently summarize data without having to roll-up sparse data cubes. Unfortunately, the prior art approaches fail to offer these mechanisms.
Based on the foregoing, a significant need remains for a system and method for efficiently analyzing web log records.
According to one embodiment of the present invention, a method for analyzing web access is provided. First, a plurality of web log records is received. Next, multi-dimensional summary information is generated based on the web log records. Then, derivation and analysis are performed to discover usage patterns and rules for supporting business intelligence by using the multi-dimensional summary information.
According to another embodiment of the present invention, a system for analyzing web access is provided. The system has a source of web log records and an OLAP engine. When executing a web access analysis program, the OLAP engine receives a plurality of web log records, generates multi-dimensional summary information based on the web log records, and performs derivation and analysis to discover usage patterns and rules for supporting business intelligence by using the multi-dimensional summary information.
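As a minimal sketch of the receive-and-summarize steps described above, the following Python fragment counts hits per combination of dimension values and stores only the populated cells. The record layout, the choice of dimensions, and the names WebLogRecord and summarize are assumptions made for illustration; they are not the record format or the OLAP engine interface of the described system.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class WebLogRecord(NamedTuple):
    # Hypothetical, simplified record layout (assumed for illustration).
    ip_address: str
    referring_site: str
    subject_uri: str
    hour_of_day: int


def summarize(records: Iterable[WebLogRecord]) -> Counter:
    """Count hits per (ip, referrer, uri, hour) cell.

    Only populated cells are stored; every other cell is implicitly zero.
    """
    volume = Counter()
    for record in records:
        cell = (
            record.ip_address,
            record.referring_site,
            record.subject_uri,
            record.hour_of_day,
        )
        volume[cell] += 1
    return volume


sample_records = [
    WebLogRecord("10.0.0.1", "search.example", "/sports/a1", 9),
    WebLogRecord("10.0.0.1", "search.example", "/sports/a1", 9),
    WebLogRecord("10.0.0.2", "portal.example", "/news/b7", 14),
]
volume_cube = summarize(sample_records)
print(len(volume_cube), "populated cells")
```

Keeping only populated cells reflects the sparsity discussed in the background: storage grows with the number of distinct combinations actually observed, not with the product of the dimension sizes.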
Preferably, the web access analysis program includes a feature ranking facility for generating multilevel and multidimensional feature ranking cubes for ranking web access along multiple dimensions and at multiple levels. For example, the feature ranking facility generates a first cube for a ranked list of elements of a particular dimension, where a feature is represented by a dimension, and a second cube for one of a volume and a probability distribution corresponding to the ranked list of elements of that dimension.
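The pairing of a ranked list with a matching distribution can be illustrated with a small Python sketch. It assumes a two-dimensional volume summary keyed by (visitor, subject) and ranks subjects separately for each visitor; that keying, the tie-breaking rule, and the name rank_feature are assumptions for illustration and do not describe the feature ranking facility itself.

```python
from collections import defaultdict


def rank_feature(volume):
    """Rank subjects per visitor.

    volume maps (visitor, subject) -> hit count.
    Returns two dictionaries keyed by visitor:
      ranked[visitor]       -> subjects ordered by descending hit count
      distribution[visitor] -> share of hits for each ranked subject
    """
    per_visitor = defaultdict(dict)
    for (visitor, subject), hits in volume.items():
        per_visitor[visitor][subject] = per_visitor[visitor].get(subject, 0) + hits

    ranked, distribution = {}, {}
    for visitor, counts in per_visitor.items():
        # Sort by descending count; break ties alphabetically (an assumption).
        ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
        total = sum(counts.values())
        ranked[visitor] = [subject for subject, _ in ordered]
        distribution[visitor] = [hits / total for _, hits in ordered]
    return ranked, distribution


sample_volume = {
    ("visitor_a", "sports"): 3,
    ("visitor_a", "news"): 1,
    ("visitor_b", "news"): 2,
}
ranked, distribution = rank_feature(sample_volume)
print(ranked["visitor_a"], distribution["visitor_a"])
```

In this sketch the two returned structures are aligned position by position, so the i-th entry of the distribution gives the share of hits for the i-th ranked element.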
The web access analysis program can also include a correlation analysis facility for performing correlation analysis on the summary information to generate association rules for use in web access analysis. For example, the correlation analysis facility can generate multilevel association rules with flexible base and dimensions or time-variant association rules.
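For readers unfamiliar with association rules, the sketch below computes the two standard measures of a rule "X implies Y", namely support and confidence, from summarized access data. It assumes the base of the rule is the set of visitors and that the items are page subjects; these choices and the name association_rule are illustrative assumptions, not the correlation analysis facility's actual interface.

```python
def association_rule(subjects_by_visitor, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent.

    subjects_by_visitor maps each visitor to the set of subjects accessed.
    support    = visitors accessing both subjects / all visitors (the base)
    confidence = visitors accessing both subjects / visitors accessing antecedent
    """
    base = len(subjects_by_visitor)
    with_antecedent = sum(
        1 for subjects in subjects_by_visitor.values() if antecedent in subjects
    )
    with_both = sum(
        1
        for subjects in subjects_by_visitor.values()
        if antecedent in subjects and consequent in subjects
    )
    support = with_both / base if base else 0.0
    confidence = with_both / with_antecedent if with_antecedent else 0.0
    return support, confidence


sample_access = {
    "visitor_a": {"sports", "news"},
    "visitor_b": {"sports"},
    "visitor_c": {"news"},
    "visitor_d": {"sports", "news"},
}
support, confidence = association_rule(sample_access, "sports", "news")
print(f"support={support:.2f} confidence={confidence:.2f}")
```

Changing which population is counted in the denominator, for example visitors from one referring site or hits within one hour of the day, changes the base of the rule, which is one way to read the flexible base mentioned above.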
The web access analysis program can also include a direct binning facility for concurrently generating a volume cube based on the plurality of web log records and directly generating a high diagonal cube based on the plurality of web log records.