1. Field of Invention
The present invention relates generally to the field of web usage mining. More specifically, the present invention is related to analysis of access logs (e.g., web access logs) to provide insight into user behaviors.
2. Discussion of Prior Art
Enterprise-level Web analytics tools that transform Web log data into valuable e-business intelligence are becoming increasingly important since they provide a clear picture of the overall health and integrity of any e-business infrastructure. As a result, Web usage mining—the application of data mining techniques to discover usage patterns from Web log data—has been an active area of research and commercialization. By capturing, analyzing, storing, and reporting on web site usage, such tools provide essential metrics on visitor site interactions and the site's overall performance. This insight is often used to optimize the site for increased customer loyalty and e-business effectiveness. Usage characterization, Web site performance improvement, personalization, adaptive site modification, and market intelligence are some of the applications of Web usage mining as described in the articles entitled “Discovery of Interesting Usage Patterns from Web Data” and “Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data,” both by Cooley et al.
Identification of user interests, understanding user behavior, and tracking the popularity of pages are key ingredients to being successful in a competitive eCommerce marketplace. Web logs are studied and analyzed to indicate where a decrease in investment or possible change in web navigation should occur due to less visited Web or product pages. In commercial products, Web site effectiveness is frequently measured by correlating Web usage and traffic information with performance and availability metrics.
Path analysis is the basis of many Web analytics tools; its goal is to help understand a visitor's navigation of a web site. In its simplest form, path analysis produces the list of pages that a visitor traverses in one visit. While this provides the exact, complete path for each visitor, it may not provide useful insights into visitor behavior. Therefore, various modifications of path analysis have been proposed, such as focused path analysis (a limited list of pages, in order, that a visitor traverses in arriving at or departing from a particular page). Further enhancements include grouping site URLs and performing path analysis on these groups rather than on individual URLs. Ultimately, path analysis serves to classify visits as “success” or “failure” against business objectives such as making a sale, and can be the basis of web site redesign. Another technique for gaining insight into behavior is to look at the most popular behavior, i.e., tracking the frequency of each URL or group to understand flow.
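By way of illustration only, the focused path analysis and URL grouping described above can be sketched in Python as follows. The function names, the sample sessions, and the rule of grouping URLs by their first path segment are assumptions made for this sketch and are not taken from any cited tool.

```python
from collections import Counter


def focused_paths(sessions, target, window=2, group=lambda url: url):
    """Count the (up to) `window` pages visited just before each hit on `target`.

    `group` maps a raw URL to the label used for analysis, so the same
    function works on individual URLs or on groups of URLs.
    """
    counts = Counter()
    for pages in sessions:
        labels = [group(p) for p in pages]
        for i, label in enumerate(labels):
            if label == target:
                counts[tuple(labels[max(0, i - window):i])] += 1
    return counts


def by_section(url):
    # Hypothetical grouping rule: "/products/tv" -> "/products"
    return "/" + url.strip("/").split("/")[0]


# Hypothetical sessions, one ordered list of URLs per visit
sessions = [
    ["/home", "/products/tv", "/cart", "/checkout"],
    ["/home", "/search", "/products/tv", "/cart", "/checkout"],
    ["/home", "/products/radio", "/cart"],
]

# Paths of length two by which visitors arrive at the cart page
arrivals = focused_paths(sessions, "/cart", window=2, group=by_section)
```

Because the product pages are grouped into a single "/products" label, the first and third sessions contribute to the same arrival path even though they involve different products.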
Pattern discovery from Web logs draws upon methods and algorithms developed from several fields such as statistics, data mining, machine learning and pattern recognition as described in the articles entitled “Fast Algorithms for Mining Association Rules,” by Agrawal et al.; “Mining Sequential Patterns,” by Agrawal et al.; “From Data Mining to Knowledge Discovery: An Overview,” by Fayyad et al.; “Data Mining: An Overview from Database Perspective,” by Chen et al.; “From User Access Patterns to Dynamic Hypertext Linking,” by Jacobsen et al.; and “Towards On-Line Analytical Mining in Large Databases,” by Jiawei Han. Statistical techniques are the most common method to extract knowledge about visitors to a Web site. By analyzing the session file, one can perform different kinds of descriptive statistical analyses (frequency, mean, median, etc.) on variables such as page views, viewing time and length of a navigational path. Many Web traffic analysis tools produce a periodic report containing statistical information such as the most frequently accessed pages, average view time of a page or average length of a path through a site. Although it lacks depth, this type of knowledge can be potentially useful for improving the system performance, enhancing the security of the system, facilitating the site modification task, and providing support for marketing decisions. Some examples of commercial products based on this type of analysis are Netperceptions®, Netzero®, Surfaid analytics, Truste: Building a Web you can believe in, and Webtrends® log analyzer.
Association rule generation can be used to relate pages that are most often referenced together in a single server session. Association rules refer to sets of pages that are accessed together with a support value exceeding some specified threshold. For example, association rule discovery using the a-priori algorithm described in the article entitled “Fast Algorithms for Mining Association Rules,” by Agrawal et al., may reveal a correlation between users who visit a page containing electronic products and those who access a page about sporting equipment. However, with association rule discovery, the notion of a transaction for market-basket analysis does not take into consideration the order in which items are selected. The technique of sequential pattern discovery attempts to find inter-session patterns such that the presence of a set of items is followed by another item in a time-ordered set of sessions or episodes. By using this approach, Web marketers can predict future visit patterns, which is helpful in placing advertisements aimed at certain user groups.
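The notion of support can be illustrated with a minimal Python sketch. It performs only the pair-counting step and is not the a-priori algorithm of Agrawal et al.; the function name, sample sessions, and threshold are assumptions, and the threshold is applied inclusively here as a simplification.

```python
from collections import Counter
from itertools import combinations


def frequent_page_pairs(sessions, min_support):
    """Return page pairs whose support (fraction of sessions containing
    both pages) is at least `min_support`."""
    n = len(sessions)
    counts = Counter()
    for pages in sessions:
        # A session is treated as an unordered set, so visit order is ignored.
        for pair in combinations(sorted(set(pages)), 2):
            counts[pair] += 1
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}


# Hypothetical sessions
sessions = [
    ["/electronics", "/sports", "/home"],
    ["/electronics", "/sports"],
    ["/electronics", "/garden"],
    ["/sports", "/home"],
]

pairs = frequent_page_pairs(sessions, min_support=0.5)
```

Converting each session to a set makes explicit the limitation noted above: two sessions visiting the same pages in opposite orders are indistinguishable to this kind of analysis.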
Other types of temporal analysis that can be performed on sequential patterns include trend analysis, change point detection, or similarity analysis as described in the article entitled “Mining Sequential Patterns,” by Agrawal et al. Dependency modeling is another useful pattern discovery task in Web Mining. The goal in dependency modeling is to develop a model capable of representing significant dependencies among the various variables in the Web domain. There are several probabilistic learning techniques that can be employed to model the browsing behavior of users. Such techniques include Hidden Markov Models and Bayesian Belief Networks as described in articles entitled “Link Prediction and Path Analysis Using Markov Chains,” by R. R. Sarukkai, and “On Learning Video Browsing Behavior from User Interactions,” by Westphal et al. The article entitled “The Link Prediction Problem for Social Networks,” by Kleinberg et al., develops approaches to link prediction based on measures of the proximity of nodes in a network.
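As a simplified illustration of probabilistic modeling of browsing behavior, the following Python sketch estimates a first-order Markov chain from sessions and uses it to suggest the most likely next page. It is a plain Markov chain rather than a Hidden Markov Model or Bayesian Belief Network, and the function names and sample data are assumptions made for this sketch.

```python
from collections import Counter, defaultdict


def transition_probabilities(sessions):
    """Estimate P(next page | current page) from consecutive page pairs."""
    counts = defaultdict(Counter)
    for pages in sessions:
        for current, nxt in zip(pages, pages[1:]):
            counts[current][nxt] += 1
    return {
        page: {nxt: c / sum(followers.values()) for nxt, c in followers.items()}
        for page, followers in counts.items()
    }


def most_likely_next(model, page):
    """Return the most probable successor of `page`, or None if unseen."""
    followers = model.get(page)
    if not followers:
        return None
    return max(followers, key=followers.get)


# Hypothetical sessions
sessions = [
    ["/home", "/search", "/product"],
    ["/home", "/search", "/help"],
    ["/home", "/search", "/product", "/cart"],
]

model = transition_probabilities(sessions)
prediction = most_likely_next(model, "/search")
```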
Projects described in articles entitled “Discovery of Interesting Usage Patterns from Web Data,” by Cooley et al.; “Web Usage Mining for Web Site Evaluation,” by Spiliopoulou, M.; “Speedtracer: A Web Usage Mining and Analysis Tool,” by Wu et al.; “Discovering Web Access Patterns and Trends by Applying OLAP and Data Mining Technology on Web Logs,” by Zaiane et al.; and “Knowledge Discovery from Users Web-Page Navigation,” by Shahabi et al., have focused on Web Usage Mining in general, without a specific focus for their Web mining techniques. The SpeedTracer project makes use of referrer and agent information in the preprocessing routines to identify users and server sessions in the absence of additional client side information. The Web Utilization Miner (WUM) system as described in the article entitled “WUM: A Web Utilization Miner,” by Spiliopoulou et al., provides a robust mining language in order to specify characteristics of discovered frequent paths that are interesting to the analyst. In their approach, individual navigation paths, called trails, are combined into an aggregated tree structure.
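The idea of combining trails into an aggregated tree can be sketched, under stated assumptions, as a simple prefix tree in which each node counts the trails passing through it. This is an illustration of the general idea only and is not the data structure or mining language of the WUM system; the function names and sample trails are hypothetical.

```python
def build_aggregate_tree(trails):
    """Merge trails that share a common prefix into one counted prefix tree."""
    root = {"count": 0, "children": {}}
    for trail in trails:
        root["count"] += 1
        node = root
        for page in trail:
            node = node["children"].setdefault(page, {"count": 0, "children": {}})
            node["count"] += 1
    return root


def prefix_count(tree, prefix):
    """Return how many trails begin with the given sequence of pages."""
    node = tree
    for page in prefix:
        node = node["children"].get(page)
        if node is None:
            return 0
    return node["count"]


# Hypothetical trails
trails = [
    ["/home", "/catalog", "/item"],
    ["/home", "/catalog", "/help"],
    ["/home", "/search"],
]

tree = build_aggregate_tree(trails)
```

Storing counts at each node lets an analyst ask how many visitors followed a given prefix without rescanning the individual trails.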
A concept hierarchy, also known as taxonomy, generalizes concrete URLs into more abstract concepts. Concept hierarchies are also useful in data mining, especially for market-basket analysis as described in the article entitled “Data Mining Techniques for Marketing, Sales,” by Berry et al. The analyst groups individual products into more general concepts, with the effect of also grouping purchases of the products together. Thus, associations that are too rare among individual products become apparent when the product groups are studied.
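The effect of a concept hierarchy on rare associations can be shown with a short Python sketch. The taxonomy, product URLs, and baskets below are hypothetical, and a single-level mapping stands in for a full hierarchy.

```python
from collections import Counter
from itertools import combinations

# Hypothetical one-level taxonomy mapping concrete URLs to abstract concepts
TAXONOMY = {
    "/products/tv-a": "electronics",
    "/products/tv-b": "electronics",
    "/products/tent": "outdoor",
    "/products/stove": "outdoor",
}


def pair_counts(baskets, generalize=lambda item: item):
    """Count co-occurring pairs after mapping each item through `generalize`."""
    counts = Counter()
    for basket in baskets:
        concepts = sorted({generalize(item) for item in basket})
        counts.update(combinations(concepts, 2))
    return counts


# Hypothetical purchases
baskets = [
    ["/products/tv-a", "/products/tent"],
    ["/products/tv-b", "/products/stove"],
    ["/products/tv-a", "/products/stove"],
]

item_level = pair_counts(baskets)
group_level = pair_counts(baskets, generalize=lambda item: TAXONOMY.get(item, item))
```

At the item level every pair occurs only once, whereas at the group level the pairing of electronics with outdoor products occurs in all three baskets.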
Sequence miners as described in the article entitled “Mining Sequential Patterns,” by Agrawal et al., discover typical usage patterns by determining accesses to pages that occur frequently together in the same order. Only the designer of the site is aware of the larger tasks within which all detected patterns must be analyzed and evaluated. It would be much more efficient to automatically test the miner's results against the expectations of the designer. Therefore, enhancements need to be made in the field of miners so that more than just frequent sequences are found.
The patent to Howard et al. (U.S. Pat. No. 6,278,966 B1), assigned to International Business Machines Corporation, provides for a Method and System for Emulating Web Site Traffic to Identify Web Site Usage Patterns. It discusses a method for emulating behaviors of web site visitors for producing web site trend analysis data. Data mining association rules are applied to simulated traffic and used to identify usage patterns for users of a web site. Actions of users are tracked and reference distributions are developed that are compared to a site's usage distributions as obtained from actual visitors to the site. The reference distributions are used to implement statistical methods that measure relative information content.
The patent application publication to Tamayo et al. (2002/0083067 A1) provides for an Enterprise Web Mining System and Method. It discusses a method of enterprise web mining wherein a plurality of data mining models are generated using data that is collected from a plurality of data sources such as account or user based data, complementary external data, web server data and web transaction data. Predictions or recommendations are provided using the data mining models.
The patent to Papierniak et al. (U.S. Pat. No. 6,151,601), assigned to NCR Corporation, provides for a Computer Architecture and Method for Collecting, Analyzing and/or Transforming Internet and/or Electronic Commerce Data for Storage Into a Data Storage Area. It illustrates a method for effectively collecting, translating, refining, and analyzing Internet and/or electronic commerce data to provide useful marketing information. Web data is integrated with business data from a plurality of sources.
The patent to Martin et al. (U.S. Pat. No. 6,338,066 B1), assigned to International Business Machines Corporation, provides for a Surfaid Predictor: Web-Based System for Predicting Surfer Behavior. Web surfer behavior is predicted based on past surfer behavior. Multiple models of surfer behavior are generated by randomly selecting sample sessions from a web log.
The patent application publication to Lee et al. (2002/0198939 A1), assigned to International Business Machines Corporation, provides for a System and Method for Collecting and Analyzing Information about Content Requested in a Network (World Wide Web) Environment. A method for collecting, analyzing, aggregating and storing information about the content of one or more web pages served by a server on a network is discussed.
The patent application publication to McGuire (2003/0126613 A1) provides for a System and Method for Visualizing User Activity. It discusses a method for analyzing web server logs or other computer generated activity logs and converting the information contained in the logs (i.e., the log data) into a visual, audio, or audio/visual recreation of a user's accessing of a web site.
In contrast to Internet eCommerce sites that may optimize web site design to make a sale or obtain some personal information about a user, intranet Web applications have a different goal. Many corporate processes, such as procurement, human resources, travel reservations, and expense reimbursement, have a Web front that accesses, displays, and updates data on different backend servers. As an example, a global corporation such as IBM has over 1,000 Web applications supporting its business processes for its 300,000+ world-wide employees. The purpose of a web site, in this case, is to support a given process that needs to be performed in the most efficient manner. Free-form discovery of popular visitor paths is not necessarily insightful in evaluating the efficacy of such web sites. Instead, the web site is typically designed with a set of features to meet a set of requirements of the process it serves. Metrics that are relevant to evaluating such web sites are task-oriented—e.g., how effective was the site in getting the task accomplished, how long it takes to complete a specific task, what are the trends over time across different user populations in the corporation, etc.
One of the main problems with web log analysis is that a single task performed by a user is composed of accesses to multiple URLs. The same task may be performed in different ways, yet result in the same outcome. For example, upon accessing a website to buy a product, the user could reach the product description page in a number of ways and then click to buy, each of which would lead to the same final outcome: purchase of the product. Thus, the sequence of URLs that the user accesses to buy the product is one task. Hence, it is beneficial to analyze tasks rather than just the sequences of URLs that the user accesses in a session, and it is valuable to process a sequence of URLs and detect the semantics of the different tasks performed by the user.
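To make the distinction between a task and a URL sequence concrete, the following Python sketch recognizes a hypothetical purchase task regardless of the route taken to the product page. The one-letter page codes, the URLs, and the use of a regular expression as the pattern are all assumptions made for this illustration; it is not presented as the method of any cited reference.

```python
import re

# Hypothetical one-letter codes for the pages of interest
CODES = {"/home": "H", "/search": "S", "/product": "P", "/buy": "B"}

# Hypothetical purchase task: any mix of home and search pages,
# followed by the product page and then the buy page
PURCHASE = re.compile(r"[HS]*PB")


def encode(session):
    """Translate a session into a string of page codes ('?' for other pages)."""
    return "".join(CODES.get(url, "?") for url in session)


def performs_purchase(session):
    """Return True if the session contains the purchase task."""
    return PURCHASE.search(encode(session)) is not None


# Two different routes to the same outcome, and one abandoned visit
via_home = ["/home", "/product", "/buy"]
via_search = ["/search", "/product", "/buy"]
abandoned = ["/home", "/search", "/product"]
```

Both `via_home` and `via_search` are recognized as the same task even though their URL sequences differ, while `abandoned` is not.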
The references and techniques described above provide for web log and user activity analysis. However, none of them discuss the ability to define patterns that represent entire tasks of interest using a formal grammar. Whatever the precise merits, features, and advantages of the above cited references, none of them achieves or fulfills the purposes of the present invention.