A primary purpose of the Internet is the distribution of information, e.g., by means of web pages collectively referred to as the web. Almost every company provides web pages to inform business partners or consumers about products or services. Conversely, companies and consumers use the Internet, and more specifically search engines, to identify suppliers and merchandise as part of electronic commerce.
A static analysis of web content and web structure is known as web content mining and web structure mining, respectively. In addition, a dynamic analysis of user interaction with web pages, known as web usage mining, reveals whether or not the provided content and structure are aligned with user interests. Web mining techniques are described in “Web Mining - Concepts, Applications and Research Directions”, by J. Srivastava et al., Chapter 3 in Foundations and Advances in Data Mining, Studies in Fuzziness and Soft Computing, volume 180, 2005, pp. 275 to 307.
The insight gained by web usage mining allows optimizing structure and content of web services. For example, the dynamic view of web usage allows a company to assess its own web pages. Web usage mining further allows comparing web services that compete for identical users, and thus contributes to studies known as Competitive Intelligence (CI).
“Web Mining from Competitors' Websites”, by X. Chin et al., KDD 2005, Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pp. 550 to 555, describes a technique for discovering patterns in web resources to identify a collection of web pages, objects or resources that are frequently accessed by groups of users with common needs or interests.
Conventionally, companies obtain data representing the web usage of their own website from log files of a web server delivering web pages of the website in response to HTTP requests. In this context, FIG. 1 shows an excerpt 100 of a typical web server log file containing information about each HTTP request 102 to 112 received by the web server.
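By way of illustration only, the extraction of request information from such a log file may be sketched as follows. The sketch assumes that the log entries follow the widely used NCSA Common Log Format, which is not necessarily the format shown in FIG. 1; the function name parse_log_line, the field names and the sample entry are hypothetical and chosen for this example.

```python
import re
from datetime import datetime

# Assumption: entries follow the NCSA Common Log Format, i.e.
# host ident user [time] "method path protocol" status size
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_log_line(line):
    """Return a dict describing one HTTP request, or None if the line does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Convert the textual fields into typed values for later aggregation.
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


# Hypothetical sample entry (documentation IP address, invented path).
sample = ('192.0.2.10 - - [10/Oct/2005:13:55:36 +0200] '
          '"GET /products/42.html HTTP/1.1" 200 2326')
entry = parse_log_line(sample)
print(entry["host"], entry["path"], entry["status"])
```

Each parsed entry carries the originating address, the time of the request and the requested resource, which are the raw inputs for the aggregation levels discussed next.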
Depending on the goals of the web usage mining analysis, the log file data is processed or aggregated at different levels. On a first level, a pageview is defined by a set of web objects requested for a user-specific event, such as reading an article, viewing a product or adding a product to a list stored on a server for electronic commerce. On a higher level of aggregation, a session is defined by a sequence of pageviews of a single user during a single visit to the website.
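The two aggregation levels may be sketched, purely as an example, as follows. The sketch rests on several assumptions that are not taken from the log file itself: each request already carries some user key (here simply the IP address), a new session starts after 30 minutes of inactivity, and requests for style sheets, scripts and images belong to the most recently requested page. The names build_sessions, SESSION_TIMEOUT and ASSET_SUFFIXES are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed inactivity threshold
ASSET_SUFFIXES = (".css", ".js", ".png", ".gif", ".jpg")  # assumed embedded objects


def build_sessions(requests):
    """Group (user_key, timestamp, path) tuples into sessions of pageviews."""
    by_user = defaultdict(list)
    for user_key, timestamp, path in requests:
        by_user[user_key].append((timestamp, path))

    sessions = []
    for user_key, hits in by_user.items():
        hits.sort()  # chronological order per user
        current = None
        last_seen = None
        for timestamp, path in hits:
            # Start a new session on the first hit or after a long pause.
            if current is None or timestamp - last_seen > SESSION_TIMEOUT:
                current = {"user": user_key, "pageviews": []}
                sessions.append(current)
            last_seen = timestamp
            if path.lower().endswith(ASSET_SUFFIXES):
                # Embedded object: attach it to the latest pageview, if any;
                # an asset requested before any page is dropped.
                if current["pageviews"]:
                    current["pageviews"][-1]["objects"].append(path)
            else:
                # Page request: open a new pageview.
                current["pageviews"].append({"page": path, "objects": [path]})
    return sessions


# Hypothetical requests from two addresses.
t0 = datetime(2005, 10, 10, 13, 0, 0)
requests = [
    ("192.0.2.10", t0, "/index.html"),
    ("192.0.2.10", t0 + timedelta(seconds=1), "/style.css"),
    ("192.0.2.10", t0 + timedelta(minutes=5), "/products/42.html"),
    ("192.0.2.10", t0 + timedelta(hours=2), "/index.html"),
    ("192.0.2.77", t0 + timedelta(minutes=1), "/index.html"),
]
sessions = build_sessions(requests)
for session in sessions:
    print(session["user"], [pv["page"] for pv in session["pageviews"]])
```

Keying the sessions on the IP address is the weakest assumption in this sketch, since the same user may appear under different addresses over time.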
It is important for the analysis to be able to follow the same user over time. However, it is difficult to identify users based on the web server log files. The Internet Protocol (IP) address from which the HTTP requests of a single user originate may be changed repeatedly by an Internet Service Provider (ISP), e.g., when the ISP uses the Dynamic Host Configuration Protocol (DHCP). The user may thus access the website with a different IP address each time, complicating the identification of the same user over time. Partial solutions for identifying the user include, e.g., browser cookies. However, not all users allow cookies in their web browser, which makes this solution unreliable in some cases and may cause a bias in the analysis.
Another conventional approach collects data directly from networks of ISPs for web usage mining. For example, the service “Experian Hitwise” aggregates data on user behavior to measure website market share. While the data from ISP networks allows observing clickstreams and user interaction with web resources, a correlation between different HTTP requests associated with a single user may still be difficult or even impossible in certain situations, e.g., since one specific landline Internet access is typically used by a plurality of different persons at the same time.