1. Field of the Invention
The present invention relates to a computer system, and deals more particularly with a method, system, and computer program for collecting information about user behavior in the presence of dynamic page content.
2. Description of the Related Art
Today, thousands of businesses and millions of people are using the Internet on a daily basis. The Internet is a vast collection of computing resources, interconnected as a network, from sites around the world. The World Wide Web (referred to herein as the "Web") is that portion of the Internet which uses the HyperText Transfer Protocol ("HTTP") as a protocol for exchanging messages. (Alternatively, other protocols such as the "HTTPS" protocol can be used, where this protocol is a security-enhanced version of HTTP.)
A user of the Internet typically accesses and uses the Internet by establishing a network connection through the services of an Internet Service Provider (ISP). An ISP provides computer users the ability to dial a telephone number using their computer modem (or other connection facility, such as satellite transmission), thereby establishing a connection to a remote computer owned or managed by the ISP. This remote computer then makes services available to the user's computer. Typical services include: providing a search facility to search throughout the interconnected computers of the Internet for items of interest to the user; a browse capability, for displaying information located with the search facility; and an electronic mail facility, with which the user can send and receive mail messages from other computer users.
The user working in a Web environment will have software running on his computer to allow him to create and send requests for information, and to see the results. These functions are typically combined in what is referred to as a "Web browser", or "browser". After the user has created his request using the browser, the request message is sent out into the Internet for processing. The target of the request message is one of the interconnected computers in the Internet network. That computer will receive the message, attempt to find the data satisfying the user's request, format that data for display with the user's browser, and return the formatted response to the browser software running on the user's computer.
This is an example of a client-server model of computing, where the machine at which the user requests information is referred to as the client, and the computer that locates the information and returns it to the client is the server. In the Web environment, the server is referred to as a "Web server". The client-server model may be extended to what is referred to as a "three-tier architecture". This architecture places the Web server in the middle tier and adds a third tier, which typically represents data repositories of information that may be accessed by the Web server as part of the task of processing the client's request. This three-tiered architecture recognizes the fact that many client requests do not simply require the location and return of static data, but require an application program to perform processing of the client's request in order to dynamically create the data to be returned. In this architecture, the Web server may equivalently be referred to as an "application server".
When this scenario is implemented using the Internet, the browser running on the client's machine accepts the data it will display in response to the user's request, by convention, as a data stream formatted using the HyperText Markup Language ("HTML"). HTML is a standardized notation for displaying text and graphics on a computer display screen, as well as providing more complex information presentation such as animated video, sound, etc. Because browsers expect an incoming response to be formatted using HTML, servers generate their response in that format. The browser processes the HTML syntax upon receipt of the file sent by the server, and creates a Web page according to the instructions specified by the HTML commands.
Web pages were originally created to have only static content. That is, a user requested a specific page, and the predefined contents of that page were located by a Web server and returned for formatting and display at the user's computer. To change the page content or layout, the HTML syntax specifying the page had to be edited. However, the Web is moving toward dynamic page content, whereby the information to be displayed to the user for a given page can be generated dynamically, without changing the HTML.
With dynamically-generated content, a request for the Web page stored at a given Uniform Resource Identifier ("URI") or Uniform Resource Locator ("URL") may result in a wide variety of page content being returned to the user. (References to "URL" hereinafter are intended to include URIs unless stated otherwise.) One common, simple use of dynamic page content is the "visitor counts" which are often displayed on Web pages, with text such as "You are the 123rd visitor to this site since Jan. 1, 1997" (where the count of visitors is accumulated at the server and inserted into the HTML syntax before returning the page to the user). Other simple uses include displaying the current date and time on the page. More advanced techniques for dynamic content allow servers to provide Web pages that are tailored to the user's identification and any profiles of personal information he may have created. For example, servers providing travel reservation services commonly store information about the travel preferences of each of their users, and then use this information when responding to inquiries from a particular user. Dynamic content may also be based upon user classes or categories, where one category of users will see one version of a Web page, and users in another category will see a different version--even though the same URL was used to request the Web page from the same server. For example, some Web server sites provide different services to users who have registered in some manner (such as filling out an on-line questionnaire) or users who have a membership of some type (which may involve paying a fee in order to get enhanced services, or more detailed information). The difference in dynamic content may be as simple as including the user's name in the page, as a personalized electronic greeting. Or, the dynamic content may be related to the user's past activities at this site. On-line shopping sites, for example, may include a recognition for repeat shoppers, such as thanking them for their previous order placed on some specific day.
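The kinds of dynamic content described above can be illustrated with a minimal sketch. It is not taken from any particular server product; the names `Site`, `User`, `render`, and the `/home` URL are hypothetical and chosen only for illustration. The sketch shows one URL producing different pages depending on a server-side visitor count, the user's name, and the user's membership category.

```python
from dataclasses import dataclass
from html import escape
from typing import Optional


@dataclass
class User:
    """Hypothetical visitor record; the fields are assumptions for this sketch."""
    name: str
    is_member: bool = False


class Site:
    """Toy server that builds the page for a URL at request time."""

    def __init__(self) -> None:
        self.visitor_count = 0  # accumulated at the server across requests

    def render(self, url: str, user: Optional[User] = None) -> str:
        self.visitor_count += 1
        parts = [f"<h1>Welcome to {escape(url)}</h1>"]
        if user is not None:
            # Personalized greeting based on the user's identification.
            parts.append(f"<p>Hello, {escape(user.name)}!</p>")
        if user is not None and user.is_member:
            # Content shown only to one category of users.
            parts.append("<p>Member-only details follow.</p>")
        # Visitor count inserted into the HTML before the page is returned.
        parts.append(f"<p>You are visitor number {self.visitor_count}.</p>")
        return "<html><body>" + "".join(parts) + "</body></html>"


site = Site()
anonymous_page = site.render("/home")
member_page = site.render("/home", User("Alice", is_member=True))
```

Both calls request the same URL, yet `anonymous_page` and `member_page` differ in their greeting, their member-only paragraph, and their visitor count.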
A number of techniques for providing dynamic page content exist. One such technique is use of an Active Server Page ("ASP") on a Microsoft Web server, which detects a specific command syntax in an HTML page and processes the embedded commands before returning the page to the user. Another technique is use of servlets, which are relatively small executable code objects that can be dynamically invoked by code running on the server. Servlets typically perform some specialized function, such as creating page content based on dynamic factors. Or, Dynamic Server Pages ("DSPs") may be used to create dynamic content using compiled Java on Java-aware Web servers. ("Java" is a trademark of Sun Microsystems, Inc.) CGI ("Common Gateway Interface") scripts and applications may also be used as sources of dynamic content.
Dynamic page content that is customized to an individual user is made possible by software running at a Web server which tracks visitors to the Web site. This tracking enables a Web administrator to monitor who is visiting the site, what content they request to see, how that content affects their behavior (whether they exit the site from a specific page, link from one page to another, etc.), and so forth. By monitoring visitors in this way, the server applications can provide targeted marketing and customized information to each visitor. As electronic commerce becomes more prevalent on the Web, tracking this type of user behavior information will be increasingly valuable.
Many tools exist today for monitoring user access to Web servers. These monitoring tools typically generate traces of URL requests from individual user sessions. This information is recorded in a file, database, or other repository accessible to the server applications. However, existing tools are oriented towards static page content, where tracking the URL of the request provides the ability to reconstruct what content was displayed to the user as he navigated around the site. When dynamic page content is displayed, recording the URL request flow is insufficient to provide a record of this information. As stated previously, requests to a single URL may result in very different Web page content based upon factors such as the requesting user's identity, so that storing the URL alone does not provide meaningful data for monitoring the user's behavior.
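The limitation of URL-only traces can be sketched as follows. The session identifiers, URLs, and page contents are invented sample data, and the variable names are hypothetical. The sketch only demonstrates the problem: two sessions yield identical URL traces even though one of the URLs served different content to each.

```python
from collections import defaultdict

# Hypothetical record of what was actually served: (session id, URL, page content).
served = [
    ("session-1", "/offers", "<p>Standard offers</p>"),
    ("session-2", "/offers", "<p>Member offers for Alice</p>"),
    ("session-1", "/about", "<p>About us</p>"),
    ("session-2", "/about", "<p>About us</p>"),
]

# What a URL-oriented monitoring tool keeps: the request trace per session.
url_trace = defaultdict(list)
for session, url, _content in served:
    url_trace[session].append(url)

# Distinct page contents that were served under each URL.
contents_by_url = defaultdict(set)
for _session, url, content in served:
    contents_by_url[url].add(content)

# URLs whose trace entry cannot be mapped back to a single page content.
ambiguous_urls = sorted(
    url for url, contents in contents_by_url.items() if len(contents) > 1
)
```

Here the two entries in `url_trace` are indistinguishable, while `ambiguous_urls` identifies `/offers` as a URL from which the displayed content cannot be reconstructed.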
Accordingly, a need exists for a technique by which information about user behavior in the presence of dynamic page content can be collected. A Web site monitoring tool using this collection technique to create user profiles must be able to contend with a range of dynamic page content. The proposed technique uses regular expressions to describe dynamic page content and classify pages into equivalence classes.