The Internet is a vast collection of computing resources, interconnected as a network, from sites around the world. It is used every day by millions of individuals. The World Wide Web (referred to herein as the “Web”) is that portion of the Internet that uses the Hypertext Transfer Protocol (“HTTP”) as a protocol for exchanging messages. (Alternatively, the “HTTPS” protocol, a security-enhanced version of HTTP, can be used.)
A user of the Internet typically accesses and uses the Internet by establishing a network connection through the services of an Internet Service Provider (ISP). An ISP provides computer users the ability to dial a telephone number using their computer modem (or other connection facility, such as satellite transmission), thereby establishing a connection to a remote computer owned or managed by the ISP. This remote computer then makes services available to the user's computer. Typical services include: a search facility to search throughout the interconnected computers of the Internet for files of interest to the user; a browse capability for displaying information files located with the search facility; and an electronic mail facility, with which the user can send and receive mail messages from other computer users.
The HTTP communications protocol uses a request/response paradigm, where the electronic messages sent between communicating computers can be categorized as either requests for information or responses to those requests.
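The request/response paradigm can be illustrated with a minimal sketch in Python. The helper names build_request and parse_response, the host www.example.com, and the sample response text are all assumptions chosen for illustration; no network connection is made.

```python
def build_request(host, path):
    """Format a minimal HTTP/1.1 GET request message (illustrative helper)."""
    return f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"


def parse_response(raw):
    """Split a raw HTTP response into (status code, headers, body)."""
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    # The status line looks like "HTTP/1.1 200 OK"; the code is the second field.
    status_code = int(status_line.split(" ", 2)[1])
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_code, headers, body


# Hypothetical request for a page on an example host.
request = build_request("www.example.com", "/index.html")

# Hand-written sample of what a server might send back.
sample_response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<html><body>Hello</body></html>"
)
status, headers, body = parse_response(sample_response)
```

Each such request and its response form one self-contained exchange; nothing in either message refers to any earlier exchange.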
The user working in a Web environment will have software running on his or her computer to allow him or her to create and send requests for information onto the Internet, and to see the results. These functions are typically combined in a software package that is referred to as a “Web browser”, or “browser”. After the user has created a request using the browser, the request message is sent out into the Internet (typically, via an ISP as described above). The target of the request message is one of the interconnected computers in the Internet network. That computer receives the message, attempts to find the data satisfying the user's request, formats that data for display with the user's browser, and returns the formatted response to the browser software running on the user's computer.
This is an example of a client-server model of computing, where the computer at which the user requests information is referred to as the client or client machine, and the computer that locates the information and returns it to the client is the server or server machine. In the Web environment, the server is referred to as a “Web server”.
Content on the Web is commonly served in individual files in the form of HTML pages. HTML (Hypertext Markup Language) is a Web content formatting language specifically designed for a distributed network such as the Internet. An HTML page contains HTML code, which indicates how the information content is to be displayed, as well as at least some of the actual content. Pages also typically contain references to other files where at least some of the content is contained. Web browser software is designed to issue requests for pages in the form of URLs (Uniform Resource Locators). A URL essentially is an address of a file that is accessible through the Internet. The URL includes the name of the file that is being requested and the host name or IP (Internet Protocol) address of the server on which it is to be found.
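As a brief illustration of the parts of a URL, the following Python sketch splits a made-up address using the standard urllib.parse module. The URL itself is an assumption chosen for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical URL; the host, port, and file name are made up.
url = "http://www.example.com:8080/catalog/item.html?id=42"
parts = urlsplit(url)

scheme = parts.scheme      # protocol used to fetch the file
server = parts.hostname    # server on which the file is to be found
port = parts.port          # optional port number on that server
file_path = parts.path     # name (path) of the requested file
query = parts.query        # optional extra data sent with the request
```

Here the server portion is a host name rather than a numeric address; in practice the browser resolves such a name to an IP address before sending the request.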
A user at a client machine may type a URL into an appropriate field in a GUI (Graphical User Interface) generated by the Web browser software in order to address Web pages. Another way of addressing Web pages is by hyperlinking. A hyperlink is a portion in one Web page, such as a portion of text or an image, that, when selected (such as by positioning a cursor over that portion and pressing a button on the cursor control device), automatically addresses another Web page. Thus, for example, by manipulating one's mouse to cause the screen cursor to move over a hyperlink and clicking, the page addressed by that hyperlink is accessed by the browser.
Each request is routed through the Internet to the server identified in the URL. That server then returns the requested page through the Internet to the client machine that requested it. The Web browser software reads the HTML code in the page and, if that page contains references to other files containing some of the content, the browser software sends further requests for those files. It displays the content (whether contained directly in the HTML page or in another file referenced within the HTML page) in a manner dictated by the HTML code in the page.
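The step in which the browser finds references to other files can be sketched roughly as follows, using Python's standard html.parser module. The class name ReferenceCollector, the sample page, and the base URL are assumptions for illustration; an actual browser handles many more tag types.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ReferenceCollector(HTMLParser):
    """Collects absolute URLs of files referenced by an HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.references = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # Images and scripts name their file in "src"; stylesheets in "href".
        if tag in ("img", "script") and attr_map.get("src"):
            self.references.append(urljoin(self.base_url, attr_map["src"]))
        elif tag == "link" and attr_map.get("href"):
            self.references.append(urljoin(self.base_url, attr_map["href"]))


# Hypothetical page returned by a server.
page = (
    '<html><head><link rel="stylesheet" href="/style.css"></head>'
    '<body><img src="images/logo.gif"><p>Welcome</p></body></html>'
)
collector = ReferenceCollector("http://www.example.com/shop/index.html")
collector.feed(page)
# collector.references now lists the further requests a browser would issue.
```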
Countless commercial, educational, government and other institutions operate servers containing HTML pages that are accessible to client machines via the Internet. The term “Web site” generally refers to a collection of HTML pages that are maintained on (or generated on-the-fly by) one or more servers by or on behalf of a single entity and that are related to each other in some fashion.
HTTP does not provide for maintaining any type of state information about the communications, instead treating each request/response pair as a separate and unrelated transaction. However, there are many cases for which it is desirable to associate multiple HTTP requests from a client to a server with each other so as to be able to maintain state information.
One example scenario where state information is an absolute necessity is on-line shopping, including the gathering of user profile information. In on-line shopping, a user typically accesses a seller's on-line catalog, which will be displayed to the user as some number of Web pages. Typically, the user can display a separate page of information related to each product, to read about the details of that product. Typically, each time the user requests to see a page, a separate HTTP request is sent to the Web server where the seller's product catalog is stored. When the user wishes to order a product, he indicates his selection by clicking on an “Order” button of some type using a mouse, for example. This causes another request message to be sent to the server, where the request indicates that this is an order for the particular item.
Without the ability to maintain state information, each of these requests would be treated as unrelated to the others. There would be no efficient way to collect orders for more than one item into one large order. Further, there would be no efficient way to allow the user to enter his name, address, credit card number, etc. only one time, and have that information apply to all the ordered items.
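To make the point concrete, the following is a minimal Python sketch of how a server might group separate order requests once each request carries some identifier of the user. The class name SessionStore and the identifiers used are hypothetical, and the sketch deliberately ignores how the identifier reaches the server.

```python
class SessionStore:
    """In-memory record of the items ordered under each session identifier."""

    def __init__(self):
        self._orders = {}

    def handle_order(self, session_id, item):
        # Requests carrying the same identifier are collected into one order.
        order = self._orders.setdefault(session_id, [])
        order.append(item)
        return list(order)


store = SessionStore()
first = store.handle_order("abc123", "book")
second = store.handle_order("abc123", "lamp")   # same user, second request
other = store.handle_order("xyz789", "pen")     # a different user
```

Without the identifier, the server would have no key under which to combine the two requests from the first user.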
Furthermore, it frequently is desirable to maintain state information across multiple, separate visits by a particular individual to a particular Web site. For instance, it may be desirable for a retail Web site to store all of the information that it typically gathers to process a purchase order by an individual and to associate that information with the individual every time he or she visits the Web site. The individual then will not need to re-enter the same information, such as name, credit card number, billing address, and shipping address, every time he or she visits the Web site and purchases an item.
Accordingly, ways have been developed outside of the HTTP protocol itself for maintaining such state information. One of the earliest ways developed for doing this was the use of cookies.
Cookies are small data files that a server might send to a client machine and that the client's Web browser knows to store in a designated cookie folder. A cookie contains pertinent information about the user as well as information that the browser uses to determine the particular Web site (i.e., URL) to which the cookie pertains. Thereafter, when that client machine sends an HTTP request for a Web page meeting the URL criteria set forth in the cookie, the client's Web browser software includes that cookie in the request. The purpose of cookies is to inform a server of relevant information about the particular user (or at least the particular client machine that issued the request). Cookies might contain any particular information that a Web site operator feels the need to have in order to better service its customers.
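The cookie exchange described above can be sketched in Python with the standard http.cookies module. The cookie name session_id, its value, the domain, and the helper cookie_header_for are assumptions for illustration, and the matching rule shown is a simplification of what real browsers apply.

```python
from http.cookies import SimpleCookie

# Server side: build the value of a Set-Cookie response header.
outgoing = SimpleCookie()
outgoing["session_id"] = "abc123"
outgoing["session_id"]["domain"] = "www.example.com"
outgoing["session_id"]["path"] = "/"
set_cookie_header = outgoing["session_id"].OutputString()

# Client side: the browser stores the cookie it received.
jar = SimpleCookie()
jar.load(set_cookie_header)


def cookie_header_for(jar, host, path):
    """Return the Cookie header value for a request (simplified matching)."""
    pairs = []
    for name, morsel in jar.items():
        domain = morsel["domain"]
        domain_ok = host == domain or host.endswith("." + domain)
        path_ok = path.startswith(morsel["path"] or "/")
        if domain_ok and path_ok:
            pairs.append(f"{name}={morsel.value}")
    return "; ".join(pairs)


# The cookie accompanies requests that meet its criteria and no others.
matching = cookie_header_for(jar, "www.example.com", "/catalog/item.html")
non_matching = cookie_header_for(jar, "other.example.org", "/")
```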
URL rewriting is a technology that can serve most of the same functions as cookies in situations in which cookies are disabled on a particular client machine or are otherwise undesirable or impossible to use. Briefly, in URL rewriting, the data that would have been contained in a cookie is appended to the end of the URL in the request. URL rewriting, and particularly its use as a substitute for cookies, is well known in the art.
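One possible form of URL rewriting, in which the state data is appended as query string parameters, can be sketched in Python as follows. The helper names rewrite_url and extract_state and the parameter name session_id are assumptions for illustration; other encodings of the appended data are equally possible.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def rewrite_url(url, state):
    """Append state data to the end of a URL as query string parameters."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.extend(state.items())
    return urlunsplit(parts._replace(query=urlencode(query)))


def extract_state(url, keys):
    """Recover the named state values from a rewritten URL on the server."""
    query = dict(parse_qsl(urlsplit(url).query))
    return {key: query[key] for key in keys if key in query}


# The server rewrites each link in a page before sending the page out.
rewritten = rewrite_url(
    "http://www.example.com/catalog/item.html?id=42",
    {"session_id": "abc123"},
)
# When the user follows the link, the server reads the state back.
state = extract_state(rewritten, ["session_id"])
```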
Large Web site operators may own their own server (or a server farm comprising multiple servers) dedicated to a single “Web site”. On the other hand, smaller Web site operators may farm out maintenance of their Web sites to other companies that might support multiple Web sites on a single physical server machine. These companies are commonly called Web hosts or Web hosting companies.
Many Web site operators, and particularly commercial Web site operators, have a desire to identify and attract as many persons as possible with an interest in the particular subject matter of the Web site as often as possible. One step that typically is necessary to achieve this goal is to collect personal information about the individuals that visit the Web site. Such information provides at least two avenues of attracting visits. First, personal information such as e-mail address, mailing address and telephone number enables the Web site operator to contact the individual with advertising or other information of interest. Second, a collection of demographic information about a large number of visitors to the Web site may enable a Web site operator to determine demographics of its target audience and thus better target advertising or other information to persons with similar demographic profiles.
Personal information can be collected by asking visitors to the Web site to provide personal information in an online form or questionnaire.
Personal and demographic information about individuals who visit other Web sites having a similar focus to (or a focus known to have a high demographic cross-correlation with) that of the particular Web site also can be useful in targeting advertising towards those individuals.
Many companies are willing to sell or otherwise share with other companies the personal information they gather about visitors to their Web sites.
Another aspect of attracting and keeping customers is making Web sites as convenient and attractive to users as possible so that they will be more inclined to return to the Web site. Accordingly, many Web site operators have a strong desire to keep track of the ways in which individuals utilize the Web site in order to determine which aspects of a Web site users like or dislike. Useful information in terms of making such determinations includes (1) from what other Web sites users have hyperlinked to the Web site, (2) which pages on the Web site receive the most and/or fewest hits, (3) how long users tend to view a particular page, (4) on which pages users have entered the Web site, (5) from which pages users have exited the Web site (to go to another Web site or log off the Internet altogether), and (6) the particular browser software used by visitors. This type of data is commonly termed click stream data.
Traditional log file analysis techniques can be used to gather click stream data of users of a particular Web site to develop a log of data indicating the page (or resource) requests made by Web site users in order to collect some of the aforementioned useful information.
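A rough Python sketch of such log file analysis follows. It assumes, purely for illustration, that the server writes one line per request in a common combined access-log layout (client address, time, request line, status, size, referring page, browser); the sample lines are made up.

```python
import re
from collections import Counter

# Pattern for the assumed log layout; real servers may log differently.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

sample_log = [
    '192.0.2.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" '
    '200 2326 "http://search.example.org/" "Mozilla/4.08"',
    '192.0.2.1 - - [10/Oct/2000:13:56:02 -0700] "GET /catalog.html HTTP/1.0" '
    '200 5120 "http://www.example.com/index.html" "Mozilla/4.08"',
    '192.0.2.7 - - [10/Oct/2000:14:01:10 -0700] "GET /index.html HTTP/1.0" '
    '200 2326 "-" "Mozilla/4.7"',
]

page_hits = Counter()   # which pages receive the most or fewest hits
browsers = Counter()    # which browser software visitors use
entry_pages = {}        # first page requested by each client address

for line in sample_log:
    match = LOG_PATTERN.match(line)
    if match is None:
        continue  # skip lines that do not fit the assumed layout
    page_hits[match["path"]] += 1
    browsers[match["agent"]] += 1
    entry_pages.setdefault(match["host"], match["path"])
```

Identifying a visitor by client address alone is crude, which is one reason cookies or rewritten URLs are commonly used for this purpose.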
As is well known to those of skill in the art of Web site design and Web hosting, cookies are used extensively in gathering and tracking such information. For instance, a cookie identifying the particular user (or at least the particular client machine) can be included in each request, thus allowing tracking of one's progress through a Web site. The same objectives can be accomplished using URL rewriting.
A technology called “Single-Pixel” technology has been developed that can be used to gather information similar to the information gathered through traditional log file analysis. With Single-Pixel technology, tags can be embedded in an HTML page that cause the browser at the client machines that receive that page to send click stream information in the form of cookies (or rewritten URLs) to a click stream analysis (also called a usage analyzer) engine on a server on the Web. That server typically is (but need not be) a separate server from the server of the particular Web site that is serving the content responsive to the client machine's requests. Other methods also are known for sending Single-Pixel data for collecting click stream information. Such other methods include query string parameters and hidden form data. The usage analyzer engine maintains a log containing information for each request it receives. The log entries can be analyzed and correlated to derive the aforementioned type of information.
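The query string variant mentioned above can be sketched in Python as follows. The analyzer address stats.example.net, the parameter names page, ref and vid, and both helper functions are hypothetical; the sketch shows only one way such a tag and its log entry might be formed.

```python
from html import escape
from urllib.parse import parse_qs, urlencode, urlsplit


def build_pixel_url(analyzer_url, page, referrer, visitor_id):
    """Encode click stream data as query string parameters."""
    query = urlencode({"page": page, "ref": referrer, "vid": visitor_id})
    return f"{analyzer_url}?{query}"


def build_pixel_tag(pixel_url):
    """Wrap the URL in a 1x1 image tag to be embedded in an HTML page."""
    return f'<img src="{escape(pixel_url, quote=True)}" width="1" height="1" alt="">'


def log_entry_from_request(request_url):
    """On the usage analyzer, turn a received request into a log entry."""
    params = parse_qs(urlsplit(request_url).query)
    return {name: values[0] for name, values in params.items()}


pixel_url = build_pixel_url(
    "http://stats.example.net/pixel.gif",
    page="/catalog/item.html",
    referrer="http://www.example.com/index.html",
    visitor_id="abc123",
)
tag = build_pixel_tag(pixel_url)
entry = log_entry_from_request(pixel_url)
```

When a browser renders a page containing the tag, it requests the tiny image from the analyzer, and that request itself carries the click stream data.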
Web hosting companies are particularly interested in click stream and other Web site usage information. A Web hosting company often shares the information gathered with respect to each company to which it provides Web hosting services (i.e., each of its customers) with all of its customers.
Many individuals who use the Internet find this sort of gathering of personal information and Web surfing habits offensive, or otherwise do not want such information about them to be gathered.
Accordingly, it is an object of the present invention to provide an improved method and apparatus for gathering click stream information.
It is another object of the present invention to provide a method and apparatus for gathering click stream information while preserving the privacy of the individuals from whom the information is gathered.