This invention relates to collecting and analyzing information about the content requested and provided in a networking environment. More specifically, the invention relates to collecting and analyzing information about the content requested by users and provided by a server in an e-commerce application on a network such as the Internet. In particular, the invention relates to collecting such data over a period of time in a log record which can be subsequently used in aggregation and analysis.
The Internet, fueled by the phenomenal popularity of the World Wide Web, has exhibited exponential growth over the past few years. It has gone from being a communication route primarily for scientists, researchers, and engineers to an essential information exchange vehicle for broad segments of the populace, including consumers, marketers, educators, children, and entertainers. Over one billion Web pages currently exist on the Internet, and over 40 million users read and interact with them. As the Internet's commercial value is recognized, numerous companies and organizations are experimenting with electronic commerce (also referred to as e-commerce), the buying and selling of goods, information, and services over the Internet (see, for example, IBM's Web site for Macy's at http://www.macys.com). And as more and more of these companies demonstrate the financial viability of electronic commerce, there has been increasing momentum to develop sites that transact business over the Web.
Any Web site owner needs to know whether the Web site effectively serves its intended purpose; that is, how many people visit the Web site, who these people are, what they want, and what they do at the site while they are there. This is particularly true of the domain of electronic commerce. The ability to analyze and understand traffic flow, the way customers navigate from page to page in a site, is critical for successful product marketing and sales.
The major source of user activity data available today is the Web server log. A Web server is the computer that sends World Wide Web documents to browsers upon request. The Web server log is a low-level, technical account of Web server activities and is generated by all commonly used Web servers. The Web server log consists of a file containing an entry for each Web page served, showing the IP (Internet Protocol) address of the client (the machine of the user who is visiting the Web site using an application); a timestamp, indicating the exact date and time at which the visit occurred; the URL (Uniform Resource Locator) of the requested page; the referrer URL (the URL of the page that the user clicked on to get to the current page); the browser type; and the number of bytes transferred.
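By way of illustration only, the log entry described above can be sketched as a small parser. The sketch assumes the widely used Combined Log Format, which the text does not name; the names LogEntry and parse_entry and the sample line are hypothetical.

```python
import re
from dataclasses import dataclass
from datetime import datetime

# Assumption: entries follow the Combined Log Format:
# host ident user [time] "request" status bytes "referrer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


@dataclass
class LogEntry:
    """One served page, holding the fields listed in the text (hypothetical layout)."""
    ip: str
    timestamp: datetime
    url: str
    referrer: str
    browser: str
    bytes_sent: int


def parse_entry(line: str) -> LogEntry:
    """Parse a single log line into a LogEntry, or raise ValueError."""
    m = LOG_PATTERN.match(line)
    if m is None:
        raise ValueError(f"unrecognized log line: {line!r}")
    return LogEntry(
        ip=m["ip"],
        timestamp=datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z"),
        url=m["url"],
        referrer=m["referrer"],
        browser=m["agent"],
        # A "-" in the bytes column means no body was sent.
        bytes_sent=0 if m["bytes"] == "-" else int(m["bytes"]),
    )


# Made-up sample line, for illustration only.
sample = (
    '192.0.2.10 - - [10/Oct/1999:13:55:36 -0700] '
    '"GET /catalog/item42.html HTTP/1.0" 200 2326 '
    '"http://www.example.com/index.html" "Mozilla/4.08 [en] (WinNT)"'
)
entry = parse_entry(sample)
```

Note that every field recovered this way describes the request and the address of the page, not what the page contained.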
Various commercial products and freeware packages (for example, Accrue's Insight, Andromedia's ARIA, e.g. Software's WebTrends, and Aquas's Bazaar Analyzer) use Web server logs to analyze Web server and user activities and generate reports. Examples of the kind of information that is typically reported are the number of visitors at a Web site during a given time period, the most and least frequently visited pages, the most frequent entry pages (the first page a user visits during a session at a Web site), the most frequent exit pages (the last page a user visits during a session at a Web site), and the visitor demographic breakdown based on IP address and browser type. The URLs in the Web server log often contain special user identifiers obtained by using "cookies". A cookie is a piece of information shared between a user's Web browser and a Web server, originating as a message sent by a Web server to the Web browser visiting the server's site, subsequently stored in a text file on the user's hard drive, and sent back to the server each time the browser requests a page from the server. From the sequence of URLs in the Web server log and the associated cookies, it is possible to reconstruct the URL paths that individual users traverse, and from this obtain the most frequently traversed paths through the Web site.
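The path reconstruction described above can be sketched as follows. This is a minimal illustration, not the method of any named product; it assumes each log entry has already been reduced to a hypothetical (user_id, timestamp, url) triple, with the user identifier taken from the cookie.

```python
from collections import Counter, defaultdict


def reconstruct_paths(entries):
    """Group URLs by user identifier, in time order.

    entries: iterable of (user_id, timestamp, url) triples (hypothetical layout).
    Returns a dict mapping each user_id to the tuple of URLs visited.
    """
    by_user = defaultdict(list)
    for user_id, _timestamp, url in sorted(entries, key=lambda e: e[1]):
        by_user[user_id].append(url)
    return {user: tuple(urls) for user, urls in by_user.items()}


def most_frequent_paths(paths, n=3):
    """Return the n most common full paths and how many users followed each."""
    return Counter(paths.values()).most_common(n)


# Made-up entries: two users follow the same path, a third diverges.
entries = [
    ("u1", 1, "/home"), ("u2", 2, "/home"), ("u1", 3, "/shoes"),
    ("u2", 4, "/shoes"), ("u3", 5, "/home"), ("u3", 6, "/hats"),
]
paths = reconstruct_paths(entries)
top = most_frequent_paths(paths, n=1)
```

A real analysis would also split each user's URL sequence into sessions; that step is omitted here for brevity.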
Some Web sites in the past have used clever methods for collecting more data about user behavior by using a form of URL rewriting. They append extra data about the user/requester to the URLs of the served Web pages, so that the extra data needed for their analysis appears in the server log. This method is usually used for adding user-related data (e.g., user-id and session-id).
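One possible form of such URL rewriting is sketched below. The function name tag_url and the query parameter names uid and sid are assumptions made for illustration; the text does not specify how the identifiers are encoded.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def tag_url(url: str, user_id: str, session_id: str) -> str:
    """Append user and session identifiers to a URL as query parameters.

    The parameter names "uid" and "sid" are hypothetical.
    """
    parts = urlsplit(url)
    # Keep any query parameters the link already carries.
    query = parse_qsl(parts.query, keep_blank_values=True)
    query += [("uid", user_id), ("sid", session_id)]
    return urlunsplit(parts._replace(query=urlencode(query)))


tagged = tag_url("/catalog/shoes.html?color=red", "u17", "s42")
```

Because the identifiers travel inside the requested URL, they are written to the Web server log with no change to the logging mechanism itself.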
Advertising banner services have developed an interesting way to measure not only who clicked on their banners, but who saw the banners. These figures are used to calculate not only the rate charged for the banners but also the effectiveness of the banner, known as the conversion rate. The conversion rate is found by dividing click-throughs by impressions (the number of times that the banner was served and hopefully seen). Currently the prior art is able to determine conversion rates only for specific types of advertising banners. Some Web advertising services (e.g., Real Media's Open AdStream) record impressions and click-throughs by using script programs, programs consisting mainly of strung-together commands, such as those you might issue at a command line. These services add a script program to the HTML image source tag, which points to the image displayed as the advertisement. (HTML, HyperText Markup Language, is the markup language used to create documents on the World Wide Web. Tags are commands, generally specifying how a portion of a document should be formatted; tags can also refer to the links which allow users to move from one Web page to another.) In addition, these services add a script program to the anchor tag, the HTML tag which acts as a link to the advertised site. The first script gets invoked when the advertised image is displayed and records its view; the second script gets invoked when a visitor clicks on the image (to visit the advertised site) and records the click.
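The two-script arrangement and the resulting conversion rate can be illustrated with a small in-memory sketch. The class BannerStats and its method names are hypothetical and do not describe any advertising service's actual implementation; the two recording methods stand in for the image-tag script and the anchor-tag script.

```python
from collections import Counter


class BannerStats:
    """Hypothetical in-memory tally of banner impressions and click-throughs."""

    def __init__(self):
        self.impressions = Counter()
        self.click_throughs = Counter()

    def record_impression(self, banner_id: str) -> None:
        # Stands in for the script invoked when the banner image is displayed.
        self.impressions[banner_id] += 1

    def record_click(self, banner_id: str) -> None:
        # Stands in for the script invoked when a visitor clicks the banner.
        self.click_throughs[banner_id] += 1

    def conversion_rate(self, banner_id: str):
        """Click-throughs divided by impressions, or None if never shown."""
        shown = self.impressions[banner_id]
        if shown == 0:
            return None
        return self.click_throughs[banner_id] / shown


stats = BannerStats()
for _ in range(4):
    stats.record_impression("banner-1")
stats.record_click("banner-1")
rate = stats.conversion_rate("banner-1")  # 1 click / 4 impressions
```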
One fundamental limitation of existing Web site analysis tools is that they rely solely on information in the Web server log, which is URL-based. Why isn't this enough? A URL indicates only the location of a served Web page and often very little about its content, particularly if the page in question is dynamic (generated from a database, a personalization profile, or search query) or simply no longer exists.
Business people, on the other hand, are interested in the content viewed by their audience, not the addresses of that content. What products are customers looking at? What products are they being shown? Do pages contain the products in which customers are interested? Is the style of presentation working? Is there easy access to the information the visitor is looking for? What links on each page did the visitor not click on? In an electronic commerce Web site, answers to these kinds of questions can feed back into the architecture and design of the Web site, increase its effectiveness, and thereby maximize the return on investment. Unfortunately, it is not straightforward to answer these questions for today's Web sites with existing log analysis tools.
A second limitation of conventional log analysis software products is that while they provide the click-throughs of hyperlinks, none of them can provide the impressions of hyperlinks and conversion rates as do Web advertising banner services. Unfortunately, even the method used by the Web advertising services restricts them to collecting impression data only for specific types of hyperlinks such as image-based ad banners, not for text- or form-based hyperlinks. Also, this method is costly, because the script programs need to be invoked on a hyperlink basis (one invocation for every link), as opposed to on a page basis (one invocation for every page).
Without the ability to collect, aggregate, and/or analyze detailed information about the interaction of visitors with Web content, Web designers and marketers currently rely on ad hoc knowledge of a few experts in the area (e.g. creative designers). The current dependence on a few human experts for Web site design and management is evidence that it is more of an art than a science, and that there are not sufficient systems or tools for it. This method is expensive, inefficient, faulty, and subjective. It is often seen that experts express contradictory opinions about the same Web site design.
An object of this invention is an improved system and method for logging information about Web requesters and content of Web pages served by a server on a network, particularly a server on the World Wide Web.
The present invention is a computer system and method for collecting, analyzing, aggregating, and storing information about the content of one or more Web pages served by a server on a network. In a preferred embodiment, the server is on the World Wide Web and is performing an e-commerce function such as hosting a store that sells products or services.
The server has one or more central processing units, one or more memories, and one or more network interfaces connected to one or more networks. A server process is executed by one or more of the central processing units and receives one or more requests for one or more Web pages from one or more requesters connected to the network. The requests enter the server through one or more of the network interfaces. Upon receiving the request, the server produces each requested Web page from one or more memories, serves Web pages to the requester, and continues until all requested Web pages have been served. The Web pages have one or more content elements (blocks of text, images, and/or hyperlinks that provide specific information about predefined areas of interest), in addition to one or more metadata entries (tags in a meta language that categorize the content elements of a Web page). One or more of the metadata entries are associated with the content elements of the respective Web page produced. In a preferred embodiment, each metadata entry has an entry type and an entry value.
The system creates and maintains a log having a plurality of records. Each record has one or more requester fields and one or more metadata fields.
A logger process is executed by the server process. The logger process stores the metadata entries contained in each of the Web pages in one or more of the metadata fields, and stores a requester identification, associated with the requester, in the requester field of the record associated with the respective Web page.
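A minimal sketch of such a logger process follows. It assumes, purely for illustration, that the metadata entries are carried as HTML meta tags whose name attribute is the entry type and whose content attribute is the entry value; the names MetaExtractor, LogRecord, and log_page are likewise hypothetical.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect (entry type, entry value) pairs from meta tags (assumed encoding)."""

    def __init__(self):
        super().__init__()
        self.entries = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attributes = dict(attrs)
            if "name" in attributes and "content" in attributes:
                self.entries.append((attributes["name"], attributes["content"]))


@dataclass
class LogRecord:
    """One log record: a requester field plus the page's metadata fields."""
    requester_id: str
    metadata: list


def log_page(log: list, requester_id: str, page_html: str) -> LogRecord:
    """Store the served page's metadata entries and requester id as a new record."""
    extractor = MetaExtractor()
    extractor.feed(page_html)
    record = LogRecord(requester_id=requester_id, metadata=extractor.entries)
    log.append(record)
    return record


# Made-up page, for illustration only.
page = (
    '<html><head>'
    '<meta name="product" content="running-shoe-42">'
    '<meta name="category" content="footwear">'
    '</head><body>...</body></html>'
)
log = []
log_page(log, "requester-17", page)
```

In this sketch the log is an in-memory list; a persistent store would serve equally well.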
In a preferred embodiment, an aggregation process traverses the log to extract one or more of the metadata fields. A counter set, generated by the aggregation process, has a plurality of counters. Each counter has a counter object (or counter type), a counter event, and a counter value. For instance, a counter keeps track of how many times a particular hyperlink, Web page, product, and/or product category was seen or selected (event) by requesters. A conversion rate set, also generated by the aggregation process, has a plurality of rates, where each rate has a rate object, a rate event and a rate value. For instance, the conversion rate set might track how many times a product/product category was selected with respect to the number of times a particular Web page or hyperlink related to the selected product/product category was seen.
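The aggregation step can be sketched as below, under stated assumptions: each log record is represented as a dict whose "metadata" field holds (object, event) pairs, with the event being either "seen" or "selected". This record layout and the function names are hypothetical simplifications, not the claimed structure.

```python
from collections import Counter


def build_counter_set(log):
    """Traverse the log and count each (counter object, counter event) pair."""
    counters = Counter()
    for record in log:
        for obj, event in record["metadata"]:
            counters[(obj, event)] += 1
    return counters


def build_rate_set(counters, seen="seen", selected="selected"):
    """For each object that was seen, compute selections divided by views."""
    rates = {}
    for (obj, event), value in counters.items():
        if event == seen and value > 0:
            rates[obj] = counters.get((obj, selected), 0) / value
    return rates


# Made-up log: product:42 is seen twice and selected once; product:7 is seen once.
log = [
    {"requester": "u1", "metadata": [("product:42", "seen"), ("product:7", "seen")]},
    {"requester": "u1", "metadata": [("product:42", "selected")]},
    {"requester": "u2", "metadata": [("product:42", "seen")]},
]
counters = build_counter_set(log)
rates = build_rate_set(counters)
```

Because the counts are taken per page record rather than per hyperlink invocation, one pass over the log yields impressions, selections, and conversion rates for every object that carries metadata.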