A server is a computing device that responds to requests from clients. A Web server is a server connected to the global network known as the “Internet” and responds to requests received from Web clients over the Internet. As used herein, the term “Web server” may also refer to a plurality of servers organized to handle a large number of requests for a Web server, i.e., a distributed Web server system. The term “Web site” is often used to refer to a collection of Web servers organized by a business entity, individual or organization for diverse purposes. The term derives, most likely, from the language used to access a Web server. A user is said to “go to a Web site” when the user directs his or her computer (Web client) to make a request of one of the site's Web servers and to display the response to the user, even though the user and the Web client do not physically go anywhere. The user perception is that there is a location, a “site” on the Web where this data exists, but it should be understood that the term “Web site” often refers to the Web server or servers that respond to requests from Web clients, even though “site” does not necessarily refer to the physical location of the Web servers. In fact, in many cases, the servers of a Web site might be physically distributed to avoid downtime when local power outages or network service failures occur.
The term “Web site” typically refers to a collection of pages maintained by a common maintainer for presentation to visitors, whether the collection is kept on one physical server at one physical location or is distributed over many locations and/or servers. The pages (or the data/program code needed to generate the pages dynamically) need not be created by the common maintainer of the collection of pages. In places herein, such a maintainer of the collection of pages is referred to as the Web site operator. For example, an online merchant might set up a Web server with a collection of pages created by the merchant or obtained from affiliates, suppliers, or partners of the merchant and then put hyperlinks in the pages so that a visitor can browse around the “site” as expected by the merchant. As another example, an individual dedicated to dispensing information about opera or an uncommon medical condition might set up a Web server and populate it with pages about the particular subject, including such things as references to pages outside their collection of pages, dynamically generated pages of comments made by visitors, or e-mail sent to the operator of the Web server.
Although many Web sites are targeted to single topics, some Web site operators serve many different interests and have integrated many different “properties” into a large Web site, often distributed over many servers and locations to handle traffic from a large number of visitors. “Traffic” generally refers to overall network use at a given moment, or it can refer to specific transactions, records or users in a data network, as in a Packets Per Second (PPS) measurement of Internet use. As used herein, “traffic” refers to use of a Web site or any of its pages over a given time. “Properties,” as used herein, means categories of content provided by the Web site. For example, the Yahoo! Web site (www.yahoo.com) brings together many properties of interest under one umbrella, such as a financial property (for providing stock quotes and other financial information and data), a sports property (for providing sports scores and news), an auction property, a chat property, an instant messaging property and many others. Complex sites, where visitors come for possibly unrelated properties, are often referred to as “portal sites”.
Although the typical Web site includes one or more servers that receive requests and provide responses according to the HyperText Transfer Protocol (HTTP), the description herein should not be understood as being limited to a particular protocol or a particular network. For example, the Web site might be connected to the Web clients by an intranet, wireless application protocol (WAP) network, local area network (LAN), wide area network (WAN), virtual private network (VPN) or other network arrangement. In other words, a Web site for which traffic is being monitored can be monitored independent of the protocols or network used. “Web” typically refers to “World Wide Web” (or just “the WWW”), a name given to the collection of hyperlinked documents accessible over the Internet using HTTP. As used herein, “Web” might refer to the World Wide Web, a subset of the World Wide Web, a local collection of hyperlinked pages, or the like. More generally, a Web server is a server responsive to requests received from a Web client.
Typically, requests and responses are considered “pages”. For example, with the HTTP protocol, a Web client requests a page from a Web server and the Web server responds to the request by sending a page. In the HTTP protocol, a Uniform Resource Locator (“URL”) identifies a page and that URL is presented to the Web server as part of a request for a page. The pages are often HyperText Markup Language (HTML) pages or the like. The HTML pages can be static pages, dynamic pages or a combination. Static pages are pages that are stored on the server, or in storage accessible by the server, prior to the request and are sent from storage to the client in response to a request for that page. Dynamic (“on the fly”) pages are generated, in whole or in part, upon receipt of a request. For example, where the page is a view of data from a database, a server might generate the page dynamically using rules or templates and data from the database where the particular data used depends on the particular request made.
The term “page hit” refers to an event wherein a server receives a request for a page and then serves up the page. In even a moderately sized Web site, the servers might handle millions of page hits per day. A common measure of traffic at a Web site is the number of page hits (often referred to as “page views”, especially in an advertising context) for particular pages or sets of pages. Page hit counts are a rough measure of the traffic of a Web site. More refined measures include unique visitor counts, where only one page hit is counted for each unique client for a predetermined period. Such measures work well when the traffic of interest relates to particular pages, but are generally not informative when traffic by topic is desired, because multiple pages may relate to one topic and one page may relate to multiple topics.
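The difference between raw page hit counts and unique visitor counts can be sketched as follows. This is only an illustration under stated assumptions: the log record layout (a client identifier paired with a page URL), the sample values, and the function names are hypothetical and are not drawn from any particular Web server.

```python
from collections import Counter

# Hypothetical log records for one measurement period: (client_id, page_url).
LOG = [
    ("client-a", "/stocks/XYZ"),
    ("client-a", "/stocks/XYZ"),
    ("client-b", "/stocks/XYZ"),
    ("client-b", "/sports/scores"),
]


def page_hits(log):
    """Count every served request, per page."""
    return Counter(page for _client, page in log)


def unique_visitors(log):
    """Count each client at most once per page within the logged period."""
    distinct_pairs = set(log)  # collapses repeated (client, page) pairs
    return Counter(page for _client, page in distinct_pairs)


hits = page_hits(LOG)
uniques = unique_visitors(LOG)
print(hits["/stocks/XYZ"], uniques["/stocks/XYZ"])  # prints: 3 2
```

Both counts are keyed by page, which is why neither one reveals interest in a topic that spans several pages.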
For example, where a stock information Web server serves up just a page for each stock and only one page relates to that stock, it would be a simple matter to determine levels of user interest in particular stocks by examining the server logs of the Web server to determine which stock pages are being served the most. Unfortunately, most real-world Web services are not so well defined. A more complex Web site might include servers that serve news, sports and financial content along with content on many other subjects, and pages that relate to a common topic might be served from more than one of those content components. With the requests spread over different content components, the level of user interest would not be accurately reflected in a measurement of interest in just one content component. For example, interest in a particular athletic shoe company might be expressed by traffic to pages containing news stories relating to the company, traffic to sports pages referring to the company, traffic relating to financial content about the company, searches for the company's products, purchase transactions for the company's products, etc. Also, some requests might be falsely associated with interest in the company if, for example, users use a search term that has more than one meaning, one of which relates to the name of the company.
Such a Web site might also include search capability, wherein a user submits a search request using their Web client and a Web server responds with a page that contains search results. It is a simple matter for a search engine (a Web site set up to respond to search requests) to log all of the search requests. Typically, a search request is in the form of a search phrase containing one or more search terms. Search requests can be counted by search term, e.g., count the number of times “Ford” or “sports” was used as a search word in a search phrase, but such counts have limited utility where one search term might relate to multiple topics and multiple search terms might relate to one topic.
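A minimal sketch of counting search requests by search term appears below. The sample phrases, the whitespace tokenization, and the function name are assumptions made for illustration only; a real search log would have its own format.

```python
from collections import Counter


def count_search_terms(phrases):
    """Count how many search phrases contain each term (case-insensitive)."""
    counts = Counter()
    for phrase in phrases:
        # A term is counted once per phrase, even if it is repeated in it.
        counts.update(set(phrase.lower().split()))
    return counts


# Hypothetical logged search phrases.
phrases = [
    "Ford truck",
    "Harrison Ford movies",
    "sports scores",
    "Ford stock price",
]

counts = count_search_terms(phrases)
print(counts["ford"], counts["sports"])  # prints: 3 1
```

The count of 3 for "ford" mixes phrases about a vehicle, an actor and a stock, which illustrates why per-term counts say little about interest in any one topic.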
Where page hits, search requests, or other “events” such as purchases, are logged or loggable, some operators of Web sites track statistics other than just page hits or search requests. One well-known statistic often seen in Web systems, and elsewhere, is a “top-n” list, such as a “Top Ten” list. Such a list presents the n highest requested items. For example, a newspaper might list the 40 best-selling books for a given month, ranked by industry-wide sales. The list might indicate, for each book on the list, the book's ranking for the prior measurement period. As another example, a Web site operator might include a page served by the Web server(s) that lists the top sellers for that operator.
As yet another example, a Web site operator might include a page served by the Web site that shows the top-sellers for various categories. For example, if the Web site operator is a toy retailer, the operator might create pages to be served by its system wherein the pages list the top-selling toys for infants, the top-selling toys for toddlers, the top-selling toys for teens, etc. In a variation on the basic count of items sold, some Web site operators might include statistics showing how various items are moving up or down in sales. For example, a list could be presented showing the top 40 sellers for the month along with their sales rank for the prior month, or a list ranking items in order of increase in sales or sales rank.
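A top-n list that also reports each item's rank for the prior measurement period might be computed as sketched here. The item names, sales figures, tie-breaking rule and function names are all hypothetical and chosen only to make the example self-contained.

```python
def rank_items(sales):
    """Map each item to its rank (1 = most sold); ties are broken by item name."""
    ordered = sorted(sales, key=lambda item: (-sales[item], item))
    return {item: position for position, item in enumerate(ordered, start=1)}


def top_n_with_prior(current, prior, n):
    """Return (current_rank, item, prior_rank) for the n best sellers.

    prior_rank is None for an item that had no sales in the prior period.
    """
    current_rank = rank_items(current)
    prior_rank = rank_items(prior)
    top = sorted(current_rank, key=current_rank.get)[:n]
    return [(current_rank[item], item, prior_rank.get(item)) for item in top]


# Hypothetical unit sales for two consecutive months.
this_month = {"blocks": 120, "puzzle": 95, "kite": 40, "robot": 150}
last_month = {"blocks": 130, "puzzle": 60, "kite": 80}

for row in top_n_with_prior(this_month, last_month, 3):
    print(row)
# (1, 'robot', None)
# (2, 'blocks', 1)
# (3, 'puzzle', 3)
```

A list ordered by change in rank could be derived from the same two rank tables by sorting on the difference between prior and current rank.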
As with the Web server that serves up specific pages for specific topics, such as one page per stock on a stock information Web site, sales statistics such as those described above are easy to generate. An electronic commerce server can simply log each purchase and then a program can scan the log for a period of time to determine sales levels for each item. The sales can also be easily categorized where the items are already categorized. For example, a book selling Web site can log all sales of books, where each book is already categorized (e.g., “fiction,” “reference,” “technical,” “self-help,” “other nonfiction,” etc.) and then aggregate the sales for each category to identify sales by category or top sellers within a category. However, the “top-n” or best-seller lists are limited in that the categorization of the items must be done manually or along lines that are set out ahead of time and worked into the data. Thus, such a system cannot be easily adapted to events that are not already well-categorized; it does not combine information across multiple events and types of events, nor is the information normalizable so that detailed and relative statistics can be derived.
Some traffic analysis modules have been used to analyze traffic over a Web site, but their functionality is limited. Such modules perform basic statistical analysis of Web server logs to determine Web site usage. They are typically not designed to compute interest in particular topics, although the statistics they offer indirectly reflect that interest. One problem with such modules is that they either rely on manual associations of events to topics or do not associate events with topics at all; the former approach is not scalable and the latter approach does not group events in a meaningful manner.
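The log-scanning approach described above, and its dependence on categories assigned ahead of time, can be sketched as follows. The item identifiers, the category table and the function name are hypothetical; the sketch assumes each logged purchase is just an item identifier.

```python
from collections import Counter

# Hypothetical, manually maintained category assignments.
CATEGORY_OF = {
    "book-1": "fiction",
    "book-2": "fiction",
    "book-3": "reference",
}

# Hypothetical purchase log for one period; "book-4" was never categorized.
PURCHASES = ["book-1", "book-3", "book-1", "book-2", "book-4"]


def sales_by_category(purchases, category_of):
    """Aggregate logged purchases by pre-assigned category.

    Returns the per-category totals and the list of purchased items
    that have no category and therefore cannot be aggregated.
    """
    totals = Counter()
    uncategorized = []
    for item in purchases:
        category = category_of.get(item)
        if category is None:
            uncategorized.append(item)
        else:
            totals[category] += 1
    return totals, uncategorized


totals, uncategorized = sales_by_category(PURCHASES, CATEGORY_OF)
print(dict(totals), uncategorized)
# {'fiction': 3, 'reference': 1} ['book-4']
```

The uncategorized purchase simply drops out of the category totals, which reflects the limitation that events must already be categorized before they can be counted.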
Heretofore, however, none of the statistics systems described above allows for the more sophisticated, and thus informative, measurements often needed to make overall strategy decisions with regard to trends, advertising purchases, popular culture review, product marketing and other decisions that need to be made in light of traffic statistics where the traffic relates to complex events and requests.