1. The Field of the Invention
The field of this invention is in the area of usage tracking systems, methods, and apparatuses for computer servers and the like. More specifically, the invention is directed to usage tracking systems, methods, and apparatuses for tracking user activity on information servers, more particularly information servers in hyper-media systems such as the World-Wide Web ("WWW"). The invention also relates to client-side usage tracking servers for computers connected by a communications network according to the client-server model.
2. Present State of the Art
The proliferation of communication networks has provided substantial opportunity for easy access to a wealth of information from a variety of sources. Many times, such information is made available on a server computer that can be accessed through use of a communications network by one or more remote client computers using the appropriate conventions and protocols. For example, a user may dial up a bulletin board by means of a modem in order to access the information contained at the bulletin board site. Also, the Internet provides large-scale access to many different kinds of information by a variety of clients.
Particularly important in this field of technology are hyper-media systems such as the World-Wide Web. Because a user may quickly and easily transition between desired objects, usually by means of a graphical user interface ("GUI"), the "Web" and other such systems have shown great promise in reaching a large percentage of an ever-growing potential audience. A user can select with a mouse or other pointing device certain icons or areas on the screen that link to a different location at that particular Web site or to an entirely different Web site, which may be physically located anywhere in the world.
The World-Wide Web is based upon the Hyper Text Transfer Protocol ("HTTP"), which allows a user to quickly and easily access any number of servers attached to the Internet and to quickly and easily jump from one location to another. The locations may be on the same information server that a user is currently "visiting" or may be an information server located half way around the world. This "Web" of information servers represents a vast store of easily accessible information.
Typically, a user will access a particular object, often a hypertext document (though audio files, video clips, and other object types exist and are popular), from an information server to be processed or interpreted at the client computer running a "Browser." A hypertext document is an ASCII file having text and coded information according to the Hyper Text Markup Language ("HTML") definition. The HTML codes within the hypertext document object will be interpreted by a client browser to format the textual characters in a pleasing manner on the browser screen.
The hypertext document object may also have HTML codes that refer the client browser to other objects, such as image files, that are designed for the document object page layout. These images are designed to appear on the browser screen alongside the formatted text. In order to accommodate user control and to minimize image processing time, client browsers in many instances allow alternative handling of image data. For example, browsers may be selectively set to not process images at all, to process them in an abbreviated fashion, or to allow processing to be interrupted when the user chooses to scroll past an image or leave the document.
HTML codes also exist for directing the browser to selectively access, rather than automatically access, other objects at any Web location worldwide. Typically, an icon image or a portion of visually distinguished text is selected using a mouse or other pointing device to cause the browser to access the referenced object. The referenced object may be another location within the existing object, another object on the same information server, or another information server anywhere on the Internet that supports HTTP. Such hypertext "links" allow easy perusal of related information for users navigating through content arranged in such an organized fashion.
When a service provider makes information or services available from an information server via a communications network such as the Internet, it may be helpful to track the usage of that information or those services (i.e., requests for access to the information at the different areas of the server where the information is located) for many purposes. For example, a service provider could optimize the information content or services based on the popularity of certain types of information or services, or improve the organization of the hypertext linkages so that popular information or services are acquired more quickly and efficiently.
Furthermore, statistical and demographic information regarding client and/or user usage may be helpful in soliciting advertisers and sponsors for particular hyper-media projects. Knowing what kind of information is popular, combined with the audience interested in that information, provides another means of access to cognizable consumer groups. Focused, pinpoint information can also allow better tailoring of information to particular user profiles.
Currently, usage tracking is done on the server side of a system arranged according to the client-server model. Every time a request is made from a client, the server will log particular information for future reference and analysis. For example, useful information to track would be the time spent by a client connected to the information server, who the client is, and what files or other objects in the server hierarchy have been requested.
The granularity, or amount of detail, found in server-tracked information depends upon the accessing protocol. In the World-Wide Web, it is possible to track the following kinds of information from the information server side: the product and version of the client, the name of the user making the request, and the address of the previous object (the "referer"), amongst other information.
A number of problems with information server side tracking diminish the value of the information in the usage log. Some of these problems actually impede the efficiency of, or slow down, the delivery system, making it inconvenient and annoying to users because of the undue lag time between a request for information or an object and receipt of the requested object from the information server.
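By way of illustration only, the kinds of server-side fields listed above might be recovered from a log entry as in the following sketch. The log layout (an NCSA "combined" style line), the function name parse_log_line, and the sample entry are assumptions made for this example and are not taken from the description.

```python
import re

# Assumed log layout (NCSA "combined" style); the description does not specify one.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_log_line(line):
    """Return the tracked fields of one log entry, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


# Hypothetical entry, invented for illustration.
sample = (
    '192.0.2.10 - alice [10/Oct/1996:13:55:36 -0700] '
    '"GET /products/index.html HTTP/1.0" 200 2326 '
    '"http://www.example.com/home.html" "ExampleBrowser/2.0"'
)
entry = parse_log_line(sample)
```

Here the "agent" field carries the product and version of the client, "user" the user name, and "referer" the address of the previous object.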
One problem is the amount of server overhead in terms of processing, disk access, etc., that is required to keep a server side log of client requests directed to that information server. Because the information server must process such a log, there may be diminished capacity in the processor and other system resources for responding to client requests, thereby reducing the overall throughput of the information server in terms of the amount of information sent to clients.
Another problem is the nature of server-side logged usage data. Often, a substantial amount of complex processing must be done on the server-side log in order to glean useful and relevant information. This is due, in part, to the very low level of logging, which is based on asynchronous requests, and to the lack of user-specific or session-specific metrics that could aid in the statistical and demographic collection process. In other words, it would be useful to have information aggregated into a number of request-response "transactions" that better model the nature of user activity at the particular information server. Such "session" resolution does not inherently exist in some stateless hyper-media protocols such as HTTP.
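The kind of post-processing needed to approximate sessions from individually logged requests may be sketched as follows. This is a minimal illustration under stated assumptions: the 30-minute idle threshold, the record layout (client, timestamp in seconds, path), and the name group_into_sessions are hypothetical, and the client address is used as a stand-in for user identity, which the stateless protocol does not itself supply.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds; an assumed threshold, not from the description


def group_into_sessions(requests, timeout=SESSION_TIMEOUT):
    """Group (client, timestamp, path) records into per-client sessions.

    A new session starts whenever a client has been idle longer than `timeout`.
    """
    sessions = defaultdict(list)  # client -> list of sessions (each a list of paths)
    last_seen = {}                # client -> timestamp of that client's latest request
    for client, timestamp, path in sorted(requests, key=lambda r: r[1]):
        if client not in last_seen or timestamp - last_seen[client] > timeout:
            sessions[client].append([])
        sessions[client][-1].append(path)
        last_seen[client] = timestamp
    return dict(sessions)


# Hypothetical request records, invented for illustration.
requests = [
    ("192.0.2.10", 0, "/index.html"),
    ("192.0.2.10", 60, "/products.html"),
    ("192.0.2.11", 90, "/index.html"),
    ("192.0.2.10", 4000, "/index.html"),
]
sessions = group_into_sessions(requests)
```

In this example the first client's final request falls outside the idle threshold and is therefore attributed to a second session. Any such grouping is an inference from the log rather than a fact recorded by the server.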
Requested data can also be cached at a number of locations between the client and the information server. For example, the client may cache recently acquired information and not generate a request to the information server while one or more users are browsing through information in the client-side cache. This valuable information as to user interest in a particular document may be entirely lost by existing server side usage tracking facilities.
Furthermore, intermediate "proxy" servers or "gateways", which often transparently exist as part of the communications path between the client and the server, may contain information wanted by the client. If such a case arises, the proxy server can fulfill the request without resort to the information server. Consequently, the information server may lose knowledge that a request for that particular piece of information has been made by a client.
Statistically, a very large number of Web clients use proxy servers rather than making direct contact with an information server. Most corporations and online services deploy proxy servers to improve the efficiency of their service to their users (i.e., employees or subscribers) and to provide an added level of security by not allowing their users' client software direct Internet access.
Another proxy server application occurs when the proxy server provides a logical portion of the Internet with access to information without each client in that logical portion having to communicate with the original source of the information that may be half the world away. For example, all clients in Australia wanting access to a particular object located in Canada would benefit by having a proxy server for that object located in Australia.
The proxy server stores a copy of the object when the first client requests it, and all subsequent requests for that object can then be serviced by the proxy server rather than the original information server. An object on a proxy server will eventually "expire" and be erased according to some algorithm optimized for overall Internet performance. The algorithm may be based on time, number of requests for the object, proxy server storage capacity, and other relevant factors. Internet users between Australia and Canada also benefit since traffic in that area has been reduced by use of the proxy server, allowing more efficient utilization of the transmission link.
Because the use of a proxy server will in many instances cause an information server to miss important information with respect to client requests, an information server that is able to control proxy server operation will often be set to force all requests to the information server (i.e., by setting all information to expire immediately), thereby destroying the advantages of reduced network traffic and reduced information server load, and degrading the quality of the users' experience overall.
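The effect of proxy caching on what the information server is able to log may be illustrated by the following sketch. It assumes a purely time-based expiry; the class names OriginServer and CachingProxy, the path, and the time values are hypothetical and chosen only for the example.

```python
class OriginServer:
    """Stand-in for the information server; counts the requests it actually sees."""

    def __init__(self):
        self.logged_requests = 0

    def fetch(self, path):
        self.logged_requests += 1
        return f"contents of {path}"


class CachingProxy:
    """Serves repeat requests from a local copy until the copy expires."""

    def __init__(self, origin, ttl_seconds):
        self.origin = origin
        self.ttl = ttl_seconds
        self.cache = {}  # path -> (body, time the copy was stored)

    def get(self, path, now):
        cached = self.cache.get(path)
        if cached is not None and now - cached[1] < self.ttl:
            return cached[0]  # served locally; the origin never sees this request
        body = self.origin.fetch(path)
        self.cache[path] = (body, now)
        return body


origin = OriginServer()
proxy = CachingProxy(origin, ttl_seconds=3600)
for t in (0, 10, 20, 30):  # four client requests, times in seconds
    proxy.get("/report.html", now=t)
```

After the four client requests above, the information server has logged only one. Constructing the proxy with ttl_seconds=0 models immediate expiry: every request then reaches the information server, and the benefit of the cache is lost.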
Other forms of creating greater efficiency or security may inadvertently conceal pertinent information from the information server usage tracking system. Inaccuracies are introduced because of proxies, address translation devices, and other forms of address aggregation.
Network address translation boxes ("NATs") are used by various organizations in connecting to the Internet and allow many different client requests to go through a single IP address, which hides information from the information server. Specifically, NAT boxes make multiple clients appear as one by sharing a common network address across multiple servers to which the clients are logically connected. A small number of such NAT boxes will allow access for many hundreds or thousands of users throughout an organization, causing an identity problem for the information server because of the shared IP address. Thus, to the information server it appears that only one user or client is accessing a particular information server rather than all of the individuals from an entire organization.
This concealment also occurs when Internet subscription services and national on-line services such as Compuserve® share a common IP address across a number of different subscribers or users. The problem is exacerbated when an organization uses a network address translation box and has Internet access by means of a subscription service.
Yet another problem that may impair the validity of information server usage tracking statistics is the activity of Web crawlers and other forms of automated information gathering. These automated information gathering programs will travel through the Web and categorize the different available objects. Data searches can then be made on this aggregated, organized, and abstracted information to find desirable Web sites.
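The identity problem caused by a shared address may be illustrated with a small sketch. The addresses and the translate function are hypothetical; the function models only the end result of address sharing, not the operation of any actual translation device.

```python
# Hypothetical private client addresses behind one translation box.
clients = ["10.0.0.5", "10.0.0.6", "10.0.0.7"]
SHARED_ADDRESS = "203.0.113.1"  # assumed public address shared by all clients


def translate(client_address):
    """Model of address sharing: every outbound request carries the same source."""
    return SHARED_ADDRESS


# What the information server records, one entry per client request.
server_log = [translate(c) for c in clients]
distinct_seen_by_server = len(set(server_log))
```

Three distinct clients issue requests, yet a usage tracker that counts distinct source addresses would record only one.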
The information gathering programs will sometimes leave "access tracks" that are not true representations of user activity and hence may skew an analysis of information server usage tracking logs. Despite the aforementioned problems, usage tracking, though imperfect and fraught with potential shortfalls, remains a very important activity in guiding service providers in terms of what to provide and how to organize it. Because of such interest, there exists a strong need to improve the quality of usage tracking whenever possible.