1. Field of the Invention
The present invention generally relates to the analysis and management of traffic data. More specifically, the present invention relates to a system, method and storage medium embodying computer-readable code for providing an efficient and adaptive web accesses filtering process for log analysis.
2. Discussion of the Related Art
Web sites have become one of the most important vehicles for reaching a potentially worldwide audience. Web servers interconnected via the Internet provide public access with minimal investment of time and resources in building a web site. Web sites make valuable information available for retrieval and posting. The information may be presented in a wide range of media and in a variety of formats, including audio, video and traditional text and graphics. Many web sites are also equipped with interactive mechanisms, such as on-line shopping, that allow a user to interact with the web site. The ease of creating a web site makes reaching a worldwide audience a reality for all types of users, from corporations, to startup companies, to governmental agencies, to organizations and individuals.
Unlike other formats of media, such as television or radio, web sites are interactive, and the web servers or an outside web site hosting service can passively gather access information about each user by observing and logging the traffic data packets exchanged between the web server and the user. Companies and organizations often employ outside web site hosting services not only to host their web sites and deal with complicated problems associated with the web sites, but also to generate web site analysis by observing the web access log of the web site. The web site analysis may, for example, develop detailed traffic statistics on a web site. The traffic statistics may include resources accessed, referrers, web server technical statistics and demographics information. Examples of resources accessed are statistical information related to most requested pages, most downloaded/uploaded files, most accessed directories, and paths users navigated through the web site. Examples of referrers are statistical information related to top referring sites/URLs (Uniform Resource Locators) and top search engines/keywords. Examples of web server technical statistics are statistical information related to server errors and client errors. Examples of demographics information are statistical information related to top geographic regions from which the web site is accessed, most active countries/organizations, and active states/cities/provinces.
The web site analysis may also generate reports with information on visitors and their behavior with respect to a web site. A visitor to a web site can be thought of as a person or a program that is accessing that web site. The visitor is identified either by the IP (Internet Protocol) address/domain name of a client machine or by a "cookie," which is a unique string that identifies each visitor. The information on visitors may include visitors by number of visits, new versus returning visitors, authenticated or unauthenticated visitors, and top visitors. The visitor's behavior with respect to a web site can be taken as how a user makes use of the web site. The behavior of a particular visitor may be identified from different statistics, such as the top paths taken, the top pages accessed, the top entry/exit pages from the web site, how many times the visitor returns at a later time, and how much time a visitor is spending on the web site.
The web site traffic analysis and visitors' information and behavior reports are important because they are often used to understand the effectiveness of a web site. However, there are difficulties associated with making the analysis and generating the reports in a timely fashion, especially in light of the fact that the number of accesses by users, or traffic data packet exchanges between users and the web site, can be very large. A popular web site is likely to contain many servers, each serving millions of accesses per day. A web site analysis service is likely responsible for many popular web sites at a given time. As a result, dealing with the access logs from all the popular web sites may involve processing billions of accesses per day. Moreover, some accesses to the web site, such as those from automated agents, third party performance services and quality assurance checks, reduce the accuracy of the analysis and reports. Automated agents, such as spiders for search engines, are programs that traverse web sites automatically for HTML (Hypertext Markup Language) validation, link validation, etc. Third party performance services, such as Keynote, generate web server performance statistics for a web site, e.g., how fast web servers of the web site respond to requests. While the accesses from automated agents, third party performance services, and the like are logged in web log files in the same way as accesses from individuals, they do not reflect user behavior. Consequently, these accesses should not be included in reports intended to reflect user behavior.
There have been conventional web site traffic analysis systems that use web logs for performing analysis, but they generally have one of two shortcomings. Either they do not filter their web logs, in which case their analysis contains a lot of "dirty" data that reduces the accuracy of their reports/analysis, or else they use a simple filtering mechanism before analyzing their web logs. The simple filtering mechanism is usually a simple linear search that compares each logged web access with each exclusion access from a list of accesses to be filtered. The simple approach could work for cases where a small number of web logs are involved and where the filtering requirements do not vary a great deal. However, as discussed, the volume of accesses or logs to be analyzed by a web site analysis service responsible for a number of web servers is quite large, and the set of accesses or logs to be filtered often varies between web servers. The simple filtering mechanism does not scale to allow processing of a large amount of data in a timely fashion and is not adaptive. As a result, it no longer suffices. Therefore, there is a need for a system and method that provides an efficient and adaptive IP address filtering process for log analysis.
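For illustration only, the simple linear-search filtering mechanism described above might be sketched as follows. This is a minimal sketch of the conventional approach, not of the present invention. The function name `linear_filter`, the representation of a logged access as an `(ip, path)` pair, and the sample addresses are all assumptions introduced here and do not come from any particular system.

```python
from typing import List, Tuple

# Hypothetical representation of one logged web access: (client IP, requested path).
Access = Tuple[str, str]


def linear_filter(logged: List[Access], exclusions: List[str]) -> List[Access]:
    """Keep only accesses whose client IP is absent from the exclusion list.

    Every logged access is compared against every exclusion entry in turn,
    so the work grows with len(logged) * len(exclusions).
    """
    kept: List[Access] = []
    for ip, path in logged:
        excluded = False
        for excluded_ip in exclusions:  # linear scan of the exclusion list
            if ip == excluded_ip:
                excluded = True
                break
        if not excluded:
            kept.append((ip, path))
    return kept


# Sample data (illustrative addresses from documentation/private ranges).
logged_accesses = [
    ("10.0.0.1", "/index.html"),
    ("192.0.2.7", "/robots.txt"),   # e.g., an automated agent
    ("10.0.0.2", "/cart"),
]
exclusion_list = ["198.51.100.3", "192.0.2.7"]

filtered = linear_filter(logged_accesses, exclusion_list)
```

Because each access may be compared against the entire exclusion list, the cost of this approach is proportional to the number of logged accesses multiplied by the number of exclusion entries, which is consistent with the scaling difficulty noted above when billions of accesses must be processed per day.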