With the growth of information services on the Internet, government departments, companies, academic institutions, and research institutions either already have their own websites or are building them. Each website is served by a Web server at the back end. A Web server is software that manages Web pages and makes them available to client browsers over a local network or the Internet. Commonly used Web servers today include Apache, IIS, and iPlanet's Enterprise Server. Managing a website requires not only monitoring the daily throughput of its server, but also understanding how each page of the website is accessed, improving page content and quality based on each page's click frequency, improving the readability of the content, tracking data for business transaction procedures, and tracking background data used to manage the website.
This is particularly true for online companies whose business is e-commerce or search. The operation and access records of the Web server therefore need a detailed and thorough analysis in order to understand how the associated website is running and to discover existing deficiencies, so that the website can be developed further. These requirements may be met by compiling statistics on, and analyzing, the log file of the Web server. Common log analysis tools include WebTrends, Wusage, wwwstat, http-analyze, pwebstats, WebStat Explorer, webalizer, AWStats, etc. Analyzing and examining a log file is a complex process of uncovering previously unknown and valuable patterns or rules in a very large amount of data for the purpose of decision making.
As the contents of a website are continuously updated and changed, the management team of the website needs to obtain the analysis result of a log file in a timely manner, e.g., to obtain statistical data such as the PV (i.e., page views) of the previous day on the next working day. At the same time, as the Internet becomes more popular, the number of Internet users increases continuously. The page views of a website may grow from hundreds of thousands or millions to tens of millions or hundreds of millions. The log file data volume of a Web server may likewise grow from tens of megabytes to tens of gigabytes, and even to terabytes. The time allowed for log file statistics and analysis, however, has not been relaxed. As a result, how to conduct analysis and statistics on ever-growing log files in a timely and efficient manner has become an unavoidable problem for those skilled in the art.
Commonly used existing methods perform log analysis on a distributed computing network. A distributed computing network is a computing cluster made up of multiple computers. The fundamental concept of distributed processing is to divide a file into multiple smaller files that are independent of one another. Each part of the file can then be processed separately on a different machine, and the analysis results can be combined at the end.
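The divide, process, and combine concept described above may be sketched as follows. This is only an illustrative sketch under stated assumptions: the log line format ("user url"), the round-robin division, and the function names split_into_chunks, analyze_chunk, and combine are hypothetical, and the chunks are processed in a simple loop that stands in for separate machines.

```python
from collections import Counter


def split_into_chunks(lines, num_chunks):
    """Divide the log lines into num_chunks independent pieces (round-robin)."""
    return [lines[i::num_chunks] for i in range(num_chunks)]


def analyze_chunk(chunk):
    """Per-node analysis: count page views for each URL in one chunk."""
    counts = Counter()
    for line in chunk:
        # Assumed line format: "<user> <url>" (hypothetical, for illustration).
        _user, url = line.split()
        counts[url] += 1
    return counts


def combine(partial_results):
    """Combine the per-node results into a final page-view count."""
    total = Counter()
    for partial in partial_results:
        total.update(partial)
    return total


log_lines = [
    "u1 /news/1",
    "u2 /forum/7",
    "u1 /news/2",
    "u3 /blog/4",
    "u2 /news/1",
]

chunks = split_into_chunks(log_lines, 3)
# Each chunk could be handled by a different machine; a loop stands in here.
partials = [analyze_chunk(chunk) for chunk in chunks]
page_views = combine(partials)
```

Because a page-view count for one line does not depend on any other line, the combined result equals the result of analyzing the undivided file. This holds only for statistics in which the divided parts are truly unrelated to one another.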
FIG. 1 shows a diagram illustrating a topological structure of a distributed computing network. In this figure, a log analysis server 110 is responsible for obtaining a log file from a web server, dividing it, and sending the divided log files separately to nodes 121, 122, and 123 for analysis. After the analysis is completed, the log analysis server 110 obtains and combines the analysis results of the nodes to produce a final log analysis result for the web server. In analyzing a log file using distributed processing, a common practice is to divide the log file according to the structure of the associated website. For example, if the contents of a website consist of news, forum, and blog, the log file is divided into a news log, a forum log, and a blog log, which are processed by the nodes 121, 122, and 123, respectively. Naturally, a user may add new nodes according to the number of divided log files.
In reality, however, the activities of a user in accessing a website are continuous. The above processing method divides the log information of a user who has accessed the news channel, the forum, and the blog into three parts, so that a complete access path of the user cannot be obtained. For example, the user may have accessed eight pages: news on the first two pages (ua1, ua2), forum on the third and fourth pages (ub3, ub4), news on the fifth and sixth pages (ua5, ua6), and blog on the last two pages (uc7, uc8). Under this circumstance, the access path of the user is divided into three parts: the first part is ua1, ua2, ua5, ua6, the access path of the user in the news channel; the second part is ub3, ub4, the access path in the forum; and the third part is uc7, uc8, the access path in the blog. As a result, originally related contents are processed separately by three nodes, which breaks up the user's access history and makes it impossible to analyze the relationships among the various contents.
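The loss of the access path in the eight-page example above may be illustrated with the following sketch. The page identifiers come from the example; the helper names channel_of, split_by_channel, and cross_channel_transitions, and the convention that the second character of an identifier encodes the channel (a for news, b for forum, c for blog), are assumptions made only for illustration.

```python
def channel_of(page):
    """Map a page id such as 'ua1' to its channel (assumed naming convention)."""
    return {"a": "news", "b": "forum", "c": "blog"}[page[1]]


def split_by_channel(path):
    """Divide one user's access path the way a per-channel log split would."""
    parts = {}
    for page in path:
        parts.setdefault(channel_of(page), []).append(page)
    return parts


def cross_channel_transitions(path):
    """List the moves between different channels that are visible in a path."""
    return [
        (channel_of(prev), channel_of(curr))
        for prev, curr in zip(path, path[1:])
        if channel_of(prev) != channel_of(curr)
    ]


# The complete access path of the user in the example.
access_path = ["ua1", "ua2", "ub3", "ub4", "ua5", "ua6", "uc7", "uc8"]

# What each node receives after the log is divided by website structure.
parts = split_by_channel(access_path)

# Transitions recoverable from the complete path.
full_transitions = cross_channel_transitions(access_path)

# Transitions recoverable when each node sees only its own channel's part.
per_node_transitions = [
    t for part in parts.values() for t in cross_channel_transitions(part)
]
```

In this sketch the complete path shows the moves news to forum, forum to news, and news to blog, whereas no single node can observe any of them, and the news node sees ua2 followed directly by ua5 as if no forum pages had been visited in between.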