1) Field of the Invention
The present invention relates to a technology for classifying a plurality of uniform resource locators (URLs), which are substantially the same but slightly different, into a plurality of groups in order to perform a log analysis on an IP network including a plurality of servers, such as a Web server, an application server, and a database server.
2) Description of the Related Art
In a system including a plurality of servers, such as a Web server, an application server, and a database server, each server operates in conjunction with the other servers. For example, the Web server receives a hypertext transfer protocol (HTTP) request from a user and sends it to the application server; the application server receives the request and sends a structured query language (SQL) query to the database server; and the database server receives the SQL query and searches a database therein. Conventionally, however, it has been difficult to determine, for example, the cause of a fault that has occurred in the system or a bottleneck in the system, since the conventional technology only monitors the performance of each server individually (for example, its utilization and its cache-hit ratio).
For the determination, it is necessary to perform a so-called “data classification process”. A typical example of the data classification is a process of picking out the same person's records redundantly registered in a customer database of a company.
For example, when a plurality of records of a customer A are found in the customer database by comparing each customer's attributes (such as name, telephone number, and address), all records of the customer A are integrated into one record.
In such cases, however, character string comparison sometimes cannot determine whether the records belong to the same customer. For example, the telephone numbers in the records can differ because the customer A has changed the telephone number. Similarly, the addresses in the records can differ because the addresses in some records are abbreviated. The data classification process is required for such cases, and includes conversion into a regular expression, deletion of unnecessary parameters, and grouping of character strings that are substantially the same.
According to the present invention, however, a data classification for a log analysis of a system including a Web server is taken as an example. The analysis is performed for evaluating performance of the system by calculating an average of response time of the Web server for each Web page, based on a log of the Web server in which the URL and the time of each access are recorded.
The data classification is performed on the URLs, which differ from each other in structure, property, or object. The URLs include not only a static URL corresponding to an existing Web page, but also a dynamic URL corresponding to a Web page to be created by an application program. The dynamic URL includes a filename and parameters of the application program. Examples of the URL are:
    (0) http://hostname/static.html;
    (1) http://hostname/dynamic.asp?PARAM1=v1&PARAM2=v2&PARAM3=v3&PARAM4=v4;
    (2) http://hostname/dynamic.asp?PARAM1=v1&PARAM3=v3&PARAM4=v4;
    (3) http://hostname/dynamic.asp?PARAM1=vx&PARAM3=v3&PARAM5=v5; and
    (4) http://hostname/program.asp?PARAM2=v2&PARAM4=v4.
The example (0) is an example of the static URL, which identifies a file “static.html” on a Web server “hostname”. The examples (1) to (4) are examples of the dynamic URL, each of which includes the filename (such as “http://hostname/dynamic.asp” and “http://hostname/program.asp”) and a list of parameters following “?”. Each parameter includes a parameter name and the value thereof, joined by “=”, and the parameters are separated from each other by “&”. The URL in the example (1) includes parameters PARAM1, PARAM2, PARAM3, and PARAM4 with values v1, v2, v3, and v4, respectively.
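The structure described above can be illustrated with a minimal Python sketch. The function name split_url is hypothetical and chosen only for illustration; the sketch assumes that the standard urllib.parse module is an acceptable way to separate the filename from the parameter list.

```python
from urllib.parse import urlsplit, parse_qsl

def split_url(url):
    """Split a URL into its filename part and its list of (name, value) parameters.

    A static URL yields an empty parameter list.
    """
    parts = urlsplit(url)
    # The "filename" in the sense used above: everything before the "?".
    filename = f"{parts.scheme}://{parts.netloc}{parts.path}"
    # Parameters are separated by "&"; each name is joined to its value by "=".
    params = parse_qsl(parts.query, keep_blank_values=True)
    return filename, params

filename, params = split_url(
    "http://hostname/dynamic.asp?PARAM1=v1&PARAM2=v2&PARAM3=v3&PARAM4=v4"
)
print(filename)  # http://hostname/dynamic.asp
print(params)    # [('PARAM1', 'v1'), ('PARAM2', 'v2'), ('PARAM3', 'v3'), ('PARAM4', 'v4')]
```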
When an operations manager tries to determine whether the Web server is operating normally, the sum of processing times of all accesses is divided by the number of the accesses for calculating an average time that is required for the Web server to send the Web page to a client after receiving a request.
However, what the operations manager actually wants to know can be an average processing time for each program or for each pattern of parameters for the program. When the operations manager focuses on the average processing time of each program, it can be calculated by ignoring all the parameters included in the URL. However, in some cases, the processing executed by a program can differ greatly according to whether a specific parameter/value or a specific combination of parameters/values is included in the URL. Therefore, if the average processing time of each program is calculated as described above, the operations manager can overlook a potential failure that occurs only when a specific parameter/value or a specific combination of parameters/values is included in the URL, even though the purpose of the analysis is to identify such a potential failure and the components of the system affected by it.
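The difference between the overall average and the per-program average can be shown with a small Python sketch. The log entries and the processing times below are invented purely for illustration, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical log entries: (URL, processing time in milliseconds).
LOG = [
    ("http://hostname/dynamic.asp?PARAM1=v1&PARAM2=v2", 1000),
    ("http://hostname/dynamic.asp?PARAM1=v1", 100),
    ("http://hostname/dynamic.asp?PARAM1=vx", 100),
    ("http://hostname/program.asp?PARAM2=v2", 200),
]

def overall_average(log):
    """Sum of all processing times divided by the number of accesses."""
    return sum(t for _, t in log) / len(log)

def average_by_program(log):
    """Average processing time per program, ignoring all parameters."""
    totals = defaultdict(lambda: [0, 0])  # program -> [total time, access count]
    for url, t in log:
        program = url.split("?", 1)[0]  # drop everything after "?"
        totals[program][0] += t
        totals[program][1] += 1
    return {p: total / count for p, (total, count) in totals.items()}

print(overall_average(LOG))     # 350.0
print(average_by_program(LOG))
```

In this invented data, the per-program average for dynamic.asp is 400 ms, which hides the fact that only the access carrying PARAM2 took 1000 ms while the others took 100 ms.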
In such cases, therefore, the average processing time needs to be calculated for each parameter/value included in the URL. However, if the URL is treated merely as a character string, there will be too many types of URLs that differ only slightly in their parameters. For example, the URLs in the examples (1), (2), and (3) include different parameters and values, even though they include the same program filename “http://hostname/dynamic.asp”.
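As a rough illustration, the following Python sketch counts how many groups result when the example URLs are compared as plain character strings and how many result when only the filename is compared. The variable names are hypothetical.

```python
# The example URLs (0) to (4) given above.
URLS = [
    "http://hostname/static.html",
    "http://hostname/dynamic.asp?PARAM1=v1&PARAM2=v2&PARAM3=v3&PARAM4=v4",
    "http://hostname/dynamic.asp?PARAM1=v1&PARAM3=v3&PARAM4=v4",
    "http://hostname/dynamic.asp?PARAM1=vx&PARAM3=v3&PARAM5=v5",
    "http://hostname/program.asp?PARAM2=v2&PARAM4=v4",
]

# Treating each URL as a character string: every URL forms its own group.
distinct_strings = set(URLS)

# Comparing only the part before "?": URLs of the same program fall together.
distinct_filenames = {url.split("?", 1)[0] for url in URLS}

print(len(distinct_strings))    # 5
print(len(distinct_filenames))  # 3
```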
In such cases, the URLs of the examples (0) to (4) need to be converted into a regular expression, for example:
    (0′) http://hostname/static.html;
    (1′) http://hostname/dynamic.asp;
    (2′) http://hostname/dynamic.asp;
    (3′) http://hostname/dynamic.asp; and
    (4′) http://hostname/program.asp,
or, when the operations manager focuses on the PARAM2, for example:
    (0″) http://hostname/static.html;
    (1″) http://hostname/dynamic.asp?PARAM2=v2;
    (2″) http://hostname/dynamic.asp;
    (3″) http://hostname/dynamic.asp; and
    (4″) http://hostname/program.asp?PARAM2=v2.
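A conversion of this kind might be sketched in Python as follows. The function names normalize and group_urls and the keep argument are hypothetical; the sketch assumes that the rule consists only of a set of parameter names to retain, and it does not re-encode special characters in parameter values.

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl

def normalize(url, keep=()):
    """Drop every parameter whose name is not in `keep`; keep the filename."""
    parts = urlsplit(url)
    base = f"{parts.scheme}://{parts.netloc}{parts.path}"
    kept = [
        f"{name}={value}"
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name in keep
    ]
    return base + ("?" + "&".join(kept) if kept else "")

def group_urls(urls, keep=()):
    """Group the original URLs under their normalized form."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalize(url, keep)].append(url)
    return dict(groups)

url1 = "http://hostname/dynamic.asp?PARAM1=v1&PARAM2=v2&PARAM3=v3&PARAM4=v4"
url2 = "http://hostname/dynamic.asp?PARAM1=v1&PARAM3=v3&PARAM4=v4"

print(normalize(url1))                    # http://hostname/dynamic.asp
print(normalize(url1, keep=("PARAM2",)))  # http://hostname/dynamic.asp?PARAM2=v2
print(normalize(url2, keep=("PARAM2",)))  # http://hostname/dynamic.asp
```

With an empty keep set the URLs are grouped by program only; naming a parameter in keep splits each program's group according to the value of that parameter.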
However, in many cases, even the operations manager cannot determine, at the time of preparation, what rules the data classification requires in order to perform the analysis.
For example, in many cases, which parameter influences the processing time of a program is known only to the designer of the program or is described only in the design specification. Generally, the designer is not the operations manager. Moreover, the actual program can have been modified from that described in the design specification.
Conventionally, the analysis and the preparation thereof, such as the data classification, were performed on a trial and error basis. Moreover, the preparation had to be programmed for each analysis (for example, see Japanese Patent Application Laid-open Publication No. H4-297229).
However, it is not realistic to perform the data classification manually in a large-scale system. Even if the preparation (including the data classification) is performed automatically by a program, it can take a lot of time to create the program for each analysis.