In general, devices that provide services over networks record log files, which contain the logs of individual services and store information about service operation. The logs of individual services may take a variety of forms. In the present specification, such logs are referred to as unstructured data because they have no consistent form. In addition, the unstructured data in the specification are not limited to text data but may include at least one of text data and binary data. Table 1 below provides examples of unstructured data in text form.
TABLE 1
<Unstructured text example 1 - Bro IDS log>
1351145805.760024 zPnv2YKLHqf 192.168.1.26 58349 114.108.1.2 80 unescaped_special_URI_char - F
<Unstructured text example 2 - SecuiNXG log>
<214>[LOG_DENIED] id=firewall time=“2014-03-22 p.m. 11:22:33” fw=nxg500.naver.com pri=6 rule=1 src=210.226.11.212 dst=192.168.1.100 proto=443/tcp src_port=9080 dst_port=80 act=DENY msg=“Count=1 Interface=External”
If the aforementioned unstructured data are stored as they are, a user cannot know what the individual items mean and cannot analyze them easily. It is therefore necessary to extract the individual fields, put them in a common form, and convert the extraction result into a structured form. This is referred to as normalization of the unstructured data. Table 2 below shows examples of the structured data obtained by normalizing the unstructured data described above.
TABLE 2
Name of field          Result of normalization    Result of normalization
                       of example 1               of example 2
Log generation time    2012-10-25 15:16:45        2014-03-22 23:22:33
Source IP              192.168.1.26               210.226.11.212
Source port            58349                      9080
Destination IP         114.108.1.2                192.168.1.100
Destination port       80                         443
Protocol               -                          TCP
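As a minimal sketch of how example 1 could be normalized by hand, the following Python code splits the Bro IDS record on whitespace and names the fields by position. The function name, the dictionary keys, and the UTC+9 offset are assumptions made for illustration. The specification does not state a time zone; the offset was chosen because it makes the epoch value in the record agree with the time shown in Table 2.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Table 2 shows local time at UTC+9 (not stated in the source).
ASSUMED_ZONE = timezone(timedelta(hours=9))

BRO_LINE = ("1351145805.760024 zPnv2YKLHqf 192.168.1.26 58349 "
            "114.108.1.2 80 unescaped_special_URI_char - F")


def normalize_bro(line):
    """Split a whitespace-separated record and name its fields by position."""
    parts = line.split()
    # The first field is a Unix epoch timestamp with fractional seconds.
    ts = datetime.fromtimestamp(float(parts[0]), tz=ASSUMED_ZONE)
    return {
        "log_time": ts.strftime("%Y-%m-%d %H:%M:%S"),
        "src_ip": parts[2],
        "src_port": int(parts[3]),
        "dst_ip": parts[4],
        "dst_port": int(parts[5]),
        "protocol": None,  # this record carries no protocol field
    }


record = normalize_bro(BRO_LINE)
print(record)
```

Under these assumptions the resulting dictionary matches the example 1 column of Table 2.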
In the past, two methods were mainly used to normalize unstructured data. In the first method, a program developer wrote code in a programming language individually for each of the differing unstructured data formats. In the second method, the unstructured data were normalized by directly defining meta information, i.e., the information necessary to understand the unstructured data, in the form of code such as XML.
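The second method can be pictured with the following sketch, in which an XML definition describes the fields of a format and a generic routine applies it. The XML element names, the attribute names, and the function name are hypothetical; the specification does not disclose the schema actually used by conventional tools.

```python
import xml.etree.ElementTree as ET

# Hypothetical meta information for the Bro IDS record of Table 1.
META_XML = """
<format name="bro_ids" separator=" ">
  <field index="2" name="src_ip" type="string"/>
  <field index="3" name="src_port" type="int"/>
  <field index="4" name="dst_ip" type="string"/>
  <field index="5" name="dst_port" type="int"/>
</format>
""".strip()

# Maps the type names used in the definition to Python converters.
CASTS = {"string": str, "int": int}

BRO_LINE = ("1351145805.760024 zPnv2YKLHqf 192.168.1.26 58349 "
            "114.108.1.2 80 unescaped_special_URI_char - F")


def normalize_with_meta(line, meta_xml):
    """Apply an XML field definition to one separator-delimited record."""
    root = ET.fromstring(meta_xml)
    parts = line.split(root.get("separator"))
    record = {}
    for field in root.findall("field"):
        cast = CASTS[field.get("type")]
        record[field.get("name")] = cast(parts[int(field.get("index"))])
    return record


meta_record = normalize_with_meta(BRO_LINE, META_XML)
print(meta_record)
```

The parsing routine is written once, but a definition of this kind must still be authored by hand for every log format.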
In the first method, it is almost impossible for an ordinary user who is not familiar with a programming language to normalize the unstructured data, and even a professional developer may need considerable time to do so.
The second method, which remedies the shortcoming of the first method to some degree, consists mainly of two steps: preprocessing and analysis. In the preprocessing step, the unstructured data are parsed and the resulting field values are displayed to the user. In the analysis step, the user reads the result, analyzes its meaning, determines a field name, and codes a format-converting rule that normalizes the type of each field value into a uniform structure. These conventional methods are problematic because the user must personally program the code at each step. When a field is extracted in the conventional preprocessing step through a separator or a regular expression directly designated by the user, the user must then read it and define the name of the corresponding item in the analysis step. In addition, the user cannot immediately see how the data are converted by the parsing in the preprocessing step and can check the result only after the data are stored. Moreover, since the user can check whether a data type is proper, and change it, only after the data are stored, the response to such a problem is slow.
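The two conventional steps described above can be sketched as follows, using the SecuiNXG record of Table 1. The regular expression, the rule table, and all function names are assumptions made for illustration, and the sample uses plain ASCII quotation marks. The rule table covers only a few fields.

```python
import re

RAW = ('<214>[LOG_DENIED] id=firewall time="2014-03-22 p.m. 11:22:33" '
       'fw=nxg500.naver.com pri=6 rule=1 src=210.226.11.212 '
       'dst=192.168.1.100 proto=443/tcp src_port=9080 dst_port=80 '
       'act=DENY msg="Count=1 Interface=External"')

# User-designated pattern: key=value, where the value may be quoted.
PAIR = re.compile(r'(\w+)=("[^"]*"|\S+)')


def preprocess(line):
    """Preprocessing step: extract raw field values as strings."""
    return {key: value.strip('"') for key, value in PAIR.findall(line)}


def to_24h(text):
    """Convert 'YYYY-MM-DD a.m./p.m. HH:MM:SS' to a 24-hour timestamp."""
    date, meridiem, clock = text.split()
    hour, minute, second = clock.split(":")
    hour = int(hour) % 12 + (12 if meridiem == "p.m." else 0)
    return f"{date} {hour:02d}:{minute}:{second}"


# Analysis step: the user names each item and chooses its type conversion.
RULES = {
    "log_time": ("time", to_24h),
    "src_ip": ("src", str),
    "src_port": ("src_port", int),
    "dst_ip": ("dst", str),
    "protocol": ("proto", lambda v: v.split("/")[1].upper()),
}


def analyze(fields):
    """Apply the hand-written rules to the preprocessed field values."""
    return {name: conv(fields[key]) for name, (key, conv) in RULES.items()}


fields = preprocess(RAW)
normalized = analyze(fields)
print(normalized)
```

Both the pattern and the rule table have to be written by the user, and in this arrangement a mistake in either one becomes visible only when the converted values are inspected.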
The present inventor therefore proposes a universal method for automatically normalizing unstructured data, and a system using the method, both of which are easy to use even for a user who is not a developer.