1. Field of the Invention
This invention generally relates to a method and system for estimating entropy, and more particularly, to a method and system for estimating entropy adapted to flow analysis of high-speed networks.
2. Description of the Prior Art
In recent years, networks have developed rapidly and transmission rates keep increasing. As a result, cyber attacks and unusual behaviors, for example worms, port scans, distributed denial of service (DDoS) attacks, and address scans, have become more prevalent. These cyber attacks and unusual behaviors may affect the normal network environment and its users. In network monitoring, packet header information may be analyzed and counted to observe whether an abnormal state occurs in the network.
Entropy, which measures the degree of concentration or dispersion of a given feature space, is used as an important indicator of changes in network traffic behavior. A higher entropy value indicates greater dispersion, and a lower entropy value indicates greater concentration. The entropy value is computed from the packet header information in order to understand changes in the network flow distribution and to determine whether a cyber attack or unusual behavior exists. Take DDoS as an example: packets from many different source IP addresses are transmitted to the same destination IP address, so the number of distinct source IP addresses increases and the number of attack packets addressed to the destination IP address increases. The distribution of network flows changes accordingly. Such distributions change the entropy value: if the traffic is concentrated on a few specific flow IDs, the entropy value decreases, whereas if the traffic is evenly dispersed across flow IDs, the entropy value increases. Entropy analysis is therefore widely applied in network security applications to identify the distribution features of network flows and to determine whether a cyber attack or unusual behavior is occurring.
In a high-speed network, a mass of data must be analyzed and counted in a short time. If this data is processed by software, a long operation time and a huge storage space are required. Moreover, in information theory the entropy value measures the degree of concentration of the data, and computing it requires counting all of the packet header information in real time, which consumes much time and many storage resources. The traditional entropy equation is given in (1), where m is the total number of packets, m_i is the number of packets of flow ID i ∈ [n], and n is the number of distinct flow IDs.
$$H = -\sum_{i=1}^{n} \frac{m_i}{m} \log \frac{m_i}{m} \tag{1}$$
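As an illustration of equation (1), the following minimal sketch computes the exact entropy from per-flow packet counts. The function name `exact_entropy`, the sample flow IDs, and the choice of base-2 logarithms are assumptions made for this example only; the equation itself does not fix the logarithm base.

```python
import math
from collections import Counter

def exact_entropy(flow_ids):
    """Exact entropy per equation (1): H = -sum((m_i / m) * log(m_i / m)).

    Base-2 logarithms are an assumption of this sketch.
    """
    counts = Counter(flow_ids)   # m_i for each distinct flow ID
    m = sum(counts.values())     # total number of packets
    return -sum((mi / m) * math.log2(mi / m) for mi in counts.values())

# Hypothetical traffic: one dominant flow versus four equally loaded flows.
concentrated = ["A"] * 97 + ["B", "C", "D"]
dispersed = ["A", "B", "C", "D"] * 25

print(exact_entropy(concentrated))  # low value: traffic is concentrated
print(exact_entropy(dispersed))     # high value: traffic is evenly dispersed
```

This exact calculation keeps one count per distinct flow ID, which is the storage cost that becomes a burden at high packet rates.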
To calculate the entropy value in a high-speed network environment, Ashwin Lall proposed data streaming algorithms for estimating the entropy of network traffic, rewriting the entropy equation as (2). The S value is defined from equation (2) as shown in (3). Ashwin Lall observes that the entropy value H can be calculated from an estimate of the S value. The S value is therefore estimated by the data streaming algorithm, and the estimate is then substituted into equation (4) to obtain the final estimated entropy value.
$$H = \log(m) - \frac{1}{m} \sum_{i=1}^{n} m_i \log(m_i) \tag{2}$$

$$S = \sum_{i=1}^{n} m_i \log(m_i) \tag{3}$$

$$H = \log(m) - \frac{S}{m} \tag{4}$$
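For reference, the step from equation (1) to equations (2) and (4) can be worked out as follows. This derivation is supplied here as a reading aid and relies only on the definitions already given, in particular that the per-flow counts sum to the total, i.e. the sum of m_i over all i equals m.

```latex
\begin{aligned}
H &= -\sum_{i=1}^{n} \frac{m_i}{m} \log \frac{m_i}{m}
   = -\sum_{i=1}^{n} \frac{m_i}{m} \bigl( \log(m_i) - \log(m) \bigr) \\
  &= \log(m) \cdot \frac{1}{m} \sum_{i=1}^{n} m_i
     - \frac{1}{m} \sum_{i=1}^{n} m_i \log(m_i) \\
  &= \log(m) - \frac{S}{m},
  \qquad \text{since } \sum_{i=1}^{n} m_i = m .
\end{aligned}
```

Because m is known, estimating H reduces to estimating the single quantity S.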
The data streaming algorithm for estimating entropy, which is derived from the AMS algorithm, is shown in Table 1. The algorithm is divided into three phases. In the first phase, g×z locations are randomly sampled in the data stream; the packet number m must be known and the allowable error must be set before sampling in order to determine the number of samples. For the selection of g and z, refer to equation (5). The second phase is divided into two portions: update and sample. In the update portion, the flow ID of every packet in the data stream is compared with the sampled flow IDs, and the counter c is incremented if the packet's flow ID has been sampled. In the sample portion, a new counter initialized to 1 is started when the current location is one of the sampled locations. In the third phase, each counter is converted into an estimate X by equation (6), the estimates are averaged within each of the g groups, and the median of the g averages is returned.
$$z = \left[ \frac{32 \log_2 m}{\varepsilon^2} \right], \qquad g = 2 \log_2 \frac{1}{\delta} \tag{5}$$

$$X = m \bigl( c \log(c) - (c - 1) \log(c - 1) \bigr) \tag{6}$$
TABLE 1
Pre-processing stage
    z = [32 log2(m) / ε^2], g = 2 log2(1/δ)
    choose g × z locations in the stream at random
Online stage
    for each item a_j in the stream do
        if a_j already has one or more counters then
            increment all of a_j's counters
        if j is one of the randomly chosen locations then
            start keeping a counter for a_j, initialized at 1
Post-processing stage
    // View the g × z counters as a matrix c of size g × z
    for i := 1 to g do
        for j := 1 to z do
            X_{i,j} = m(c_{i,j} log(c_{i,j}) − (c_{i,j} − 1) log(c_{i,j} − 1))
    for i := 1 to g do
        avg[i] := the average of the X in group i
    return the median of avg[1], ..., avg[g]
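The three stages of Table 1 can be sketched in software as follows. This is an illustrative sketch only, not the implementation of the cited work: the function name `estimate_entropy`, its parameters, rounding z and g up to integers, sampling locations with replacement, base-2 logarithms, and treating 0·log(0) as 0 are all assumptions made for the example.

```python
import math
import random
from statistics import mean, median

def estimate_entropy(stream, eps=0.5, delta=0.25, seed=0):
    """Sketch of Table 1: sample g*z locations, count, then median of means."""
    m = len(stream)
    if m < 2:
        return 0.0  # a stream of fewer than two packets has zero entropy
    rng = random.Random(seed)

    # Pre-processing stage: equation (5), rounded up (assumption).
    z = math.ceil(32 * math.log2(m) / eps ** 2)
    g = math.ceil(2 * math.log2(1 / delta))
    starts = {}  # sampled location -> counter slots that start there
    for slot in range(g * z):
        starts.setdefault(rng.randrange(m), []).append(slot)

    # Online stage: update existing counters, then start new ones.
    counters = [0] * (g * z)
    active = {}  # flow ID -> counter slots tracking that flow ID
    for j, key in enumerate(stream):
        for slot in active.get(key, ()):
            counters[slot] += 1
        for slot in starts.get(j, ()):
            counters[slot] = 1
            active.setdefault(key, []).append(slot)

    # Post-processing stage: equation (6), group averages, median.
    def x(c):
        tail = (c - 1) * math.log2(c - 1) if c > 1 else 0.0  # 0*log(0) := 0
        return m * (c * math.log2(c) - tail)

    avgs = [mean([x(c) for c in counters[i * z:(i + 1) * z]]) for i in range(g)]
    s_hat = median(avgs)

    # Equation (4): convert the estimate of S into an entropy estimate.
    return math.log2(m) - s_hat / m
```

For example, `estimate_entropy([i % 4 for i in range(200)])` estimates the entropy of four equally loaded flows. Note that every matching packet updates all of that flow's counters, and that g × z can exceed the number of packets for short streams, which illustrates why the counter count and the update cost matter for a wire-speed implementation.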
The algorithm of Table 1 can be summarized in the simplified form shown in Table 2:
TABLE 2
A data stream φ = (a_1, a_2, ..., a_m) has m items, where the t-th item a_t(k) consists of a key k ∈ [n].
1: Choose a number y uniformly at random from {1, 2, ..., m};
2: Maintain a counter C = |{r : a_r(k) = a_y(k), y ≤ r ≤ m}|;
3: Output S = m(C log C − (C − 1) log(C − 1)).
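A short sketch may help show why the single estimator of Table 2 is useful: averaged over every possible choice of y, its output equals the exact S value of equation (3), because the terms for each flow telescope. The function name `single_estimate`, the sample stream, 0-based positions, and base-2 logarithms are assumptions of this example.

```python
import math

def single_estimate(stream, y):
    """Table 2 for a given sampled position y (0-based in this sketch)."""
    m = len(stream)
    # C counts the items from position y onward that share the key at y.
    c = sum(1 for item in stream[y:] if item == stream[y])
    tail = (c - 1) * math.log2(c - 1) if c > 1 else 0.0  # 0*log(0) := 0
    return m * (c * math.log2(c) - tail)

# Hypothetical stream with flow counts A: 3, B: 2, C: 1.
stream = ["A", "B", "A", "C", "A", "B"]

# Average of the estimator over all positions y.
average = sum(single_estimate(stream, y) for y in range(len(stream))) / len(stream)

# Exact S from equation (3).
exact_s = 3 * math.log2(3) + 2 * math.log2(2) + 1 * math.log2(1)

print(average, exact_s)
```

In a streaming setting y is drawn at random rather than enumerated, so a single output is noisy; this is why Table 1 averages z estimates per group and takes the median over g groups.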
In a high-speed network environment, a software implementation requires so much operation time that it cannot detect anomalies immediately. S. Nagalakshmi therefore uses FPGA hardware to implement the data streaming algorithm for estimating entropy disclosed by Ashwin Lall. S. Nagalakshmi suggests decreasing the number of counters and proves that the error rate of the calculation result remains within the predefined error range. Although S. Nagalakshmi uses 112 sets of calculation modules to process packets in parallel, much time is still spent on comparison and on memory access for updates. When a mass of packets arrives in a high-speed network environment, the approach disclosed by S. Nagalakshmi still cannot satisfy the wire-speed requirement, which increases the error rate of the estimated entropy.
Because the conventional systems and methods for estimating entropy cannot process mass packets in a high-speed network environment, a need has arisen for a novel scheme that can adaptively process mass packets in a high-speed network environment so as to reduce the hardware resources used and the operation time.