The present invention relates to the field of systems, methods, and apparatuses for classifying input data through application of an evolving network in which groupings within the network correspond to commonalities in the input data. The systems, methods and apparatuses of the present invention are applicable in a variety of areas including network management and intrusion detection, identification of patterns in large databases and files, classifying portions of strands of DNA, identifying patterns in graphs, and classifying particular types of network traffic based on common properties or characteristics.
Adaptive classification systems, including systems based on neural networks and self-organizing maps, are known in the art and have been applied to solve problems including network management and network intrusion detection. Such solutions, however, have largely required (i) normalization of the input data into numeric vectors based on predetermined selections of parameters thought to be relevant to a particular problem, or indicative of a particular characteristic, and center point calculations based on the normalized values, or (ii) pre-training of a neural network or similar system based on a known data set thought to be similar to the data set to be analyzed. Both sets of solutions have difficulty adapting to constantly changing data sets (such as network traffic) because the further the characteristics of the input data move away from those of the training data, the less effective the systems become. Such solutions, including those that require normalization of input data into numeric vectors and computation of center points, also impose computational overhead and a predetermined organization on the input data by virtue of the normalization process. This further compromises the solutions' ability to adapt to new, previously unseen patterns, and the overhead makes the systems unusable in applications like network analysis, in which large volumes of data must be analyzed in real time. In such applications, storing the data set for later analysis is impractical, both because of the size of the data set involved and because of the need for real-time identification of anomalies.
The present invention improves on such systems by utilizing a neural foam that classifies nodes without the need for pre-training. Embodiments of the present invention are also capable of classifying data based on an information distance without normalizing the input data into a numeric N-tuple. This results in a more flexible system that classifies input data in its raw form, thereby making it adaptable to a broader range of applications. It also eliminates the need to maintain an evolving center point, which assists in obtaining greater computational efficiency. Embodiments according to the present invention, including those based on information distance, are adapted to systems in which streams of data must be analyzed in real time, and continuous learning over time is required such that new patterns can be identified more quickly than is typically possible with systems that require pre-training or pre-selection of characteristics that are thought likely to be relevant to a particular problem.
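By way of illustration only, the following is a minimal sketch of how a distance between two raw inputs might be computed without first normalizing them into a numeric N-tuple. It uses the normalized compression distance, a well-known practical approximation of information distance, and is an assumption made for this example; the passage above does not specify which information distance an embodiment employs. The function names `compressed_size` and `ncd` and the sample inputs are hypothetical.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed form of the input."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two raw byte strings.

    Assumption for illustration: a general-purpose compressor stands in
    for the (uncomputable) ideal information distance.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical raw inputs: two similar request streams and one unrelated blob.
a = b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n" * 20
b = b"GET /about.html HTTP/1.1\r\nHost: example.com\r\n" * 20
rng = random.Random(0)  # fixed seed so the example is repeatable
c = bytes(rng.getrandbits(8) for _ in range(1000))

print("similar inputs:  ", ncd(a, b))
print("unrelated inputs:", ncd(a, c))
```

In this sketch the two similar request streams yield a smaller distance than the unrelated input, and no parameters of the data are selected in advance and no center point is maintained.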
As is discussed above, the present invention is adapted to a variety of applications, including network management and intrusion detection. Prior to the present invention, network management was typically handled in a modular fashion, where a software component or hardware device handled a designated operation. For example, network traffic is typically handled by routers, bridges, and hubs; firewalling is commonly handled by a software application; data access restrictions are commonly handled by a file managing component of an operating system; and e-mail filtering can be handled by an e-mail server routine. These modular network management tools usually utilize locally available information in their operation, where enforced policies are typically based upon one or more parameters relating to a request.
For example, file management systems usually require a data requesting source to identify itself by computer identifier and/or user identifier. The file management system then bases access rights upon the user identifier and/or computer identifier. In another example, an e-mail filtering program can analyze e-mail parameters and only deliver e-mail that passes previously established criteria. That is, e-mail can be denied if it comes from a suspect sending source, if its content contains key words or graphics that indicate that the e-mail is an unsolicited advertisement, or if the e-mail message fails to satisfy virus and malicious program detection algorithms.
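The conventional e-mail filtering described above can be sketched, under stated assumptions, as a fixed set of checks applied to the parameters of a single message. The sender list, key words, `Email` type, and `should_deliver` function below are hypothetical and serve only to show that each decision rests on locally available information about one request.

```python
from dataclasses import dataclass

# Hypothetical, previously established criteria.
SUSPECT_SENDERS = {"bulk@spam.example"}
AD_KEYWORDS = {"free offer", "act now"}


@dataclass
class Email:
    sender: str
    body: str
    passed_virus_scan: bool  # result of a separate, assumed scanning step


def should_deliver(msg: Email) -> bool:
    """Deliver only if the message passes every fixed criterion."""
    if msg.sender.lower() in SUSPECT_SENDERS:
        return False  # suspect sending source
    text = msg.body.lower()
    if any(keyword in text for keyword in AD_KEYWORDS):
        return False  # content suggests an unsolicited advertisement
    return msg.passed_virus_scan  # virus and malicious program check
```

Because the criteria are fixed in advance, a message that falls outside them, for example one from a sender not yet on the list, passes regardless of what other network components may have observed.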
Another conventional network management methodology relies upon establishing a fixed communication protocol relating to a particular function, where operational decisions can be dependent upon conditions established by the protocol. For example, the simple network management protocol (SNMP) establishes a standard for gathering statistical data about network traffic and the behavior of network components. SNMP defines communication devices as either agents or managers, where an agent provides networking information to a manager application running on a different computer. Similar message protocols and enterprise management protocols exist that define a standard and require external devices to adhere to that standard before operations are permitted.
Unfortunately, policies established by such network management solutions can be foiled easily. More specifically, traditional network management systems can be compromised by outside sources that have knowledge of low-level specifics relating to a system. That is, most complex systems have a number of discernible weak points (sometimes called exploits) that can be used to circumvent network policies that administrators attempt to implement. It is practically impossible to design network equipment that does not have some exploitable weaknesses. As soon as one weakness is patched, two or more new weaknesses are discovered and are available to be exploited. Further, each new hardware device, operating system, network protocol, and technology introduces its own new weaknesses.
Conventional network management solutions have thus failed to approach network management from a holistic perspective. A holistic approach would permit the decoupling of network policies from modularly defined protocols, devices, and software applications. Accordingly, data synergy achieved when combining network data from available network components has not been leveraged to enact network policies that cannot be easily circumvented. Network management is thus well suited to embodiments of the present invention, as such embodiments provide a holistic, incrementally learning data classification system that does not require pre-training and is capable of real-time, or near-real-time, analysis of network traffic.