1. Field of the Invention
This invention pertains in general to computerized machine learning and in particular to generating training data for use in machine learning.
2. Description of the Related Art
Databases are widespread in modern computing environments. Companies and other enterprises rely on databases to store both public and private data. Many enterprises provide publicly accessible interfaces to their databases. Malicious end-users can exploit such an interface to perform actions such as obtaining access to sensitive information. For example, in a Structured Query Language (SQL) injection attack, the attacker sends the database a specially crafted malicious query that can cause the database to reveal sensitive information or perform other malicious actions.
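By way of illustration only, the following minimal sketch shows how an SQL injection attack of the kind described above can arise when an interface builds a query by concatenating user input. The table layout, the sample data, and the function names `build_db` and `lookup_unsafe` are hypothetical and are not drawn from any particular system.

```python
import sqlite3


def build_db():
    """Create a small in-memory database holding sensitive values (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a-secret"), ("bob", "b-secret")],
    )
    return conn


def lookup_unsafe(conn, name):
    """Vulnerable interface: user input is pasted directly into the SQL text."""
    query = "SELECT secret FROM users WHERE name = '" + name + "'"
    return [row[0] for row in conn.execute(query)]


conn = build_db()

# A legitimate request returns only the requested user's row.
legit = lookup_unsafe(conn, "alice")

# A crafted input closes the string literal and appends a condition that is
# always true, so the query matches every row in the table.
injected = lookup_unsafe(conn, "x' OR '1'='1")
```

In this sketch the legitimate call yields a single row, while the crafted input causes the same interface to return every stored secret.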
A database intrusion detection system (DIDS) attempts to detect malicious queries. Typically, the DIDS is trained to distinguish between legitimate and anomalous queries using machine learning techniques. Machine learning is useful for training DIDSs and other security systems where the complexity of the incoming traffic frustrates attempts at manual specification of legitimate and anomalous patterns. Machine learning can also reduce classification errors such as false negatives and false positives.
Machine learning relies on training data, such as a set of training database queries, captured during data center operations. In traditional supervised machine learning, training data are marked as either legitimate or anomalous so the learning algorithm can correctly differentiate between the two types of activity. Where anomalous training sets are unavailable, as is often the case in security environments, the learning algorithm treats any significant deviation from the legitimate pattern as anomalous.
This approach rests on a strong assumption: that any activity represented in the captured training data is indeed normal and therefore legitimate. This assumption presents substantial security risks if, in fact, the training data are unknowingly tainted with anomalous activity. As data center complexity grows and attacker sophistication evolves, it is increasingly likely that any significant trace of data center activity captured for use as training data will be tainted to some degree. These latent or covert abnormalities are effectively “grandfathered” into the training data, creating a security risk when the training data are used for detection.
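The risk described above can be sketched, under simplifying assumptions, with a toy detector that learns only from captured traffic and treats any query structure it has not seen as anomalous. This is not the method of any actual DIDS; the helper names `skeleton`, `train`, and `is_anomalous`, the masking rules, and the sample queries are all hypothetical.

```python
import re


def skeleton(query):
    """Reduce a query to its structure by masking literal values (assumed heuristic)."""
    q = re.sub(r"'[^']*'", "?", query)  # mask quoted string literals
    q = re.sub(r"\b\d+\b", "?", q)      # mask numeric literals
    return re.sub(r"\s+", " ", q).strip().lower()


def train(captured_queries):
    """Learn the set of query structures present in the captured trace."""
    return {skeleton(q) for q in captured_queries}


def is_anomalous(model, query):
    """Flag any query whose structure was never seen during training."""
    return skeleton(query) not in model


clean_trace = [
    "SELECT secret FROM users WHERE name = 'alice'",
    "SELECT secret FROM users WHERE name = 'bob'",
]
attack = "SELECT secret FROM users WHERE name = 'x' OR '1'='1'"

# Trained on a clean trace, the detector flags the attack as a deviation.
clean_model = train(clean_trace)
flagged_by_clean = is_anomalous(clean_model, attack)

# If the same attack was already present in the captured trace, its structure
# is learned as "legitimate" and later occurrences pass undetected.
tainted_model = train(clean_trace + [attack])
flagged_by_tainted = is_anomalous(tainted_model, attack)
```

In this sketch the detector trained on the clean trace flags the attack, whereas the detector trained on the tainted trace accepts it, along with any later query of the same structure.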
Accordingly, there is a need in the art for a way to generate training data for machine learning that are less likely to contain data representing anomalous activities.