Data mining extracts patterns and knowledge from large amounts of data. As one data mining task, anomaly detection identifies items, events, and patterns in a data set whose occurrence is considered rare and unusual compared with the rest of the data. Thus, anomaly detection makes it possible to detect not only structural defects or errors in the data but also abnormal data points in the data set, which are possibly a sign of abuse of the data or intrusion into a database network. Correcting the defects or errors in the data set can improve the accuracy of the data set. Further, early detection of malicious activities can enable system analysts to respond to such behavior in a timely manner and allows them to either remove the data points or make suitable changes to ensure system operation. Anomaly detection is expected to help control manipulative malicious activities in fields such as social welfare, credit cards, transportation systems, Internet networks, and healthcare systems.
Several different anomaly detection techniques have been proposed to identify known and unknown rare events. For example, a technique for detecting malicious insiders by monitoring users' behavior and detecting two types of anomalous activities, blend-in anomalies and unusual change anomalies, is described in commonly-assigned U.S. Patent Application Publication No. 2015/0235152, pending, the disclosure of which is incorporated herein by reference. Further, a combination of suspicion indicators from multiple anomaly types, used to detect suspicious pharmacies from a large data set of pharmacy claims, is described in Eldardiry et al., Fraud Detection for Healthcare, In Proceedings of Knowledge, Discovery, and Data Mining (KDD) 2013 Workshop on Data Mining for Healthcare (DMH), Chicago, Ill., Aug. 11, 2013, the disclosure of which is incorporated herein by reference. Moreover, for information from multiple domains, an anomaly detection method that integrates multiple sources of activity data to detect insider threats is described in Eldardiry et al., Multi-Source Fusion for Anomaly Detection: Using Across-Domain and Across-Time Peer-Group Consistency Checks, Journal of Wireless Mobile Networks, Ubiquitous Computing, and Dependable Applications, Vol. 5(2), pp 39-58, June, 2014, the disclosure of which is incorporated herein by reference. However, a rare event is not necessarily malicious. For example, rare events can be caused by other factors that arise from normal activities and may therefore be false positives. Although the existing anomaly detection techniques provide opportunities for system analysts to review and reevaluate the rare events, casual observation by human analysts does not contribute to the overall improvement of the anomaly detection system.
Anomaly detection techniques can be broadly categorized into two types, a rule-based method and a statistical method. The rule-based method employs machine learning algorithms to identify predetermined patterns of anomalies and non-anomalies (normal) from the data set. Although the rule-based method can identify anomalies accurately and swiftly, the method cannot be adapted to identify unknown anomaly patterns that are not covered by the known anomaly rules. Thus, rule-based anomaly detection is vulnerable to new forms of rare patterns that can emerge over time. To identify a broad range of rare patterns, the statistical method has been used to discover rare patterns statistically. The statistical method analyzes the data set and discovers data points that do not conform to an expected pattern or to other items in the data set. Because the data points in a specific data set are compared under the assumption that most of them follow a normal pattern, and because the method lacks domain knowledge about anomalies, the data points identified as rare by the statistical method may include false positive anomalies.
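By way of a non-limiting illustration only, the contrast between the two types of methods can be sketched as follows. The sketch is not taken from any of the referenced techniques. The data values, the function names `rule_based_flags` and `statistical_flags`, the fixed limit of 400, and the cutoff of 3.5 are all hypothetical. A single fixed-threshold rule stands in for a set of learned rules, and a modified z-score based on the median absolute deviation stands in for the statistical method.

```python
from statistics import median

# Hypothetical transaction amounts; the values are made up for illustration.
amounts = [20, 22, 19, 21, 23, 20, 18, 22, 21, 95, 20, 500]


def rule_based_flags(values, limit=400):
    """Return indices matching one predetermined rule (amount above a limit).

    The rule and its limit are assumptions; patterns outside the rule
    are never flagged.
    """
    return [i for i, v in enumerate(values) if v > limit]


def statistical_flags(values, cutoff=3.5):
    """Return indices whose modified z-score exceeds the cutoff.

    Assumes most values follow a normal pattern, summarized here by the
    median and the median absolute deviation (MAD).
    """
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # Degenerate case: more than half the values are identical.
        return [i for i, v in enumerate(values) if v != center]
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - center) / mad > cutoff
    ]


print(rule_based_flags(amounts))   # [11]
print(statistical_flags(amounts))  # [9, 11]
```

In this sketch, the rule flags only the amount of 500 and misses the amount of 95, which falls outside the known rule. The statistical check flags both amounts, but without domain knowledge it cannot tell whether the amount of 95 is malicious or merely an unusual normal activity, so that data point may be a false positive.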
Therefore, there is a need for an approach to anomaly detection that accurately identifies both known and unknown anomalies and that reflects domain knowledge and expertise.