Approaches for predicting the value of a dependent response variable based the values of a set of independent predictor variables have been developed by practitioners in the art of the statistical analysis and data mining for a number of years. Also, a number of conventional approaches for modeling data have been developed. These known techniques require a set of restrictive assumptions about the data being modeled. These assumptions include, e.g., a lack of noise, statistical independence, time invariance, etc. Therefore, if the real data being modeled is dependant on certain factors which are contrary to the assumptions required for the accurate modeling by the conventional techniques, the results of the above-described conventional data modeling would not be accurate.
This is especially the case in the presence of temporal, non-stationary data. Indeed, no robust approach which considers such data has been widely used or accepted by those in the art of the statistical analysis. For a better understanding of the difficulties with the prior art approaches, temporal data and non-stationary data are described below.
Temporal data refers to data in which there exists a temporal relationship among data records which varies over time. This temporal relationship is relevant to the prediction of a dependent response variable. For example, the temporal data can be used to predict the future value of the equity prices, which would be based on the current and past values of a set of particular financial indicators. Indeed, if one believes in the importance of trends in the market, it is not enough to simply consider the current levels of these financial indicators, but also their relationships to the past levels.
In another example, the supermarket application may prefer to group certain items together based on the purchasers' buying patterns. In such scenarios, the temporal data currently used in such supermarket application is the data provided for each customer at the particular checkout, i.e., a single event. However, using the data at the checkout counter for a single customer does not take into consideration the past data for this customer (i.e., his or her previous purchases at the counter). In an example of an intrusion detection system, the use of the time-varying data is very important. For example, if a current login fails because the password was entered incorrectly, this system would not raise any flags to indicate that an unauthorized access into the system is being attempted. However, if the system continuously monitors the previous login attempts for each user, it can determine whether a predetermined number of failed logins occurred for the user, or if a particular is sequence of events occurred. This event may signify that an unauthorized access to the system is being attempted.
Non-stationary data refers to data in which the functional relationship between the predictor and response variables changes when moving from in-sample training data to out-of-sample test data either because of inherent changes in this relationship over time, or because of some external impact. For example, with a conventional network intrusion detection system, a predictive model of malicious network activity can be constructed based on, e.g., TCP/IP log files created on a particular network, such as the pattern formed from the previous intrusion attempts. However, intruders become more sophisticated in their attack scenarios, attack signatures will evolve. In addition, the conventional intrusion detection systems may not be usable for all conceivable current operating systems, much less for any future operating systems. An effective intrusion detection system must be able to take into consideration with these changes.
One of the main difficulties being faced by the conventional predicting engines is that the data is “multi-dimensional” which may lead to “over-fitting”. While it is possible to train the prediction system to make the predictions based on the previous data, it would be difficult for this system to make a prediction based on both new data and the data which was previously utilized to train the system. The conventional systems utilize predictor values for each category of the data so as to train themselves as described above. For example, if the prediction system intends to predict the performance of certain baseball teams, it would not only use the batting average of each player of the respective team, but also other variables such as hitting powers of the respective players, statistics of the team while playing at home, statistics of the team when it is playing away from home, injury statistics, age of the players, etc. Each of these variables has a prediction variable associated therewith. Using these prediction variables, it may be possible to train the system to predict the performance of a given baseball team.
However, the conventional systems and methods described above are not flexible enough to perform its predictions based on a new variable (e.g., the number of player leaving the team) and a new corresponding prediction variable being utilized for the analysis. In addition, it is highly unlikely that the data values being utilized by the conventional systems and methods, i.e., after the system has already been trained, is the same as or similar to the data of the respective prediction variables that were already stored during the training of this system. The above-described example illustrates what is known to those having ordinary skill in the art as “over-fitting”. As an example to illustrate this concept, the system may only be trained using training data (e.g., in-sample data) which can represent only 0.1% of the entire data that this system may be required to evaluate. Thereafter, the prediction model is built using this training data. However, when the system is subjected to the real or test data (e.g., out of sample data), there may be no correlation between the training data and the real or test data. This is because the system was only subjected to training using a small portion of the real/test data (e.g., 0.1%), and thus never seen most of the real or test data before.
There is a need to overcome the above-described deficiencies of the prior art systems, method and techniques. In particular, there is a need to provide a method and system for classifying data that is temporal and non-stationary.