In addition to being able to follow a set of pre-defined instructions, such as in a computer program, computers can also operate by collecting data and extrapolating from the data an appropriate action. Such behavior is commonly referred to as “machine learning” because the computing “machine,” rather than being simply instructed as to the appropriate action given a particular set of inputs, “learns” the appropriate action by collecting relevant data and extrapolating the appropriate action from the data. One common approach to machine learning is to use statistical algorithms that calculate various statistical properties of an observed set of data and then use the resulting statistical model to determine appropriate future actions.
Frequently, statistical machine learning is used to form predictions based on statistical similarities to previously observed data. For example, an electronic mail (email) program can classify emails as being “junk” based on the statistical similarities of the currently received emails to previously received emails that the user has already indicated were, or were not, junk. Similarly, electronic commerce software can classify transactions as being “legitimate” or “fraudulent” based on the statistical similarities of the current transaction to prior transactions that have been verified to be either legitimate or fraudulent.
One example of a statistical learning system is the Naïve Bayes Classifier (NBC). An NBC essentially counts features of the data on which the statistical model will be based. Thus, an NBC will represent its statistical model as a list of “counts,” with one count per “feature” per “class.” A “count,” as used in connection with NBCs, is a value indicating the number of times that a particular feature was associated with a particular class in the observed data set. A “feature,” as used in connection with NBCs, is an element of the data. Common features of email data, such as would be counted for purposes of identifying junk email, can include the sender, the sending host, the words of the title and the words of the email text itself. Finally, a “class,” as used in connection with NBCs, is one of the classifications which the NBC will assign to new data. Thus, in an email program, multiple classes could exist, including “junk,” “priority,” “work,” “personal” and the like. In electronic commerce, a binary classification could exist, such as classifying each transaction as either “legitimate” or “fraudulent.”