The subject matter discussed in the background section should not be assumed to be prior art merely as a result of its mention in the background section. Similarly, a problem mentioned in the background section or associated with the subject matter of the background section should not be assumed to have been previously recognized in the prior art. The subject matter in the background section merely represents different approaches, which in and of themselves may also correspond to implementations of the claimed technology.
The Naive Bayes Classifier (NBC) is a stalwart of the machine-learning community, often the first algorithm tried in a new arena, but also one with the inherent weakness of its hallmark assumption—that features are independent. Bayesian networks relax this assumption by encoding feature dependence in the structure of the network. This can work well in classification applications in which there is substantial dependence among certain sets of features, and such dependence is either known or is learnable from a sufficiently large training set. (In the latter case, one may use any of a number of structure-learning algorithms; they don't always work well because the structure-learning problem is very difficult.) Undirected graphical models are a different way to relax this assumption. Like Bayesian networks, they also require domain knowledge or a structure-learning procedure. Domain knowledge can be hand coded into the structure, but this only works when domain knowledge is available and even then hand coding is usually very laborious. Alternatively, structure-learning algorithms require extensive training data and present a computationally complex problem, in addition to the concern of over-fitting a model using a limited data set, thereby creating a model that predicts structures that are less likely to work well on data not seen during training. Undirected graphical models may also fail to work well when either the domain knowledge is wrong or when the instance of the structure-learning problem is either intrinsically hard or there is insufficient data to train it.
The technology disclosed relates to machine learning (ML) systems and methods for determining feature dependencies—methods which occupy a space in between NBC and Bayesian networks, while maintaining the basic framework of Bayesian Classification.
Big data systems now analyze large data sets in interesting ways. However, many times systems that implement big data approaches are heavily dependent on the expertise of the engineer who has considered the data set and its expected structure. The larger the number of features of a data set, sometimes called fields or attributes of a record, the more possibilities there are for analyzing combinations of features and feature values.
Accordingly, an opportunity arises to automatically analyze large data sets quickly and effectively. Examples of uses for the disclosed classification systems and methods include identifying fraudulent registrations, identifying purchase likelihood for contacts, and identifying feature dependencies to enhance an existing NBC application. The disclosed classification technology consistently and significantly outperforms NBC.