In machine learning, one important problem is how to represent an example using a set of properties (called features, or machine learning features) so that the representation captures the nature of the example. In spam detection, the example is a document, the classes used to characterize it are “spam” and “non-spam”, and the goal is to find features that distinguish spam from non-spam with good accuracy.
A typical approach represents a document by its constituent terms, as “bag-of-words” features. A modern machine learning toolbox such as Vowpal Wabbit receives the features and the pre-determined classes of the examples as input and produces a classifier that, given a new document, classifies it as either spam or non-spam.
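As a minimal sketch of this representation, the following turns a document into term counts and formats them as one training line. The tokenizer, the function names, and the sample text are assumptions made for illustration; the output line assumes Vowpal Wabbit's plain-text input format, with +1 standing for spam and -1 for non-spam.

```python
import re
from collections import Counter

def bag_of_words(document):
    """Map a document to term counts, discarding word order."""
    # Assumed tokenizer: lowercase runs of letters, digits, and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", document.lower())
    return Counter(tokens)

def to_vw_line(label, features):
    """Format one labeled example as 'label | term:count ...' (assumed VW text format)."""
    parts = [f"{term}:{count}" for term, count in sorted(features.items())]
    return f"{label} | " + " ".join(parts)

features = bag_of_words("Buy now! Buy cheap pills now.")
line = to_vw_line(1, features)  # 1 = spam in this sketch
print(line)
```

Word order is lost on purpose: the classifier only sees which terms occur and how often.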
It is intuitive that, by looking at the “bag-of-words” for a given document, a human can easily identify smoking guns of spam, such as sexual words, advertising terms, or the like, which help in judging whether the document is spam. A machine, however, simply treats the “bag-of-words” features as a set of independent facts and uses a mathematical function, such as a linear combination or an artificial neural network (called a “model” in machine learning), to summarize them into a single confidence score of being spam, with typical scores ranging from 0 (least confident) to 1 (most confident).
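The linear-combination case can be sketched as follows. The sketch assumes a logistic (sigmoid) function to squash the weighted sum into the range 0 to 1; the weights and the bias are hypothetical values chosen for illustration, not learned from data.

```python
import math

# Hypothetical per-term weights; a real model would learn these from labeled examples.
WEIGHTS = {"buy": 1.2, "cheap": 0.9, "pills": 1.5, "meeting": -1.0}

def spam_score(features, weights=WEIGHTS, bias=-2.0):
    """Combine term counts linearly, then squash the sum into (0, 1)."""
    # Each term contributes independently; terms without a weight contribute nothing.
    z = bias + sum(weights.get(term, 0.0) * count for term, count in features.items())
    return 1.0 / (1.0 + math.exp(-z))

spammy = spam_score({"buy": 2, "cheap": 1, "pills": 1})
benign = spam_score({"meeting": 1, "tomorrow": 1})
print(spammy, benign)
```

Each term adds its own contribution regardless of which other terms are present, which is what treating the features as independent facts amounts to in a linear model.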
Some features are relevant to spam detection and others are not. The latter, sometimes referred to as “irrelevant features”, make the model less accurate. A desirable quality is for a model to be robust, so that the presence of irrelevant features does not affect the output score. This is very challenging to achieve, because the majority of features are irrelevant to the problem. In practice, a user identifies irrelevant features either by training on a very large ground-truth dataset and filtering features according to how sensitive the model is to them, or by manually cherry-picking relevant features based on experience, which is possible because the bag-of-words model is interpretable.
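One simple reading of filtering by sensitivity, offered here only as an assumption, is to drop features whose learned weight is close to zero, since they barely move the score of a linear model. The weights and the threshold below are hypothetical.

```python
def filter_features(weights, threshold):
    """Keep only the features whose absolute weight reaches the threshold."""
    return {term: w for term, w in weights.items() if abs(w) >= threshold}

# Hypothetical weights, as if learned from a large labeled dataset.
learned = {"pills": 1.5, "buy": 1.2, "meeting": -1.0, "the": 0.01, "tuesday": -0.03}

kept = filter_features(learned, threshold=0.1)
print(sorted(kept))
```

The absolute value matters: a strongly negative weight such as the one for "meeting" is evidence against spam and is just as relevant as a strongly positive one.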
Another issue in developing a practical machine learning component is that the size of a bag-of-words model can grow without bound: each word has a unique entry in the model storing its parameters, and there may be at least tens of thousands of words if the language is English. After including bi-gram features (a feature set that treats each pair of neighboring words as a virtual word in order to capture context information), the model grows even larger, sometimes to a size that is unrealistic for online serving, where memory is tight.
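The growth can be illustrated with a toy corpus. The sketch below counts the distinct model entries needed for single words and for bi-grams; the three documents and the helper names are made up for illustration.

```python
def bigrams(tokens):
    """Join each pair of neighboring words into one virtual word."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

# Toy corpus: the same four words in three different orders.
docs = ["buy cheap pills now", "cheap pills buy now", "now buy pills cheap"]

unigram_vocab = set()
bigram_vocab = set()
for doc in docs:
    tokens = doc.split()
    unigram_vocab.update(tokens)
    bigram_vocab.update(bigrams(tokens))

# One model entry per distinct feature.
print(len(unigram_vocab), len(bigram_vocab))
```

Four words already produce eight distinct bi-grams here. In general a vocabulary of V words admits up to V × V ordered pairs, so a vocabulary of 50,000 words allows up to 2.5 billion possible bi-gram entries, although only a fraction of them occur in real text.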