In the field of machine learning, one goal is to separate data into classifications based on patterns. Supervised learning is one type of machine learning, in which a classification model is generated based on a data set comprising observations (each observation being a data point) that belongs to known classes, also referred to as labels. The classification model that is generated by supervised learning can then be applied to classify other data sets to predict the labels of the observations in those data sets, which are not known.
Generation of a classification model by supervised learning is based on using a specific type of model. Support vector machine (SVM) is one popular classification model. An SVM model is characterized by a decision boundary, which is a function that defines the boundary between observations that should be classified into different classes. A generalized SVM model may have a simple decision boundary, e.g., a linear decision boundary, which does not classify every observation in the training data set correctly. A less generalized SVM model may have a highly non-linear decision boundary, e.g., a very squiggly decision boundary, which classifies every observation in the training data set correctly. If a generalized SVM model consistently classifies data with an acceptable rate of classification accuracy over a range of different data sets, the SVM model is robust. Less generalized SVM models may perform inconsistently over a range of different data sets, because the decision boundary was generated so specifically to the training data set. These SVM models are not robust. The disclosed methods and systems are directed to generating robust SVM models based on removing outliers from the training data set.