Machine learning, a branch of artificial intelligence, includes the study of computing devices that are able to recognize patterns in data. One way to enable a computing device to recognize patterns in data (and to assign labels to tokens of data) is to train the computing device using a training set of data, commonly called a training corpus. The computing device analyzes the training set of data to create a classifier model, which can then be used to analyze other data sets.
A commonly used training set of data for machine learning is the Penn Treebank II corpus. This corpus includes news articles that are annotated with tags (labels) indicating the parts of speech (POS) of the words within the corpus. The labels provide the computing device with “correct answers” for the training set of data, which permits the generation of a classifier model during a training process.
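As a rough illustration of how labeled data can yield a classifier model, the following minimal sketch trains a most-frequent-tag baseline on a tiny hand-written corpus. The three sentences are invented for this example and are not taken from the Penn Treebank; only the tag names (DT, NN, VBZ) follow its conventions. The function names `train` and `tag` are likewise hypothetical, and a practical tagger would typically use a far larger corpus and a more capable model.

```python
from collections import Counter, defaultdict

# Toy training corpus, invented for illustration only.
# Tag names follow Penn Treebank conventions: DT = determiner,
# NN = singular noun, VBZ = third-person singular present verb.
TRAINING_CORPUS = [
    [("the", "DT"), ("guard", "NN"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("a", "DT"), ("dog", "NN"), ("sleeps", "VBZ")],
]


def train(corpus):
    """Build a baseline model: each word maps to its most frequent tag."""
    counts = defaultdict(Counter)  # word -> Counter of tags seen for it
    tag_totals = Counter()         # overall tag frequencies
    for sentence in corpus:
        for word, pos in sentence:
            counts[word.lower()][pos] += 1
            tag_totals[pos] += 1
    word_to_tag = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    # Words never seen in training fall back to the most common tag overall.
    default_tag = tag_totals.most_common(1)[0][0]
    return word_to_tag, default_tag


def tag(model, words):
    """Label each word using the trained model."""
    word_to_tag, default_tag = model
    return [(w, word_to_tag.get(w.lower(), default_tag)) for w in words]


if __name__ == "__main__":
    model = train(TRAINING_CORPUS)
    print(tag(model, ["A", "cat", "barks"]))
```

The annotated tags play the role of the “correct answers”: training only counts them, and the resulting lookup table is then applied to word sequences that were not in the training set.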