This specification relates to digital data processing and, in particular, part-of-speech tagging.
Semi-supervised learning (SSL) is the use of small amounts of labeled data together with relatively large amounts of unlabeled data to train predictors. In some cases, the labeled data is sufficient to provide reasonable accuracy on in-domain data, but performance on even closely related out-of-domain data may lag far behind. Annotating training data for every sub-domain of a varied domain, such as all Web text, can be impractical, which motivates the development of SSL techniques that can learn from unlabeled data to perform well across domains. An early SSL algorithm is self-training, in which a previously trained model is used to annotate unlabeled data, and the newly annotated data is then used to re-train the model. While self-training is widely used and can yield good results in some applications, it has no theoretical guarantees except under certain stringent conditions.
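By way of illustration only, the self-training loop described above can be sketched as follows. The nearest-centroid base model, the confidence measure, the `threshold` and `max_rounds` parameters, and the two-dimensional feature vectors are all assumptions chosen to keep the sketch short; they are not taken from this specification, and any base predictor that reports a confidence could be substituted.

```python
import math
from collections import defaultdict


def train_centroids(points, labels):
    """Fit a toy nearest-centroid model: one mean vector per label."""
    sums = {}
    counts = defaultdict(int)
    for p, y in zip(points, labels):
        if y not in sums:
            sums[y] = [0.0] * len(p)
        sums[y] = [s + v for s, v in zip(sums[y], p)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict(model, p):
    """Return (label, confidence) for point p.

    Confidence is 1 - d_nearest / d_second_nearest, so it is 0 when the
    two closest centroids are equally far away. Needs at least two labels.
    """
    dists = sorted((math.dist(p, c), y) for y, c in model.items())
    (d1, y1), (d2, _) = dists[0], dists[1]
    conf = 0.0 if d2 == 0 else 1.0 - d1 / d2
    return y1, conf


def self_train(labeled, unlabeled, threshold=0.5, max_rounds=10):
    """Repeatedly annotate confident unlabeled points and re-train."""
    points = [p for p, _ in labeled]
    labels = [y for _, y in labeled]
    pool = list(unlabeled)
    model = train_centroids(points, labels)
    for _ in range(max_rounds):
        remaining, added = [], 0
        for p in pool:
            y, conf = predict(model, p)
            if conf >= threshold:
                # Treat the model's own confident prediction as a label.
                points.append(p)
                labels.append(y)
                added += 1
            else:
                remaining.append(p)
        if added == 0:
            break  # nothing new was annotated, so re-training would not help
        pool = remaining
        model = train_centroids(points, labels)
    return model, pool


# Hypothetical data: feature vectors standing in for token features.
labeled = [((0.0, 0.0), "NOUN"), ((10.0, 10.0), "VERB")]
unlabeled = [(1.0, 1.0), (9.0, 9.0), (5.0, 5.0)]
model, still_unlabeled = self_train(labeled, unlabeled)
```

In this toy run the two points near the labeled examples are absorbed into the training set, while the equidistant point stays unlabeled because the model is never confident about it. The sketch also shows why guarantees are hard to come by: an early confident mistake is fed back into training as if it were a true label.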
Other SSL methods include co-training, transductive support vector machines (TSVM), and graph-based algorithms. A majority of SSL algorithms are computationally expensive; for example, solving a TSVM exactly is intractable. Thus there is a conflict between wanting to use SSL with large unlabeled data sets for best accuracy and being unable to do so because of computational complexity. Graph-based SSL algorithms are an important subclass of SSL techniques that have received attention recently because they can outperform other approaches and also scale easily to large problems. In these algorithms, the data (both labeled and unlabeled) is represented by vertices in a graph. Graph edges link vertices that are likely to have the same label. Edge weights govern how strongly the labels of the vertices linked by an edge should agree.
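As a minimal sketch of the graph-based setting described above, the following shows a generic iterative label propagation over a weighted graph. It is offered only as an illustration of how edge weights influence label agreement, not as the method of this specification; the function name `propagate_labels`, the edge-list input format, the fixed iteration count, and the example graph are all assumptions.

```python
def propagate_labels(edges, seeds, labels, iterations=50):
    """Spread seed labels over an undirected weighted graph.

    edges:  iterable of (u, v, weight) triples
    seeds:  dict mapping labeled vertices to their known label
    labels: list of all possible labels
    Returns a dict mapping every vertex to a label distribution.
    """
    # Build a symmetric adjacency map: vertex -> {neighbor: weight}.
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w

    # Labeled vertices start (and stay) at their known label;
    # unlabeled vertices start from a uniform distribution.
    dist = {}
    for v in adj:
        if v in seeds:
            dist[v] = {y: 1.0 if y == seeds[v] else 0.0 for y in labels}
        else:
            dist[v] = {y: 1.0 / len(labels) for y in labels}

    for _ in range(iterations):
        new = {}
        for v in adj:
            if v in seeds:
                new[v] = dist[v]  # clamp labeled vertices
                continue
            total = sum(adj[v].values())
            # Weighted average of neighbor distributions: a heavier edge
            # pulls this vertex harder toward that neighbor's labels.
            new[v] = {
                y: sum(w * dist[u][y] for u, w in adj[v].items()) / total
                for y in labels
            }
        dist = new
    return dist


# Hypothetical chain a - b - c - d, with a weak edge between c and d.
edges = [("a", "b", 1.0), ("b", "c", 1.0), ("c", "d", 0.1)]
seeds = {"a": "NOUN", "d": "VERB"}
result = propagate_labels(edges, seeds, ["NOUN", "VERB"])
best = {v: max(d, key=d.get) for v, d in result.items()}
```

In this example vertex c is adjacent to the VERB seed d, but because that edge has a low weight, c ends up favoring NOUN along with b. Each update touches only a vertex's immediate neighbors, which is one reason methods of this kind can be applied to large sparse graphs.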