Conventional labeling technology allows users to label content. The labels are used by a computer to differentiate among different types of content. The labeling technology may assign content to different groups. However, users may label content differently at different times, and different users may label data differently. This uncertainty in label application may impact labeling consistency and label quality.
For instance, conventional spam filters employ labeling technology to identify spam messages. The performance of conventional spam filters depends directly on label quality and label consistency. Spam filters may be trained from a large corpus of content (e.g., emails or web pages) labeled as spam or not spam. Poorly trained spam filters may admit unwanted spam or, worse yet, incorrectly classify important content as spam.
Improvements in label quality or label consistency may yield superior performance in spam filtering, product recommendation, prioritization, etc. Label quality is affected by factors such as the labeler's expertise or familiarity with the concept or data, the labeler's judgment ability and attentiveness during labeling, and the ambiguity and changing distribution of the content. The label quality may be particularly important in situations where data quantity is limited (e.g., when labels are expensive to obtain or when individuals are labeling data for their own purposes).
To improve label quality and consistency, label noise and concept drift should be managed. Label noise may be identified when several different labels are applied to the same content. Concept drift may be identified when quickly changing content requires several different labels. Label noise and concept drift are managed by technologies that provide set-based label judgments and temporally applied labels (e.g., by discarding or weighting information according to a moving window that changes as the underlying content changes).
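The two ideas in the preceding paragraph can be illustrated with a minimal Python sketch. It is not drawn from any particular labeling system: the function names `find_noisy_items` and `windowed_majority`, the (item, label) pair layout, and the use of a simple majority vote over a fixed-size window are all assumptions made for illustration. The first function flags content that received more than one distinct label (a possible sign of label noise), and the second discards judgments that fall outside a moving window of recent labels (one possible way to follow concept drift).

```python
from collections import Counter, defaultdict, deque


def find_noisy_items(labels):
    """Return the ids of items that received more than one distinct label.

    `labels` is an iterable of (item_id, label) pairs (hypothetical layout).
    """
    seen = defaultdict(set)
    for item_id, label in labels:
        seen[item_id].add(label)
    return {item_id for item_id, distinct in seen.items() if len(distinct) > 1}


def windowed_majority(labels, window_size):
    """Return the most common label among the last `window_size` judgments.

    `labels` is a non-empty, time-ordered sequence of labels for one concept.
    Older judgments fall out of the window, so the result follows the most
    recent labels rather than the full history.
    """
    window = deque(labels, maxlen=window_size)  # keeps only the newest entries
    return Counter(window).most_common(1)[0][0]
```

For example, given a history of five "not spam" judgments followed by three "spam" judgments, a window of three reflects the recent shift to "spam", while a window of eight still reports "not spam". Weighting recent labels more heavily instead of discarding older ones would be an alternative design under the same assumptions.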