As the costs of data storage have declined over the years, more and more data pertaining to a wide variety of applications can potentially be collected and analyzed using increasingly sophisticated machine learning algorithms. For example, a number of natural language processing (NLP) algorithms have been developed for analyzing and responding to text records, such as records of social media interactions, product support requests, medical status summaries and so on.
Supervised machine learning models for text analysis, including various types of models used for classification, require the observation records of a training data set to be labeled—that is, the “true” class or label to which a given record belongs has to be determined for all the records of the training data before the model can be trained to make predictions regarding previously unseen or unlabeled data. In scenarios in which the input data consists of unstructured text, as for example in an environment in which observation records include email messages, social media messages and the like, labeling the records can often be a laborious, time-consuming and expensive process. Subject matter experts may often have to participate in the labeling. In order to obtain a desired level of prediction quality, many modern-day machine learning models may need very large labeled training data sets—in some cases comprising hundreds of thousands or even millions of records.
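To make the role of labeled records concrete, the following is a minimal sketch of a supervised text classifier, here a multinomial naive Bayes model implemented with only the Python standard library. It is a generic illustration of training on (text, label) pairs and predicting labels for unseen text, not a description of any particular system contemplated herein; the class name, labels, and example records are all hypothetical.

```python
from collections import Counter, defaultdict
import math

class NaiveBayesTextClassifier:
    """Minimal multinomial naive Bayes over whitespace-separated tokens.

    Illustrative only: real text classifiers typically use richer
    tokenization, feature extraction and much larger training sets.
    """

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing                        # Laplace smoothing constant
        self.class_doc_counts = Counter()                 # label -> number of training records
        self.class_token_counts = defaultdict(Counter)    # label -> token frequencies
        self.vocabulary = set()
        self.total_docs = 0

    def train(self, labeled_records):
        """labeled_records: iterable of (text, label) pairs — the labeled training data."""
        for text, label in labeled_records:
            tokens = text.lower().split()
            self.class_doc_counts[label] += 1
            self.class_token_counts[label].update(tokens)
            self.vocabulary.update(tokens)
            self.total_docs += 1

    def predict(self, text):
        """Return the label with the highest (log-space) posterior score."""
        tokens = text.lower().split()
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, float("-inf")
        for label, doc_count in self.class_doc_counts.items():
            # log prior plus sum of smoothed log likelihoods for each token
            score = math.log(doc_count / self.total_docs)
            token_total = sum(self.class_token_counts[label].values())
            for tok in tokens:
                count = self.class_token_counts[label][tok]
                score += math.log((count + self.smoothing) /
                                  (token_total + self.smoothing * vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical labeled support-request records, as might be produced by
# the labor-intensive labeling process described above.
clf = NaiveBayesTextClassifier()
clf.train([
    ("cannot log in to my account", "login-issue"),
    ("password reset not working", "login-issue"),
    ("charged twice for my order", "billing"),
    ("refund has not arrived", "billing"),
])
prediction = clf.predict("my password is not working")  # -> "login-issue"
```

The sketch also hints at why large labeled data sets matter: with only four training records, the model can separate these two toy classes, but any phrasing not resembling the training text would be classified essentially at random.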
In many cases, the problems being solved using machine learning models which consume text as input may not be restricted to any particular natural language. For example, many organizations today have branches or offices in multiple countries with respective languages, or sell products in numerous countries; even within a single country, multiple languages may be spoken in respective geographical regions. Input data for classification of natural language records associated with a particular domain, such as emailed problem reports directed to a customer support organization, may be received in several different languages. Generating sufficient labeled training data in all the different languages which may have to be processed for a particular machine learning application may represent a substantial technical and logistical hurdle.
While embodiments are described herein by way of example for several embodiments and illustrative drawings, those skilled in the art will recognize that embodiments are not limited to the embodiments or drawings described. It should be understood that the drawings and detailed description thereto are not intended to limit embodiments to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope as defined by the appended claims. The headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description or the claims. As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include,” “including,” and “includes” mean including, but not limited to. When used in the claims, the term “or” is used as an inclusive or and not as an exclusive or. For example, the phrase “at least one of x, y, or z” means any one of x, y, and z, as well as any combination thereof.