Machine learning combines techniques from statistics and artificial intelligence to create algorithms that can learn from empirical data and generalize to solve problems in various domains, such as natural language processing, financial fraud detection, terrorism threat level detection, human health diagnosis, and the like. In recent years, a growing volume of raw data that can potentially be utilized for machine learning models has been collected from a large variety of sources, such as sensors of various kinds, web server logs, social media services, financial transaction records, security cameras, and the like.
Observation records collected for training machine learning models may include values of a number of different types of attributes, such as numeric attributes, binary or Boolean attributes, categorical attributes and text attributes. The sizes of the data sets used for many machine learning applications, such as deep learning applications, can become quite large. Some machine learning data sets may include values for dozens or hundreds of attributes, and some text attributes may in turn contain hundreds or even thousands of individual words or tokens. A given data set may contain millions of observation records. In general, the time and resources required for training a given predictive model may increase with the data set size.
In order to help train a model to predict values of a target attribute, metrics indicating statistical relationships, such as various measures of correlation, may sometimes be computed between input attributes and the target attribute. For some types of non-text attributes (e.g., numeric, binary, or categorical attributes), computing such metrics may be fairly straightforward, e.g., using pre-defined functions supported by various statistical software packages or tools. Using the metrics, data scientists or other users of machine learning systems may be able to distinguish between the particular non-text attributes which are superior candidates for inclusion as input parameters of a predictive model, and those non-text attributes which are not likely to be particularly helpful in predicting target attribute values. However, determining the relative predictive utility of text attributes may not be as straightforward, especially because values of a given text attribute may have very large (or widely varying) token counts, repeated tokens, and the like. As machine learning data sets incorporate more and more text-based data from social media applications, short message service (SMS) applications, e-mail and the like, the importance of identifying text attributes with superior predictive capabilities is only likely to increase.
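By way of illustration only, the following minimal Python sketch shows one way such a metric might be computed for non-text attributes. It uses the Pearson correlation coefficient as one possible measure of correlation; the attribute names (`amount`, `is_foreign`, `noise`), the target `is_fraud`, the toy values, and the helper functions `pearson` and `rank_attributes` are hypothetical and are not taken from any particular statistical package. For a binary attribute, the Pearson coefficient coincides with the point-biserial correlation; multi-valued categorical attributes would typically call for a different measure, which is not shown here.

```python
import math


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Assumption: a constant attribute is reported as 0 (no linear signal).
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_attributes(columns, target):
    """Rank attributes by absolute correlation with the target, strongest first."""
    scores = {name: pearson(values, target) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical toy data: four observation records, three non-text attributes.
columns = {
    "amount": [10, 20, 30, 40],   # numeric attribute
    "is_foreign": [0, 1, 1, 1],   # binary attribute
    "noise": [1, -1, -1, 1],      # numeric attribute unrelated to the target
}
is_fraud = [0, 0, 1, 1]           # binary target attribute

for name, r in rank_attributes(columns, is_fraud):
    print(f"{name}: {r:+.3f}")
```

In this sketch, attributes near the top of the ranking would be the stronger candidates for inclusion as model inputs. No comparably direct computation applies to a raw text attribute, which is the difficulty noted above.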
While embodiments are described herein by way of example for several embodiments and illustrative drawings, those skilled in the art will recognize that embodiments are not limited to the embodiments or drawings described. It should be understood that the drawings and detailed description thereto are not intended to limit embodiments to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope as defined by the appended claims. The headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description or the claims. As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include,” “including,” and “includes” mean including, but not limited to. When used in the claims, the term “or” is used as an inclusive or and not as an exclusive or. For example, the phrase “at least one of x, y, or z” means any one of x, y, and z, as well as any combination thereof.