A preliminary step in almost all data processing systems is preparation of input data for analysis. The type and extent of preparation generally depend on the particular application, but preparation normally includes various content transformation steps, which correct, augment, or transform individual elements of each data record. For example, numerical data may be rounded to a fixed decimal format, dates may be standardized to a particular date format, and percentages may be expressed as decimals. In information retrieval systems, as another example, where text documents are indexed for searching, it is conventional to apply content transformation rules prior to constructing an index. These rules normally include tokenization, stemming, case normalization (case folding), aliasing, correction of misspelled words, and expansion of contractions. After these transformation rules have been applied to a document (or set of documents), indexing and other operations can be performed.
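The record-level transformations mentioned above (rounding, date standardization, and expressing percentages as decimals) can be illustrated with a minimal sketch. The field names `amount`, `opened`, and `rate`, the two-decimal precision, and the MM/DD/YYYY input format are all assumptions chosen for illustration; they do not come from any particular system.

```python
from datetime import datetime

def normalize_record(record):
    """Apply three illustrative content transformations to one data record."""
    out = dict(record)
    # Round a numerical field to a fixed two-decimal format (assumed precision).
    out["amount"] = round(float(record["amount"]), 2)
    # Standardize a date from an assumed MM/DD/YYYY input to ISO 8601.
    out["opened"] = datetime.strptime(record["opened"], "%m/%d/%Y").date().isoformat()
    # Express a percentage string such as "7.5%" as a decimal fraction.
    out["rate"] = float(record["rate"].rstrip("%")) / 100.0
    return out

# Hypothetical input record with all fields supplied as raw strings.
record = {"amount": "1520.4567", "opened": "03/09/2004", "rate": "7.5%"}
clean = normalize_record(record)
print(clean)
```

Each transformation here acts on a single element of the record, which is the sense in which the text speaks of correcting, augmenting, or transforming individual elements.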
The selection of which content transformation rules to apply is generally made heuristically by the system designer based on the particular desired outcome. Typically, transformation rules for unstructured content are applied (a) as a group (for example, all selected stemming rules are applied to all content) and (b) independently of any specific outcome or measure.
Predictive modeling systems attempt to predict an outcome of some future event based on a given set of inputs. Typically, the inputs are various numerical measures of the entities along dimensions that are relevant to the desired outcome. The outcome can be binary, numerical, or represent a class or category. For example, a predictive modeling system may be used to predict a binary outcome of whether loan holders are likely to default on their loans, in which case the inputs are typically such measures as the amount of the loan, interest rate, credit score, number of late payments, and other numerical measures.
In predictive modeling systems that attempt to predict human behavior, the use of structured numerical inputs has tended to dominate. However, recent developments in predictive systems have sought to use unstructured textual information as an input to the predictive system. For example, in predicting potential loan defaults, it may be beneficial to include text-based information such as emails received from loan holders, notes taken by customer service agents who have contacted a loan holder, or messages left by a loan holder on a bank's voice mail system (which can then be converted to text). Each of these sources of textual information may provide information that can improve the effectiveness of the predictive model. Another example of predictive modeling would be classification of a customer's potential profitability based not merely on their purchase history, but also on textual information extracted from emails, letters, telephone conversations, and the like.
When applying textual information as an input to a predictive modeling system, it is necessary to prepare the text by applying standard data preparation and content transformation rules, such as those mentioned above. For example, the raw text of customer emails is first tokenized into individual word units called tokens. After tokenization, the tokens may be corrected for spelling, stemmed, and normalized via a thesaurus. Once transformed, the tokens from each customer email are represented as an input to the predictive model, using any of a variety of indexing, vectorization, or other representation schemes.
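The preparation sequence just described (tokenize, correct spelling, stem, normalize via a thesaurus, then vectorize) might be sketched as follows. This is an illustration under stated assumptions only: the `SPELLING` and `THESAURUS` lookup tables are hypothetical, the suffix-stripping `stem` function is a crude stand-in for a real stemmer, and a simple token-count vector is just one of the many possible representation schemes.

```python
import re
from collections import Counter

# Hypothetical lookup tables; a real system would use far larger resources.
SPELLING = {"paymnts": "payments"}
THESAURUS = {"installment": "payment"}

def tokenize(text):
    """Split raw text into lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def stem(token):
    """Crude suffix stripper for illustration (not a real stemming algorithm)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def prepare(text):
    """Tokenize, spell-correct, stem, alias, and count tokens, in that order."""
    tokens = tokenize(text)
    tokens = [SPELLING.get(t, t) for t in tokens]   # spelling correction
    tokens = [stem(t) for t in tokens]              # stemming
    tokens = [THESAURUS.get(t, t) for t in tokens]  # thesaurus normalization
    return Counter(tokens)                          # bag-of-words vector

vec = prepare("I missed two paymnts and the installments are late.")
print(vec)
```

Note that the order of the steps matters in this sketch: the thesaurus keys are stemmed forms, so applying the thesaurus before stemming would leave "installments" unmapped.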
The problem with this approach is that the choice of which content transformation rules to apply (and their order of application) is conventionally made without regard to their potential impact on the effectiveness of the predictive model. However, predictive modeling systems are often sensitive to subtle variations in the input data, and thus the application of a set of content transformation rules may itself influence the effectiveness of the model. Conventional approaches that assume that particular content transformation rules (e.g., stemming) are always appropriate thus fail to recognize the impact of such rules on the quality of the predictive model. In particular, arbitrary application of content transformation rules may result in a loss of predictive power by masking information that is predictively relevant.
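The masking effect can be made concrete with a small, entirely hypothetical example. In the toy notes below (invented for illustration, with label 1 meaning the loan later defaulted), the tokens "missed" and "missing" each occur in only one outcome class, so they carry predictive signal. A crude suffix-stripping `stem` function, used here as a stand-in for a real stemmer, collapses both into "miss", which then occurs in both classes.

```python
def stem(token):
    """Crude suffix stripper for illustration (not a real stemming algorithm)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

# Hypothetical labeled notes: 1 = loan later defaulted, 0 = did not.
notes = [
    ("customer missed payment", 1),
    ("customer missed payment again", 1),
    ("customer asked about missing statement", 0),
    ("customer reported missing card", 0),
]

def classes_by_token(notes, transform):
    """Map each transformed token to the set of outcome labels it appears with."""
    seen = {}
    for text, label in notes:
        for token in text.split():
            seen.setdefault(transform(token), set()).add(label)
    return seen

raw = classes_by_token(notes, lambda t: t)  # no transformation applied
stemmed = classes_by_token(notes, stem)     # stemming applied to every token

print(raw["missed"], raw["missing"])  # each token is tied to a single class
print(stemmed["miss"])                # the stemmed token spans both classes
```

On this toy data the stemming rule removes exactly the distinction that separated the two outcomes, which is the kind of loss the paragraph above describes.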
Accordingly, it is desirable to selectively determine which content transformation rules to apply to input data in predictive modeling systems based on the rules' likelihood of improving the predictive model on new data.