This invention relates generally to data modeling, and more particularly to determining whether a variable is numeric or non-numeric.
Data analysis has become an important tool in solving complex and large real-world computerizable problems. For example, a web site such as msnbc.com has many stories available on any given day or month. The operators of such a web site may desire to know whether there are any commonalities associated with the readership of a given set of stories. That is, if a hypothetical user reads one given story, can it be said with any probability that the user is likely to read another given story? Answering this type of inquiry allows the operators of the web site to better organize their site, for example, which may in turn yield increased readership.
For problems such as these, data analysts frequently turn to advanced statistical tools and models to analyze the data. Data analysis, defined broadly, is the process by which knowledge, models, patterns, decision policies, and/or insight are gained from data. Specific examples of data analysis include pattern recognition, data modeling, and data mining. Other specific applications include: predicting what products a person will want to buy given what is already in his or her shopping basket; predicting what ads a person will click on given what other ads he or she has clicked on, what web pages he or she has read, and/or his or her demographics; and predicting what television shows a person will want to watch based on other television shows he or she has watched, and/or his or her demographics. Still other specific applications are listed in the detailed description of the invention.
Generally, data analysis includes three main phases: a problem formulation phase, a model fitting/model selection phase, and a model understanding/visualization phase. Usually, each of these phases is iterated through until the desired knowledge, models, patterns, or insights are achieved. The models or patterns obtained often are then used for prediction for other data.
However, whereas there are many automated or computerized techniques for model fitting/model selection and a few useful automated techniques for explaining statistical models, methods for problem formulation generally are performed by a human, the data analyst. In this phase, the data analyst takes a rough look at the data and uses his or her common sense to form a statistical model or set of models that is then used to fit the data.
For example, a data analyst may be given a set of web-transaction logs from a news site and asked, "use this data to predict what ads a user is most likely to click through." The problem-formulation phase may proceed as follows. First, the analyst looks at the logs and may recognize that the news stories a user reads (information available in the logs) can be useful for predicting what ads a user will click through. The analyst then decides whether news stories themselves are good predictors, or whether it is better to use news-story categories to predict ad click through. He or she then decides which set of news stories or news categories are worth including in the model, since the inclusion of all stories is impractical.
The result of these decisions is a list of variables to include in the model. Next, the data analyst decides how to model each variable. Although the number of times a story is read is available in the web-transaction logs, the data analyst may decide to model only whether or not a user reads the story. Another alternative may be to retain the more detailed information, in which case the data analyst has to decide whether to model this quantity with a Gaussian distribution or some more complicated distribution, for instance. Finally, the data analyst may decide to model the relationships between stories read and ads clicked using a Bayesian network.
There are disadvantages associated with requiring a data analyst to perform the problem formulation phase. The amount of data that is available for analysis is increasing at an exponential rate, but there are a limited number of statisticians/data analysts available who can analyze this data, thus limiting how often statistical models can be utilized for data analysis. The process of problem formulation is itself difficult to automate because so much human knowledge is typically brought to bear on a particular problem. In the above example, for instance, a computer would generally not know that stories read may be predictive of ad click through, because both are related to the underlying "personality type" of the user. A computer would also typically not know that modeling a story variable as binary rather than numeric is appropriate.
For these and other reasons, there is a need for the present invention.
The invention relates to automated data analysis. In one embodiment, relating to an architecture for automated data analysis, a computerized system comprises an automated problem formulation layer, a first learning engine, and a second learning engine. The automated problem formulation layer receives a data set. The data set has a plurality of records, where each record has a value for each of a plurality of raw transactional variables (as is defined later in the application). The layer abstracts the raw transactional variables into cooked transactional variables. The first learning engine generates a model for the cooked transactional variables, while the second learning engine generates a model for the raw transactional variables.
In an embodiment relating to feature abstraction, a data set is input that has a plurality of records, where each record has a value for each of a plurality of raw transactional variables. These variables are organized into a hierarchy of nodes. The raw transactional variables are abstracted into a lesser number of cooked transactional variables, and the cooked transactional variables are output.
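The feature-abstraction step summarized above can be illustrated with a minimal sketch. It is not the method of the invention. The function name `abstract_variables`, the `parent` mapping that stands in for the hierarchy of nodes, and the `min_support` roll-up rule are all assumptions introduced only for illustration; the summary does not state how the abstraction is chosen.

```python
from collections import defaultdict

def abstract_variables(records, parent, min_support):
    """Roll rarely observed raw variables up into their parent node.

    records: list of dicts mapping raw variable name -> numeric value.
    parent: dict mapping raw variable name -> parent node name
            (a hypothetical one-level stand-in for the hierarchy).
    min_support: assumed threshold; a raw variable that is nonzero in
            fewer records than this is replaced by its parent.
    """
    # Count the records in which each raw variable is nonzero.
    support = defaultdict(int)
    for rec in records:
        for var, value in rec.items():
            if value:
                support[var] += 1

    def cooked_name(var):
        # Keep well-supported variables; abstract the rest to the parent.
        return var if support[var] >= min_support else parent.get(var, var)

    cooked = []
    for rec in records:
        row = defaultdict(int)
        for var, value in rec.items():
            row[cooked_name(var)] += value
        cooked.append(dict(row))
    return cooked
```

With three raw story variables, two of which are rarely read and share a parent category, the sketch yields two cooked variables per record: the frequently read story and the shared category.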
In an embodiment relating to creation of a model for raw variables from a model for cooked variables and raw data, a first data model for a plurality of cooked transactional variables is input. The cooked transactional variables have been abstracted from raw transactional variables, where the latter variables are based on a data set comprising a plurality of records, each record having a value for each raw transactional variable. A type of the first model is determined, and a second data model, for the plurality of raw transactional variables, is generated based on the first data model and the type of the first data model. The second data model is then output.
In an embodiment relating to determining whether a variable is numeric or non-numeric, a variable is input having a plurality of values, where each value has a count. The variable is determined to be numeric or non-numeric by assessing closeness of counts for adjacent values of the variable. Whether the variable is numeric or non-numeric is then output.
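One way to picture "closeness of counts for adjacent values" is the following minimal sketch. It is an assumed heuristic, not the test of the invention. It compares the mean gap between counts of neighboring values (in sorted order) with the mean gap between counts of all pairs of values. The names `adjacent_closeness` and `looks_numeric` and the `ratio_threshold` default are hypothetical.

```python
from itertools import combinations

def adjacent_closeness(counts_by_value):
    """Return (mean adjacent gap, mean all-pairs gap) between counts.

    counts_by_value: dict mapping each value of the variable to its count.
    Values are ordered by sorting the dict keys.
    """
    ordered = [counts_by_value[v] for v in sorted(counts_by_value)]
    if len(ordered) < 3:
        raise ValueError("need at least three distinct values")
    adjacent = [abs(a - b) for a, b in zip(ordered, ordered[1:])]
    all_pairs = [abs(a - b) for a, b in combinations(ordered, 2)]
    return sum(adjacent) / len(adjacent), sum(all_pairs) / len(all_pairs)

def looks_numeric(counts_by_value, ratio_threshold=0.75):
    """Assumed rule: numeric if neighboring counts are closer than arbitrary pairs."""
    adj, base = adjacent_closeness(counts_by_value)
    if base == 0:
        # All counts are equal, so the ordering carries no information.
        return False
    return adj / base < ratio_threshold
```

The intuition behind the sketch is that when the ordering of values is meaningful, counts tend to change smoothly from one value to the next, whereas for arbitrary codes the counts of neighboring values are no closer than those of any other pair.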
Finally, in an embodiment relating to determining whether a numeric variable has a Gaussian or a log-Gaussian distribution, a data set is first input. The data set has a plurality of records. Each record has a value for each of a plurality of raw non-transactional variables. The plurality of raw non-transactional variables includes a numeric variable. It is determined whether a Gaussian or a log-Gaussian distribution better predicts the numeric variable, based on the plurality of records. This determination is then output.
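The comparison described above can be sketched under one stated assumption: that "better predicts" is measured by the log-likelihood of held-out records under each fitted distribution. The summary does not specify the criterion, so the split into `train` and `test`, and the function names below, are hypothetical.

```python
import math

def fit_gaussian(xs):
    """Maximum-likelihood mean and standard deviation."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, math.sqrt(var)

def gaussian_loglik(xs, mu, sigma):
    """Sum of Gaussian log-densities of xs."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2)
        - (x - mu) ** 2 / (2 * sigma ** 2)
        for x in xs
    )

def better_distribution(train, test):
    """Return 'gaussian' or 'log-gaussian', whichever scores higher on test."""
    if min(train + test) <= 0:
        # A log-Gaussian is only defined for strictly positive values.
        return "gaussian"
    mu, sigma = fit_gaussian(train)
    log_train = [math.log(x) for x in train]
    log_test = [math.log(x) for x in test]
    log_mu, log_sigma = fit_gaussian(log_train)
    if sigma == 0 or log_sigma == 0:
        raise ValueError("training values must not all be identical")
    gauss = gaussian_loglik(test, mu, sigma)
    # Change of variables: log-Gaussian density of x is N(log x) / x.
    log_gauss = gaussian_loglik(log_test, log_mu, log_sigma) - sum(log_test)
    return "log-gaussian" if log_gauss > gauss else "gaussian"
```

The subtraction of `sum(log_test)` matters: without the change-of-variables term, the two scores would be densities over different scales and could not be compared.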
Embodiments of the invention provide for automated data analysis, and thus provide for advantages over the prior art. Automated data analysis is useful because data analysts are not needed to perform the data analysis, specifically the problem formulation phase of the analysis process. This makes data analysis more useful because it opens up data analysis to be used in more situations, for example, where a data analyst may be too expensive or not available.