Data analysis has become an important tool in solving complex and large real-world computerizable problems. For example, a web site such as www.msnbc.com has many stories available on any given day or month. The operators of such a web site may desire to know whether there are any commonalties associated with the viewership of a given set of programs. That is, if a hypothetical user reads one given story, can with any probability it be said that the user is likely to read another given story. Yielding the answer to this type of inquiry allows the operators of the web site to better organize their site, for example, which may in turn yield increased readership.
For problems such as these, data analysts frequently turn to advanced statistical tools and models, to analyze the data. Data analysis, defined broadly, is the process by which knowledge, models, patterns, decision policies, and/or insight are gained from data. Specific examples of data analysis include pattern recognition, data modeling, and data mining. Other specific applications include: predicting what products a person will want to buy given what is already in his or her shopping basket; predicting what ads a person will click on given what other ads he or she has clicked on, what web pages he or she has read, and/or his or her demographics; predicting what television shows a person will want to watch based on other television shows he or she has watched, and/or his or her demographics. Still other specific applications are listed in the detailed description of the invention.
Generally, data analysis includes three main phases: a problem formulation phase, a model fitting/model selection phase, and a model understanding/visualization phase. Usually, each of these phases is iterated through until the desired knowledge, models, patterns, or insights are achieved. The models or patterns obtained often are then used for prediction for other data.
However, whereas there are many automated or computerized techniques for model fitting/model selection and a few useful automated techniques for explaining statistical models, methods for problem formulation generally are performed by a human, the data analyst. In this phase, the data analyst takes a rough look at the data and uses his or her common sense to form a statistical model or set of models that is then used to fit the data.
For example, a data analyst may be given a set of web-transaction logs from a news site and asked "use this data to predict what ads a user is most likely to click through". The problem-formulation phase may proceed as follows. First, the analyst looks at the logs and may recognize that the news stories a user reads (information available in the logs) can be useful for predicting what ads a user will click through. The analyst then decides whether news stories themselves are good predictors, or whether it is better to use news-story categories to predict ad click through. He or she then decides which set of news stories or news categories are worth including in the model, since the inclusion of all stories is impractical.
The result of these decisions is a list of variables to include in the model. Next, the data analyst decides how to model each variable. Although the number of times a story is read is available in the web-transaction logs, the data analyst may decide to model only whether or not a user reads the story. Another alternative may be to retain the more detailed information, in which case the data analyst has to decide whether to model this quantity with a Gaussian distribution or some more complicated distribution, for instance. Finally, the data analyst may decide to model the relationships between stories read and ads clicked using a Bayesian network.
There are disadvantages associated with having to have a data analyst perform the problem formulation phase. The amount of data that is available for analysis is increasing at an exponential rate, but there are a limited number of statisticians/data analysts available who can analyze this data, thus limiting how often statistical models can be utilizes for data analysis. The process of problem formulation is itself difficult to automate because so much human knowledge is typical brought to bear on a particular problem. In the above example, for instance, a computer would generally not know that stories read may be predictive of ad click through, because both are related to the underlying "personality type" of the user. A computer would also typically not know to model a story variable as binary rather than numeric is appropriate.
For these and other reasons, there is a need for the present invention.