Multi-dimensional, large-scale datasets are found in diverse subjects, including gene expression data for uncovering the link between the human genome and the various proteins for which it codes; demographic and consumer profiling data for capturing underlying sociological and economic trends; sales and marketing data for huge numbers of products in vast and ever-changing marketplaces; and environmental data for understanding phenomena such as pollution, meteorological changes and resource impact issues.
One challenge for many users dealing with these datasets is how to extract the meaning from the data they contain: to discover structure, find patterns, and derive causal relationships. Very often, the sheer size and complexity of these datasets make it impossible for an analyst to directly glean any meaning from the data without employing some additional operations, such as regression, clustering, summarization, dependency modeling, and classification.
FIG. 1 is a prior art screenshot displaying a portion of a commercial dataset related to the sales and marketing activities of a soft drink company using Microsoft Excel. This dataset has dozens of data fields with different data types. For example, the data type of the “Date” field is time; the data type of data fields like “Market”, “State”, and “Market Size” is text; and the data type of data fields like “Sales”, “Profit”, and “Margin” is numeric. Many important items of information are embedded in the raw data, e.g., the most popular product in a state within a specific time period or the least profitable product from a marketing perspective. But it is quite difficult to obtain any of them directly from the raw data.
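By way of illustration only, the following minimal sketch shows the kind of aggregation needed to recover such items from raw rows. The field names “State”, “Sales”, and “Profit” follow FIG. 1; the “Product” field, the helper function names, and all row values are assumptions invented for this example and are not taken from the dataset of FIG. 1.

```python
from collections import defaultdict

# Hypothetical rows: the values (and the "Product" field) are invented for illustration.
rows = [
    {"Date": "2010-01", "State": "Colorado", "Product": "Cola", "Sales": 120.0, "Profit": 30.0},
    {"Date": "2010-01", "State": "Colorado", "Product": "Lemonade", "Sales": 80.0, "Profit": 10.0},
    {"Date": "2010-02", "State": "Colorado", "Product": "Cola", "Sales": 100.0, "Profit": 25.0},
    {"Date": "2010-02", "State": "Colorado", "Product": "Lemonade", "Sales": 150.0, "Profit": 40.0},
    {"Date": "2010-01", "State": "Texas", "Product": "Cola", "Sales": 200.0, "Profit": 50.0},
    {"Date": "2010-01", "State": "Texas", "Product": "Lemonade", "Sales": 90.0, "Profit": 5.0},
]


def top_product_by_state(rows, measure="Sales"):
    """Return, for each state, the product with the largest total of `measure`."""
    totals = defaultdict(lambda: defaultdict(float))
    for row in rows:
        totals[row["State"]][row["Product"]] += row[measure]
    return {state: max(products, key=products.get) for state, products in totals.items()}


def least_profitable_product(rows):
    """Return the product with the smallest total profit across all rows."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["Product"]] += row["Profit"]
    return min(totals, key=totals.get)


most_popular = top_product_by_state(rows)
least_profitable = least_profitable_product(rows)
```

Even for this toy example, each question requires grouping and summing over a different combination of fields, which suggests why such items are hard to read directly from a spreadsheet having dozens of fields.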
In this regard, data visualization and statistical modeling are powerful tools for helping analysts explore large datasets. Data visualization can represent a dataset, or a portion of it, in a form suited to an analyst's interest. For example, the analyst can gain insight into the company's marketing effort from a curve representing the relationship between the “Sales” and the “Marketing” data fields. In many instances, however, the mere visualization of raw data is not enough. Statistical modeling is often invoked to generate an analytical or numerical model from raw data. Statistical models can be used to predict values, e.g., through interpolation or extrapolation. Statistical models can also be used to test alternative hypotheses, and such hypothesis tests are widely used to confirm findings. By visualizing the model, the analyst can more readily discover market trends and, from analyzing the visualized model, make informed business decisions.
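As a hedged illustration of predicting values from a model, the sketch below fits a straight line relating “Marketing” to “Sales” by ordinary least squares and then evaluates it inside and outside the observed range. The numeric values, the function name `fit_line`, and the choice of a straight-line model are assumptions made for this example only.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations, invented for illustration.
marketing = [10.0, 20.0, 30.0, 40.0]
sales = [27.0, 44.0, 66.0, 83.0]

slope, intercept = fit_line(marketing, sales)


def predict(x):
    """Predicted sales for a given marketing spend under the fitted line."""
    return slope * x + intercept


interpolated = predict(25.0)  # inside the observed range of marketing values
extrapolated = predict(60.0)  # outside the observed range; generally less reliable
```

Plotting the fitted line over the raw points is one way the model itself, rather than only the raw data, may be visualized.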
A widely used type of statistical model is a linear model. Linear models relate a response variable to various quantitative and categorical factors using linear coefficients. A specific example of a linear model is linear regression, in which a y value is predicted from an x value. A special case of linear models is analysis of variance (ANOVA). In analysis of variance, mean values are predicted using categorical factors. For example, the mean response to a drug may depend on the sex and age group of the patient.
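The analysis of variance described above can be sketched as follows for the simplest case of a single categorical factor (one-way ANOVA). The age-group labels, the response values, and the function name `one_way_anova` are hypothetical, and a two-factor analysis (e.g., sex and age group together) would require additional terms not shown here.

```python
def one_way_anova(groups):
    """One-way ANOVA.

    `groups` maps each factor level to a list of responses.
    Returns (group_means, f_statistic), where each group mean is the
    value the model predicts for that factor level.
    """
    all_values = [v for values in groups.values() for v in values]
    grand_mean = sum(all_values) / len(all_values)
    means = {level: sum(values) / len(values) for level, values in groups.items()}

    # Variation explained by the factor (between groups).
    ss_between = sum(
        len(values) * (means[level] - grand_mean) ** 2
        for level, values in groups.items()
    )
    # Residual variation (within groups).
    ss_within = sum(
        (v - means[level]) ** 2
        for level, values in groups.items()
        for v in values
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_statistic = (ss_between / df_between) / (ss_within / df_within)
    return means, f_statistic


# Hypothetical drug responses by age group, invented for illustration.
responses = {
    "young": [5.0, 6.0, 7.0],
    "middle": [7.0, 8.0, 9.0],
    "older": [10.0, 11.0, 12.0],
}

group_means, f_stat = one_way_anova(responses)
```

A large F statistic relative to the F distribution with the corresponding degrees of freedom indicates that the factor accounts for more variation in the response than would be expected by chance.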
The conventional manner of generating and visualizing models from multi-dimensional datasets often requires significant human-computer interaction. A user must be familiar with the characteristics of the dataset and must also provide detailed computer instructions to generate the models and visualizations. In many situations, the user may have to repeat the process several times to arrive at a satisfactory model. This is extremely inconvenient when the user deals with a large dataset having tens or even hundreds of data fields. The user may have to spend hours uncovering any significant trends embedded in the dataset.
Consequently, there is a strong need for improved methods and graphical user interfaces for generating and visualizing models.