Electronic data collection methods allow enterprises to collect and store large amounts of data about their customers. This is especially true in enterprises having a strong consumer focus, such as retail, financial, communications and marketing organizations. Data collected can be used by enterprises to better understand the needs, preferences and purchasing patterns of their customers.
For example, an electronic commerce (e-commerce) web site having around one-hundred million users may collect data on products that interest each user. User interest would be measured by noting whenever the user “clicked” on a product. The user and the products in which the user was interested would be collected and stored. If the e-commerce web site wanted to initiate an advertising campaign for a new product, the site could use the collected data to focus the campaign on that portion of its one-hundred million users that is likely to purchase the new product. Using the collected data to formulate this type of targeted marketing plan would make the advertising campaign more efficient, effective and economical.
Data collected is stored in a large data set, and may be stored in a variety of formats. One such format is a two-dimensional (2-D) format using rows and columns. In this 2-D format, each row contains a sample and each column contains a variable or feature. In the above example, a sample may correspond to a user's identity and a feature may correspond to a product that was clicked on by the user. A large data set typically can contain billions of samples and millions of features. This means that a large data set easily can contain more than a terabyte (10^12, or one trillion, bytes) of data.
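The 2-D format described above can be illustrated with a minimal Python sketch. The click log, the identifiers `u1` and `p1`, the helper `dense_size_bytes`, and the one-byte-per-cell storage cost are all assumptions made for illustration; none of them comes from an actual data set.

```python
# Hypothetical click log: (user_id, product_id) pairs. Illustrative only.
clicks = [("u1", "p2"), ("u2", "p1"), ("u1", "p3"), ("u3", "p2")]

# Each distinct user becomes a row (sample); each distinct product a column (feature).
users = sorted({user for user, _ in clicks})
products = sorted({product for _, product in clicks})
row_of = {user: i for i, user in enumerate(users)}
col_of = {product: j for j, product in enumerate(products)}

# A cell holds 1 if the user clicked on the product, otherwise 0.
matrix = [[0] * len(products) for _ in users]
for user, product in clicks:
    matrix[row_of[user]][col_of[product]] = 1


def dense_size_bytes(n_samples, n_features, bytes_per_cell=1):
    """Size of a dense rows-by-columns table (assumes a fixed cost per cell)."""
    return n_samples * n_features * bytes_per_cell


# One billion samples by one million features at one byte per cell.
size = dense_size_bytes(10**9, 10**6)
terabytes = size // 10**12
```

Under these assumptions the dense table comes to 10^15 bytes, or one thousand terabytes, which is consistent with the statement that a large data set easily can exceed a terabyte.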
Processing such a large amount of data can be difficult. Processing of the large data set is performed to extract information that can be useful to an enterprise. This useful information includes, for example, information about historical patterns and predictions about future trends. Processing extracts useful information from the large data set by discovering correlations or relationships between samples and features. A large data set contains too much data to be processed in its entirety by loading all the data into the memory of an application, such as a database application.
One type of processing technique involves making predictions based on data in the large data set. In general, prediction processing techniques use a portion of the data to build a prediction model. A prediction model is a mathematical model that makes predictions based on correlations or relationships among features. After the prediction model is built, one sample at a time is loaded into the prediction model and processed to make a prediction for that sample.
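The general flow described above (build a model from a portion of the data, then load one sample at a time) might be sketched as follows. This is only an illustration under stated assumptions: the co-occurrence model, the function names `build_model`, `predict` and `stream_samples`, and the comma-separated input format are all hypothetical and are not the technique of any particular system.

```python
import csv
import io


def build_model(training_rows, target):
    """Estimate P(target clicked | feature clicked) from a portion of the data."""
    counts = {}  # feature index -> [times feature clicked, times target also clicked]
    for row in training_rows:
        for j, value in enumerate(row):
            if j != target and value == 1:
                seen_both = counts.setdefault(j, [0, 0])
                seen_both[0] += 1
                seen_both[1] += row[target]
    return {j: both / seen for j, (seen, both) in counts.items()}


def predict(model, row, target):
    """Score one sample: mean conditional rate over the features it clicked."""
    rates = [model[j] for j, value in enumerate(row)
             if j != target and value == 1 and j in model]
    return sum(rates) / len(rates) if rates else 0.0


def stream_samples(lines):
    """Yield one sample at a time so the full data set is never held in memory."""
    for record in csv.reader(lines):
        yield [int(value) for value in record]


# Hypothetical training portion; column 2 is the product being predicted.
training = [[1, 0, 1], [1, 1, 1], [0, 1, 0], [0, 0, 0]]
model = build_model(training, target=2)

# Stand-in for a large file; a real data set would be read from disk.
remaining = io.StringIO("1,0,0\n1,1,0\n0,0,0\n")
scores = [predict(model, sample, target=2) for sample in stream_samples(remaining)]
```

Note that `stream_samples` is tied to one assumed storage format. A data set stored differently would need a different loader.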
One problem with these prediction processing techniques is that specialized computer code must be used to load each sample into the prediction model for processing. A database application cannot be used because the entire large data set far exceeds the memory capacity of the application. Because a conventional database application cannot be used, specialized computer code specific to the format of the data in the large data set must be written to load each sample into the prediction model. This is often time-consuming and difficult. These techniques frequently are used in research or academic environments by those who are capable of writing, and willing to write, specialized computer code. In a business environment, however, it is a burdensome and expensive task for an enterprise to have to write computer code customized for its data set. Instead, an enterprise would prefer to have a prediction processing technique that requires no specialized skills or knowledge.
Accordingly, a need exists for a method and a system for processing a large data set that obtains predictions valid for the entire data set while using only a fraction of the entire data set.