As personal communications devices (e.g., cell phones) are developed to support greater and greater functionality, people are using them to do much more than talk. As is well known, these devices now usually allow their users to access web sites, to run web-based applications, to create media files (e.g., by taking a picture or by recording a video using a camera on the device), and to download media files from remote servers (via a web interface supported by the device). In the course of pursuing these activities, a user generates an enormous amount of information about his preferences and behaviors. Some of this information is explicitly generated when the user sets preferences in a profile. Other information may be implicit, such as the frequency with which the user runs a particular application.
Advertisers and other commercial entities realize how valuable this information, both explicit and implicit, can be. (Of course, entities other than businesses collect behavioral information about entities other than potential customers, but this example serves well to motivate the present discussion.) As advertisers look beyond “traditional” media (e.g., magazines and television) to “new media” (e.g., online and mobile services) in order to increase the effectiveness of their advertising campaigns, the advertisers would like to personalize messages directed to a particular user. If the personalization is based on real information about the user's likes and dislikes, then, in theory at least, the personalized message can be more meaningful to the user than the traditional generic messages broadcast to everyone. For example, a retailer could direct messages to a user who is actively searching for information about products similar to ones that the retailer sells. This allows the retailer to tap into the needs of people prepared to buy rather than, as in the traditional approach, blindly sending advertisements to people who are simply watching television or reading a print medium.
Several technologies have been developed to gather customer information. Web browsers, for example, often track a person's searches and report the search queries to businesses that may provide the products that the person is searching for. It is a common experience to search the web for, say, “snow blowers,” and then see pop-up advertisements for snow blowers just a few seconds after the initial search. Buying habits are also tracked in the check-out lane of the local grocery store, and that information is used to present very specific coupons to the customer along with his receipt. The gathered information is constantly fed to businesses so that the businesses can refine their offerings, locate potential future markets, direct advertising to likely candidates, manage inventory, and the like.
As information is gathered about a particular person, a “profile” of that person is created. From a commercial entity's point of view, the more information fed into a person's profile, and the greater the specificity of that information, the better. For example, to better tailor incentives, a provider of streaming movies would like to know not only that a given person likes watching westerns but also that this person watches westerns only after 9 p.m. on weekdays, when his little children have gone to sleep.
This example begins to hint at the enormous amount of information that is potentially available to be gathered into a person's profile. To control this huge amount of information, the personal profile is carefully constructed. As is well known, each information sample can be plotted as a point in a multi-dimensional space. The dimensions in the space represent features of a data sample (e.g., where was the user when this sample was collected? how old was he? what was he doing? whom was he with?). The position along a dimension represents the value of that feature. This type of structure makes it relatively easy to “find” the person's preferences in the multi-dimensional space and, from those preferences, to produce reasonably accurate recommendations.
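The multi-dimensional representation described above can be illustrated with a minimal Python sketch. It is not taken from any particular profiling system: the `Sample` fields (`genre`, `hour`, `weekday`) and the helper `preferred_genre` are hypothetical names chosen to mirror the streaming-movie example, and a real profile would have far more dimensions.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One information sample: a point in a (hypothetical) 3-dimensional feature space."""
    genre: str     # what the user watched
    hour: int      # hour of day (0-23) when the sample was collected
    weekday: bool  # True if collected Monday through Friday


def preferred_genre(profile, *, hour_from, weekday):
    """Return the most frequent genre among samples in one region of the space.

    The region is defined by hour >= hour_from and the given weekday flag.
    Returns None when no sample falls inside the region.
    """
    counts = Counter(
        s.genre for s in profile
        if s.hour >= hour_from and s.weekday == weekday
    )
    return counts.most_common(1)[0][0] if counts else None


# Illustrative data only; these samples are invented for the sketch.
profile = [
    Sample("western", 21, True),
    Sample("western", 22, True),
    Sample("cartoon", 17, True),
    Sample("comedy", 20, False),
]

late_weekday_choice = preferred_genre(profile, hour_from=21, weekday=True)
```

Here "finding" a preference amounts to selecting the points that fall inside a region of the feature space and summarizing them, which is what makes this structure convenient for producing recommendations.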
This multi-dimensional way of representing a personal profile has problems, however. There are so many potential features, and so many values of those features, that the resulting profile begins to consume huge amounts of storage space, creating cost and maintenance problems that only increase as the amount of data gathered for a particular person increases and as the number of persons profiled increases. Also, a traditional personal profile may cover only one domain of the person's activities (e.g., media consumption), making the profile useless for predictions outside that domain. Moreover, even though these profiles may be very large, they are often, from a statistical viewpoint, very “sparsely populated” because they may have only a few data points located along any given dimension. This sparseness severely limits the predictive power of the profile.
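The sparseness problem can be made concrete with a short Python sketch. The feature names and the number of distinct values per feature below are assumptions picked purely for illustration, as is the helper `occupancy`; the point is only that the number of possible positions in the space grows multiplicatively with each added feature, while the number of observed samples does not.

```python
from math import prod


def occupancy(samples, cardinalities):
    """Return (occupied_cells, total_cells) for a discrete feature space.

    `cardinalities` maps each feature name to its number of possible values;
    each sample is a dict giving one value per feature.
    """
    total_cells = prod(cardinalities.values())
    occupied = {tuple(s[f] for f in cardinalities) for s in samples}
    return len(occupied), total_cells


# Hypothetical feature space: four features with assumed value counts.
cardinalities = {"genre": 20, "hour": 24, "day": 7, "location": 50}

# Invented samples; the first two coincide and occupy the same cell.
samples = [
    {"genre": "western", "hour": 21, "day": "Mon", "location": "home"},
    {"genre": "western", "hour": 21, "day": "Mon", "location": "home"},
    {"genre": "western", "hour": 22, "day": "Tue", "location": "home"},
    {"genre": "comedy", "hour": 20, "day": "Sat", "location": "home"},
]

occupied, total = occupancy(samples, cardinalities)
```

Even with only four modest features, the space in this sketch has 168,000 possible positions, of which the four samples occupy three. Adding further features multiplies the total again, so most of the space stays empty no matter how much is stored.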