The following relates to the modeling arts, prediction arts, machine learning arts, recommendation system arts, and related arts.
A computer-based modeling system typically entails two operations: training a mathematical representation (model) of an apparatus, operation, system, or other object; and using the trained model to predict or simulate activity, characteristics, or other features of the modeled object. By way of illustrative example, in a recommendation system for use in an online retail website the goal is to predict a user's interest in retail merchandise in order to generate personalized product recommendations or advertisements for the user. In one known approach, the recommendation system employs nonnegative matrix factorization (NMF) of a user-item matrix, in which the matrix elements identify items of actual interest to users (based on actual purchases, or based on item web page views, et cetera). NMF decomposes the user-item matrix into lower-rank factor matrices that cluster together similar items and similar users. As another illustrative example, in a computer-based medical diagnostic system the goal may be to associate a medical diagnosis with a set of symptoms. NMF can be applied to a matrix of diagnosed patients and their symptoms to associate typical symptoms with a diagnosis, or to associate a diagnosis with a set of symptoms. In general, the useful output of computer-based modeling may be a prediction, e.g. predicting an item likely to be of interest to a user, or may be a data mining output, e.g. discovering data correlations or anti-correlations. As an example of a discovery process, a set of documents may be processed based on content (e.g. using “bag-of-words” representations) to classify the documents for indexing/archival purposes.
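As a rough illustration of the factorization described above, the following minimal Python sketch applies the widely used multiplicative-update rules for NMF to a small, made-up user-item matrix. The matrix values, the rank of 2, the iteration count, and all function names are assumptions chosen for illustration only; a practical system would ordinarily rely on an optimized numerical library rather than plain Python lists.

```python
import random


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def frobenius_error(v, w, h):
    """Frobenius norm of the residual V - W*H."""
    approx = matmul(w, h)
    return sum((x - y) ** 2
               for row_v, row_a in zip(v, approx)
               for x, y in zip(row_v, row_a)) ** 0.5


def nmf(v, rank, iterations=500, eps=1e-9, seed=0):
    """Factor nonnegative V (users x items) into W (users x rank) and
    H (rank x items) using multiplicative updates.

    Illustrative sketch only: the defaults are assumptions, not tuned values.
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(v), len(v[0])
    # Strictly positive random start keeps every later update nonnegative.
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n_rows)]
    h = [[rng.random() + 0.1 for _ in range(n_cols)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_cols)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n_rows)]
    return w, h


# Hypothetical data: rows are users, columns are items, 1 marks an item the
# user purchased or viewed. Two groups of users favor two groups of items.
user_item = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

w_start, h_start = nmf(user_item, rank=2, iterations=0)  # random start only
w, h = nmf(user_item, rank=2)
error_before = frobenius_error(user_item, w_start, h_start)
error_after = frobenius_error(user_item, w, h)
```

Each row of `w` weights a user against the two latent factors and each column of `h` weights an item against them, so users and items that load on the same factor are grouped together. Comparing `error_after` with `error_before` gives a simple check that the factorization has moved toward the observed matrix.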
Computer-based modeling relies upon having a sufficiently large and representative set of training data to generate an accurate model. The trend toward “big data” illustrates this, as organizations endeavor to leverage large internal databases (in the case of a large corporation, government, or the like) or external databases (e.g. the public Internet) to produce complex models for diverse purposes. The quality of these data sets varies. For example, corporate databases may contain vast quantities of data, but the data may be systematically biased based on the product portfolio of the corporation, or the geographical region in which the corporation operates, or so forth. The public Internet provides large quantities of data, but is limited to public information; personalized data such as medical diagnoses, credit card purchases, and so forth are generally not publicly available.
The so-called “deep web” is the portion of the Internet that is not public. The deep web includes password-protected websites, encrypted private websites, local area networks connected to the Internet by firewalls or other security, and so forth. The deep web contains private data such as medical records, retail purchase records, proprietary survey results, and so forth. These data would be useful for many computer-based modeling tasks, especially if the data from various parts of the deep web could be merged together to form large and diverse training data sets. However, private data on the deep web typically cannot be made publicly available due to personal privacy, confidentiality, and/or proprietary concerns.
One known approach for overcoming this difficulty is the use of smaller-scale collaborations, such as partnerships or consortiums of two, three, or more organizations, whose members agree to share data on some contractually defined basis. Even on these smaller scales, however, data sharing may be hindered by privacy concerns, and/or by an unwillingness to expose private data to potential competitors. Moreover, a partnership or consortium still limits the deep web data that can be merged to the data belonging to the member organizations, and further to the subset of those data that the members are willing to share.
Data anonymization is another tool for facilitating data sharing. This approach is commonly used in medical research, by removing identifying information such as name, address, and so forth before sharing the data. However, the information removed in order to make the data anonymous can greatly reduce the value of the data. For example, removing address information can hinder disease outbreak geographical modeling. On the other hand, if too little information is removed then the data may not be sufficiently anonymous, leading to patient privacy concerns. Data anonymization of more “free-form” data formats, such as electronic mail (email) messages, can be difficult to automate—for example, in an email message it may be straightforward to automatically strip sender and recipient header information, but it is also necessary to parse the body of the email to identify and anonymize information such as personal names, company names, location names, and so forth. Automatic anonymization of free-form data can be error-prone, again leading to privacy concerns. A further problem is that even in anonymized form the owner of the data may be unwilling to expose it to the public—for example, anonymized medical data may provide a medical company with a substantial competitive advantage it is unwilling to relinquish.
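The email example above can be made concrete with a minimal Python sketch using the standard-library `email` package. The list of headers treated as identifying, the function name, and the sample message are all assumptions made for illustration; the sketch performs only the straightforward header-stripping step and deliberately leaves the message body untouched.

```python
from email import message_from_string

# Hypothetical choice of headers to treat as identifying; a real
# anonymization policy may cover more or fewer fields.
IDENTIFYING_HEADERS = ("From", "To", "Cc", "Bcc", "Reply-To", "Sender")


def strip_identifying_headers(raw_message):
    """Remove sender/recipient headers, leaving other headers and the body."""
    message = message_from_string(raw_message)
    for name in IDENTIFYING_HEADERS:
        # Deleting a header that is absent is a no-op, so no check is needed.
        del message[name]
    return message.as_string()


# Made-up sample message used only to exercise the function.
sample = (
    "From: Alice Smith <alice@example.com>\n"
    "To: Bob Jones <bob@example.com>\n"
    "Subject: Site visit\n"
    "\n"
    "Hi Bob, this is Alice from Acme Corp. See you in Springfield.\n"
)

cleaned = strip_identifying_headers(sample)
```

After this step the addresses in the headers are gone, but the personal, company, and location names in the body remain in `cleaned`. Removing those requires parsing the free-form text, which is the harder and more error-prone part of the task.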
Computing capacity is another concern in leveraging big data sets in computer-based modeling. Even if a consortium is able to overcome the various data sharing hurdles, the resulting enormous mutually shared data set may be too large for the computing capacity of any single member to effectively process. An apparent solution to this is to combine the computing capacities of the consortium members, but there are difficulties. The different members of the consortium may be dealing with different computer-based modeling tasks, and they may be unwilling to commit limited computing resources to solving tasks of other members of the consortium. Sharing computing resources also may require sharing computer code or other task-specific information which the various members may wish to keep confidential or proprietary (even if they are willing to share some of the underlying training data).