This specification relates to information processing.
The advent of cloud-based hosting services has created many opportunities for service developers to offer additional services that are highly useful to users. Examples of such services include automatically generating electronic reminders for users, providing advertisements that may be of particular interest to particular users, providing suggestions for activities in which a user may be interested, and the like.
To offer these services, a service provider may process a large set of documents for a large number of users in an effort to determine particular patterns in the documents that are indicative of a need for a particular service. To illustrate, a service provider may process messages from an on-line retailer and determine that an order confirmation includes data describing a product and a delivery date. Using this information, the service provider may generate an automatic reminder that reminds the user that the product is to be delivered on a certain day.
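By way of illustration only, the order-confirmation example above can be sketched in Python as follows. The message wording, the regular expression `ORDER_PATTERN`, the `Reminder` class, and the function `reminder_from_message` are all hypothetical names and formats assumed for this sketch; the specification does not prescribe any particular pattern or data format.

```python
import re
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional

# Hypothetical pattern for one retailer's confirmation wording (an assumption;
# a real system would determine such patterns from many documents).
ORDER_PATTERN = re.compile(
    r"Your order of (?P<product>.+?) will be delivered on "
    r"(?P<date>\d{4}-\d{2}-\d{2})"
)


@dataclass
class Reminder:
    """Data extracted from an order confirmation and used for a reminder."""
    product: str
    delivery_date: date

    def text(self) -> str:
        return (
            f"Reminder: {self.product} is to be delivered on "
            f"{self.delivery_date.isoformat()}."
        )


def reminder_from_message(body: str) -> Optional[Reminder]:
    """Return a Reminder if the message matches the pattern, else None."""
    match = ORDER_PATTERN.search(body)
    if match is None:
        return None
    delivery = datetime.strptime(match.group("date"), "%Y-%m-%d").date()
    return Reminder(product=match.group("product"), delivery_date=delivery)
```

A message that does not match the assumed pattern simply yields no reminder, which mirrors the idea that only documents exhibiting the particular pattern indicate a need for the service.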
Information that is derived from the documents and used by a service provider to provide services is generally referred to as a “document data collection.” A document data collection can take different forms, depending on how the data are used. For example, a document data collection can be a cluster of documents or a cluster of terms from the documents, where the data are clustered according to a content characteristic. Example content characteristics include a document being a confirmation e-mail from an on-line retailer, a message being sent from a particular host associated with a particular domain, etc. Another type of document data collection is a template that describes the content of the set of documents in the form of structural data. Other types of document data collections can also be used.
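As a non-limiting sketch of two of the forms described above, the following Python example clusters messages by sender domain (one possible content characteristic) and derives a very simple template from one cluster. The `Message` type, the function names, the `*` placeholder, and the token-position approach to building a template are assumptions made for illustration, not a method stated in this specification.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Message(NamedTuple):
    sender: str  # e.g. "orders@shop.example" (hypothetical address)
    body: str


def cluster_by_sender_domain(messages: List[Message]) -> Dict[str, List[Message]]:
    """Group messages by the domain part of the sender address."""
    clusters: Dict[str, List[Message]] = defaultdict(list)
    for message in messages:
        domain = message.sender.rsplit("@", 1)[-1].lower()
        clusters[domain].append(message)
    return dict(clusters)


def derive_template(messages: List[Message], placeholder: str = "*") -> List[str]:
    """Build a simplified template from a cluster.

    Tokens that are identical at a given position in every message are kept
    as fixed structure; tokens that vary are replaced by a placeholder.
    Assumes every message has the same number of whitespace-separated tokens.
    """
    token_rows = [message.body.split() for message in messages]
    if not token_rows or len({len(row) for row in token_rows}) != 1:
        raise ValueError("messages must be non-empty and have equal token counts")
    return [
        column[0] if len(set(column)) == 1 else placeholder
        for column in zip(*token_rows)
    ]
```

For instance, two messages from the same domain reading "Order 123 ships Monday" and "Order 456 ships Friday" would, under these assumptions, yield the template `["Order", "*", "ships", "*"]`.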
A service provider may need to analyze and modify the document data collection to improve the performance of the services that utilize the collection. Examination of private data, however, is often prohibited, i.e., a human reviewer cannot view or otherwise have access to the document data collection. Usually, during the generation of the document data collection, any private user information is removed and is not stored in the document data collection; regardless, examination by a human reviewer is still prohibited to preclude any possibility of an inadvertent leak of private information. While such privacy safeguards are of great benefit to users, the access restrictions can make it very difficult to analyze and improve the quality of the document data collection and of the services that use it.