Data aggregation is a process in which information is gathered and expressed in a summary form for purposes such as statistical data analysis. It often reveals useful information hidden in a large volume of original data records. For example, from a database containing millions of sales records generated by an online store, a marketing analyst can learn about a particular group of consumers, such as trends and patterns in their shopping habits, by aggregating the related sales records based on specific variables such as product type, product pricing, customer age, customer gender, geographic location (e.g., store location or purchaser's address), and any other customer and/or product information available in the database.
As another example, a web search engine may receive millions of queries per day from users around the world. For each query, the search engine generates a query record in its query log. The query record may include one or more query terms, a timestamp indicating when the query is received by the search engine, an IP address identifying a unique device (e.g., a PC or a cell phone) from which the query terms are submitted, and an identifier associated with a user who submits the query terms (e.g., a user identifier in a web browser cookie; in some cases the user identifier may also be associated with a toolbar or other application or service to which the user has subscribed). Appropriate aggregation of these query records can also unveil interesting or useful information about the web search engine users. For instance, a publisher can gauge the popularity of a newly released book in a specific city from the frequencies of relevant queries submitted by users from that city within a given time period.
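As a minimal sketch of the kind of aggregation described above, the following Python snippet counts the queries that contain a given term, grouped by city, within a time window. The `QueryRecord` class, its field names, the `city` field (assumed to be resolved from the IP address by a geolocation step not shown here), and the function name `query_counts_by_city` are hypothetical and serve only as illustration; they do not describe any particular search engine's log format.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class QueryRecord:
    """Hypothetical in-memory form of one query log record."""
    terms: tuple          # one or more query terms
    timestamp: datetime   # when the query was received
    ip_address: str       # device from which the query was submitted
    user_id: str          # e.g., an identifier from a browser cookie
    city: str             # assumption: derived from ip_address elsewhere


def query_counts_by_city(records, term, start, end):
    """Count records containing `term`, per city, with start <= timestamp < end."""
    needle = term.lower()
    counts = Counter()
    for rec in records:
        in_window = start <= rec.timestamp < end
        if in_window and needle in (t.lower() for t in rec.terms):
            counts[rec.city] += 1
    return counts
```

A publisher-style inquiry would then amount to calling `query_counts_by_city` with a term from the book title and the time period of interest, and reading off the count for the city in question. A linear scan like this one is only practical for small logs; it illustrates what is being computed, not how to compute it quickly.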
For the same query log, social scientists, marketers, and politicians may have dramatically different interests and therefore require different types of data aggregation to meet their needs. Some types of “data mining” of a search engine's log records may be useful only if the statistical inquiries receive substantially instantaneous responses (e.g., in less than five seconds). However, most conventional data aggregation techniques are incapable of deriving reliable statistical information from a large number of query records substantially instantaneously.
Another concern with mining search engine query logs or commercial transaction logs is the protection of user privacy. Even if the log records do not contain user names or the like, returning statistical or trend information based on very small numbers of users or transactions (e.g., fewer than twenty transactions) may inadvertently disclose information that can be traced back to individuals or small groups of users (e.g., fewer than a predefined number of distinct users, such as twenty, one hundred, or two hundred distinct users). It is therefore important that any log record data mining tool include safeguards for preventing the disclosure of information that may be traced back to individuals or small groups of users.
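One simple form such a safeguard could take is to suppress any aggregated result that is backed by too few distinct users. The sketch below is an illustration under stated assumptions, not a description of any specific tool: the function name `safe_counts`, the `(group_key, user_id)` input shape, and the default threshold of twenty (borrowed from the example figures above) are all hypothetical.

```python
from collections import defaultdict

# Assumption: threshold chosen from the example values mentioned in the text.
MIN_DISTINCT_USERS = 20


def safe_counts(rows, min_users=MIN_DISTINCT_USERS):
    """Aggregate (group_key, user_id) pairs into per-group record counts.

    Groups backed by fewer than `min_users` distinct users are omitted
    from the result, so their statistics are never returned.
    """
    counts = defaultdict(int)
    users = defaultdict(set)
    for key, user_id in rows:
        counts[key] += 1
        users[key].add(user_id)
    return {k: n for k, n in counts.items() if len(users[k]) >= min_users}
```

Note that the check is on distinct users rather than on the number of records: a group with many records that all come from one or two users is still suppressed, since its statistics could be traced back to those users.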