1. Field of the Invention
The present invention relates to a system and method for exchanging, integrating and analyzing information from multiple sources without risking disclosure of potentially confidential information from any of those sources. Specifically, the present invention relates to the use of a data agent that collects, analyzes and aggregates information related to a data set of interest and forwards the results in a form that can be combined with like data from other sites, without divulging confidential information contained in the data set.
2. Discussion of the Related Art
The analysis of data generally requires a sufficient set of data points to determine whether results represent real correlations or merely random coincidence. In many industries, there are questions that cannot be answered by any one institution because the size and variation of its dataset are insufficient. Competitors, collaborators, and regulators may have a mutual interest in sharing data to provide a joint body of information for answering questions in which they are interested. However, due to competitive, regulatory, or other concerns of trust, institutions may be reluctant to disclose such data, particularly identifying data. Moreover, in regulated industries, such as healthcare or finance, or in industries where privacy is implied, sharing of certain data is prohibited. Accordingly, a current need exists for a methodology for exchanging, integrating and analyzing information using a technique that can overcome these concerns and prohibitions and provide data of sufficient size and variation, with the added benefit of ensuring anonymity of data providers.
Current techniques and systems attempt to address these confidentiality and disclosure problems through the use of various data filters that forward relevant data while preventing the dissemination of private information by removing personal identifiers. The data filter may be located at a data source. For example, data may be collected from a hospital using an application that strips patient information from the data records before sending the data records for statistical analysis. Alternatively, other known data stripping utilities operate at a data analysis location, removing confidential information from data acquired from a distant location, either before or after statistical analysis of the acquired data. The problem with these methods is fourfold. First, the anonymization techniques used are often reversible given other external information, or are insufficient to completely anonymize the individual. Second, the data records themselves are no longer under the control of the source site, and so could be used inappropriately. Third, fully anonymizing the data may require removal of important fields other than explicit identifiers. This loss of fields or variables may put constraints on the utility of anonymized data in a pooled analysis. Fourth, removing data that might identify an individual might also impede the ability to find and analyze rare events. For meaningful analysis of rare events, which by definition occur infrequently, all data points should be included because sampling techniques are inappropriate and may miscount or otherwise distort the occurrence of the rare events. Thus, not only might the relevant data be removed for de-identification, but the analysis also cannot simply be performed at individual sites and the conclusions then combined, because rare events will not show up as significant in local analyses.
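The fourth point can be illustrated with a minimal sketch. The site names and counts below are hypothetical and chosen only for illustration, and the one-sided Fisher exact test is an assumed choice of significance test, not a method prescribed by this disclosure. Each site holds a 2x2 table of exposed and unexposed individuals, with and without a rare event. Tested locally, no site reaches the conventional 0.05 threshold, whereas the same test applied to the summed counts does.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    a: exposed with event,   b: exposed without event,
    c: unexposed with event, d: unexposed without event.
    Returns P(X >= a) under the hypergeometric null of no association.
    """
    total = a + b + c + d
    events = a + c
    exposed = a + b
    denom = comb(total, exposed)
    p = 0.0
    for k in range(a, min(events, exposed) + 1):
        # Probability that exactly k of the events fall in the exposed group.
        p += comb(events, k) * comb(total - events, exposed - k) / denom
    return p


# Hypothetical per-site counts (a, b, c, d); not taken from any real data set.
sites = {
    "site_a": (3, 997, 0, 1000),
    "site_b": (2, 998, 1, 999),
    "site_c": (3, 997, 0, 1000),
}

# Local analysis: each site tests only its own counts.
local_p = {name: fisher_one_sided(*counts) for name, counts in sites.items()}

# Pooled analysis: sum the four cells across sites, then test once.
pooled = tuple(sum(cell) for cell in zip(*sites.values()))
pooled_p = fisher_one_sided(*pooled)

for name, p in local_p.items():
    print(f"{name}: p = {p:.4f}")
print(f"pooled {pooled}: p = {pooled_p:.4f}")
```

In this sketch only the four cell counts per site are summed, so the pooled test does not require any individual record to leave its source site.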
One relevant example of the points described above occurs in the healthcare industry, where many hospital records systems may not provide release dates, exact age, or indicators of rare medical conditions when those conditions are sufficiently rare to identify the individual. Accordingly, a need exists for an automated data collection technology that is more robust, thereby allowing data collection over a variety of different sources and searches without losing access to data of interest.