1. Field
This application relates to distributed software systems.
2. Description of Related Art
Solving certain important computational problems currently requires an amount of time that grows exponentially with the size of the problem. As a result, single computers can practically solve only small instances of such problems. Large networks, such as the Internet, have the potential to solve larger instances significantly faster. The ability to solve such problems many times faster than a single computer has substantial academic, financial, and social implications and greatly impacts such fields as medicine, management, systems engineering, and others. For example, the ability to determine proteins' minimal-free-energy structure within days (as opposed to years) could lead to cures or treatments of cancers, HIV, and other life-threatening diseases. As another illustration, the ability to accurately predict the optimal allocation of resources to a project could dramatically cut costs of public and private projects. The $14 billion “Big Dig” highway construction project of Boston, Mass., for example, would likely have benefited substantially from the availability of such predictions.
A problem faced by practitioners in the art is that designing a software system to distribute the computation over a large private or public network almost invariably means disclosing the input and algorithm to others. That is, the involved data does not remain private throughout the computation. For instance, several systems for distributing computation over a large network have been realized, such as Google's MapReduce and Amazon's EC2. Additionally, various large-scale computing efforts for computationally intensive problems over the Internet have been proposed or implemented. Examples include SETI@home and Folding@Home. The methods used to solve these problems disclosed the inputs, algorithms, and outputs to the participating Internet nodes.
Many illustrative scenarios can be contemplated wherein the computing power of a large network may be highly desirable given the nature of a particular problem, but where the loss of privacy will deter enterprises from developing systems to distribute the computation. One example of such a problem is an “NP-complete” problem. The NP-complete problems are an important class of problems having the properties that (i) any solution to the problem can be verified quickly (in polynomial time), and (ii) if the problem can be solved quickly, then so can every problem in NP. A main characteristic of these problems is that no quick method of solving them is known and computation times may dramatically increase with the size of the problem. Important NP-complete problems having significant practical applications need to be solved. Conventional techniques, however, have failed to provide for distributed systems to solve these and similar problems without compromising privacy.
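By way of illustration only, the gap between verifying and finding a solution can be sketched for Boolean satisfiability, a well-known NP-complete problem. The following Python sketch is not part of any system described herein; the function names `verify` and `brute_force_solve` and the clause encoding are assumptions chosen for this example. Checking a candidate assignment takes time proportional to the size of the formula, whereas the naive search shown may examine up to 2^n assignments for n variables.

```python
from itertools import product

# Illustrative encoding (an assumption of this sketch): a formula is a list of
# clauses, and each clause is a list of nonzero integers. The literal k means
# "variable k is true"; the literal -k means "variable k is false".

def verify(formula, assignment):
    """Check a candidate assignment in time linear in the formula size."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in formula
    )

def brute_force_solve(formula, num_vars):
    """Naive search: tries up to 2**num_vars assignments, so the worst-case
    work doubles with every added variable."""
    for values in product([False, True], repeat=num_vars):
        assignment = {i + 1: value for i, value in enumerate(values)}
        if verify(formula, assignment):
            return assignment
    return None  # no satisfying assignment exists

# (x1 or x2) and (not x1 or x3) and (not x2 or not x3)
formula = [[1, 2], [-1, 3], [-2, -3]]
solution = brute_force_solve(formula, 3)
```

In this sketch, each call to `verify` is cheap, while `brute_force_solve` is the step whose cost grows exponentially and which would therefore benefit from being spread over many network nodes.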
In a first illustrative scenario depicting the privacy problem associated with existing approaches, a pharmaceutical company has generated a series of candidate proteins for treating a particular cancer. The company needs to predict the 3-D structure of the proteins as they would fold within the human body, but the proteins' amino acid sequences are valuable intellectual property and must remain private. The protein folding problem is NP-complete, and thus for reasonably sized proteins, it could take years on a single computer, or even on small private networks, to compute the desired structures. The company is unwilling to use existing approaches to distribute the computation on a public network because those approaches distribute the amino acid sequences to all helping nodes.
A second illustrative scenario involves image recognition, which is at the heart of many advanced artificial intelligence and security tasks. Matching faces seen in a camera to a database of known criminals allows automated intruder detection and aids security at public locations such as airports and casinos. However, facial recognition and image matching problems are NP-complete, and many people may enter the location of interest at once. Further, any employed solution must execute quickly to deliver results in real time. In order to protect the identity and privacy of the innocent individuals entering the location, the system must guarantee that the entire computation takes place on a large private network which is capable of preserving privacy. Traditional approaches do not provide for such a mechanism.
What is needed is an architecture for allowing the creation of privacy-preserving distributed software systems, where the data involved remains private during and after the computation.