1. Field of the Invention
The present invention relates to data object comparison and analysis, and in particular to software for comparing two or more data objects to determine the extent of any similarities between them.
2. Discussion of the Related Art
Companies increasingly rely on software not only to provide products for consumers or their institutions, but also to manage their day-to-day operations. Software code has therefore become a valuable intellectual property (IP) asset.
The ever-increasing complexity of computer software programs as well as tight development schedules force programmers to become more efficient. One way for programmers to meet these challenges is by reusing source code and adapting it to new applications rather than writing the source code from scratch.
To this end, open source software has become increasingly popular. Open source software is software source code that is publicly available and freely downloadable from the Internet. Thus, open source software code is a convenient resource for programmers looking to cut development time by downloading it and merging it with their proprietary application. In addition, the growth of the open source software movement may also motivate computer programmers to donate or contribute to the open source movement software that they have written but that is owned by their employer. The problem is that most open source software, while freely available for downloading, is not in the public domain.
In particular, open source software is not unrestricted; to the contrary, it is often subject to licenses that restrict not only the open source software code itself but also any modification thereof and any software that incorporates it. These open source licenses may require that the source code of any proprietary system using some open source software code be publicly disclosed. In other words, a programmer who uses open source code in a proprietary application may unintentionally subject that proprietary application to the constraints and restrictions of an open source license. This may have devastating effects on the ability of the company to protect its software IP or pursue further intellectual property protection for its software.
In addition, open source software carries another inherent risk: it is unknown to what extent open source software incorporates proprietary technology owned by others. Thus, even if open source software is free of any licensing restriction, such as open source software that is in fact committed to the public domain, the possibility remains that the software may infringe another's patents or property rights. A programmer who incorporates this open source code into a proprietary application may unintentionally expose their employer to unforeseen consequences such as infringement litigation.
Furthermore, the rapid growth of the software industry has driven many programmers and software engineers to change employers frequently. As these workers move between jobs, they may take with them proprietary source code that they wrote for a previous employer. Programmers may not be aware of, or sensitive to, these concerns, and so risk an inadvertent transfer of technology or intellectual property.
In addition, as companies increasingly rely on overseas or offshore development firms for software programming, there is a concern that the overseas development company may be reusing source code that it wrote for one client (who has rights to that software) for projects it works on with other clients.
The problem is not limited to computer source code. In addition to source code, design documents and technical specifications may be indicative of patent infringement or may be used to invalidate patents. But due to the relative ambiguity of terms of art in the software and business methods fields as well as the non-technical nature of language that is often used in patents, it is very difficult to assess IP risks properly and efficiently.
These IP risks are more serious given the tight regulatory environment in which companies operate. Corporate regulations, such as those collectively known as “Sarbanes-Oxley”, require that firms monitor their intellectual property assets as well as the financial risks to their business, perform regular IP and risk audits, and report the same to their shareholders, regulators, and the public.
But given that programmers often modify source code slightly when reusing it, it becomes difficult to perform software IP risk audits using redline or other character-based comparison methods. Thus, what is needed in the art is a multi-dimensional approach to comparing two or more corpuses, such as source code, documents, file objects, collections of data or file objects, or databases, that is able to determine the extent to which one corpus resembles another even when the particular structure or content of the two corpuses varies.
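The limitation of character-based comparison noted above can be illustrated with a minimal sketch. It is not the method of the present invention; it assumes Python's standard difflib module as a stand-in for a redline tool, and the helper names (normalize, char_similarity, token_similarity) and the two code snippets are hypothetical. The sketch shows that renaming identifiers lowers a character-level similarity score, while a comparison over normalized tokens still treats the two snippets as structurally identical.

```python
import difflib
import keyword
import re

# Hypothetical tokenizer: identifiers/keywords, integer literals, or any
# single non-whitespace character (operators, punctuation).
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def normalize(source):
    """Map every non-keyword identifier to 'ID' and every integer to 'NUM'."""
    tokens = []
    for tok in TOKEN_RE.findall(source):
        if re.match(r"[A-Za-z_]", tok) and not keyword.iskeyword(tok):
            tokens.append("ID")
        elif tok.isdigit():
            tokens.append("NUM")
        else:
            tokens.append(tok)
    return tokens


def char_similarity(a, b):
    """Character-level similarity, roughly what a redline comparison sees."""
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


def token_similarity(a, b):
    """Similarity over normalized tokens, insensitive to renamed identifiers."""
    return difflib.SequenceMatcher(
        None, normalize(a), normalize(b), autojunk=False
    ).ratio()


# Illustrative snippets: the second is the first with every name changed.
original = (
    "def total(prices):\n"
    "    result = 0\n"
    "    for p in prices:\n"
    "        result += p\n"
    "    return result\n"
)
renamed = (
    "def sum_all(values):\n"
    "    acc = 0\n"
    "    for v in values:\n"
    "        acc += v\n"
    "    return acc\n"
)

if __name__ == "__main__":
    print("character-based:", round(char_similarity(original, renamed), 2))
    print("token-based:    ", round(token_similarity(original, renamed), 2))
```

Token normalization is only one possible dimension of comparison, chosen here because it is simple to show; it would not by itself detect reordered statements or restructured logic.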