1. Field of the Invention
This invention relates generally to databases and, in particular, to the coding and compression of database information and tables.
2. Brief Description of the Prior Art
A database is a collection of logically related data, usually stored on computers, set up either to be questioned or queried directly, or to provide data to one or more applications. Typically, data in a database is logically represented as a collection of one or more tables. Each table is composed of a series of rows and columns. Each row in the table represents a collection of related data. Each column in the table represents a particular type of data. Thus, each row is composed of a series of data values, one data value from each column.
A database and its applications may reside on one computer, or the database may be distributed over a number of computers that are connected by a network, such as a local area network, a virtual private network, or the Internet. Duplicates of part or all of a database may be stored on different computers for performance, availability, or other reasons. The applications that use a database may reside on one of the computers where the database resides or may reside on other computers connected to the database over a network.
A database that is logically organized into tables of data is typically managed by a Relational Database Management System (RDBMS). The RDBMS provides commonly needed services familiar to one skilled in the art, such as: means to retrieve data in a related way from more than one table in response to a question; means to update data; means to ensure integrity of the data with respect to constraints; means to control access to the data; and means to index the data for rapid access.
Typically, a question put to a database will be written in a notation that is based on a mathematical construct called the Relational Algebra. The answer to a question is itself a table. There are three main operations in the Relational Algebra that can be used together to construct an answer table for a question: projection of a table on some of its columns results in a new table consisting of the set of rows obtained by omitting the remaining columns; selection of rows meeting a certain criterion from a table results in a new table consisting of only those rows of the original table that meet the criterion; and joining two tables results, conceptually, in a new table obtained by appending each row of the first table to each row of the second table and then keeping only those combined rows that have the same values in certain designated columns of the two tables. One skilled in the art will recognize the constructs of the Relational Algebra in the Structured Query Language (SQL), which is a common means of accessing and manipulating data in an RDBMS.
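By way of illustration only, the three operations described above can be sketched in a few lines of Python. The sketch assumes that a table is modelled as a list of dictionaries; the function names `project`, `select`, and `join`, and the sample `employees` and `departments` tables, are hypothetical and are not taken from any particular RDBMS.

```python
# Illustrative sketch only: tables are modelled as lists of dicts.
# All names and sample data below are assumptions made for this example.

def project(table, columns):
    """Keep only the named columns; duplicate rows collapse, as in a set of rows."""
    seen, result = set(), []
    for row in table:
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(columns, key)))
    return result

def select(table, predicate):
    """Keep only the rows that meet the criterion."""
    return [row for row in table if predicate(row)]

def join(left, right, on):
    """Append each left row to each right row, keeping pairs that agree on `on`."""
    return [
        {**a, **b}
        for a in left
        for b in right
        if all(a[c] == b[c] for c in on)
    ]

employees = [
    {"emp": "Ann", "dept": 10},
    {"emp": "Bob", "dept": 20},
    {"emp": "Cy", "dept": 10},
]
departments = [
    {"dept": 10, "name": "Sales"},
    {"dept": 20, "name": "Research"},
]

depts_in_use = project(employees, ["dept"])
dept_10_staff = select(employees, lambda row: row["dept"] == 10)
staff_with_dept_names = join(employees, departments, ["dept"])
```

Each function returns a new table, so the operations can be composed; for example, a join followed by a selection and a projection corresponds roughly to an SQL SELECT statement with a JOIN and a WHERE clause.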
The data in a database often contains information that should be held in confidence and that should only be made available to authorized users or programs. For example, the data may be confidential to a certain business organization or it may contain military secrets. One skilled in the art will be familiar with RDBMS access controls. These access controls grant certain privileges, such as the permission to question or to update the data, to selected user identifications or programs based on the knowledge of a password or passwords. As such, RDBMS access controls provide a first line of defense for confidential information that is held in a database. However, experience shows that while there are strong reasons for making data from an RDBMS available over networks to authorized users or programs, there is an ongoing cycle of penetration by unauthorized users followed by incremental improvement in access controls. This can be seen by visiting the United States National Infrastructure Protection Center (NIPC) at www.nipc.gov. For example, NIPC advisory 01-003 lists a security hole that allows unauthorized users to tunnel Structured Query Language (SQL) requests through a public connection to a private back-end network. It is believed that unauthorized users have obtained the details of many credit cards by such methods; see www.sans.org/newlook/alerts/NTE-bank.htm.
A second line of defense that is familiar to one skilled in the art is to encrypt some or all of the entries in the tables in a database using a standard method, such as the Data Encryption Standard (DES) or public key cryptography. However, this line of defense is also subject to a cycle of penetration followed by improvements. In addition, there is currently active research into advances in mathematics and software that could lead to rapid methods of unauthorized decryption of data that has been encrypted using these standard methods. Moreover, some information, such as the number of rows in a table, remains available to unauthorized users or programs. In addition, the performance of the RDBMS for authorized users is reduced by the need to perform decryption for every query and encryption for every update.
There is a need in the art for an improved method of hiding data from unauthorized users and programs, while making it efficiently available to those who are authorized.
To overcome the limitations in the prior art described above, and to overcome other limitations that will become apparent upon reading and understanding the present specification, the present invention discloses a method, apparatus, and article of manufacture for a computer implemented encoder for encoding tables in a database, and optionally simultaneously reducing the space required to store the tables. This invention offers a third line of defense based on a semantic encoding method and system that is different from either access control or encryption of database entries. Semantic encoding can be used standalone or with any combination of prior methods.
It is an object of the present invention to provide an improved system for encoding tables in a database. It is another object of the present invention to provide an improved system that is compatible with prior methods for securing data in a database. It is another object of the present invention to provide a system such that, after an unauthorized attempt to decode a database table, an attacker cannot tell by looking at the output that he has or has not succeeded in reconstructing the table correctly. It is a further object of the present invention to provide a system such that, after an unauthorized attempt to decode a database table, an attacker cannot tell by looking at the output how many rows were in the original table. It is a still further object of the present invention to provide an improved system for compressing a database while making it secure. It is another object of the present invention to provide an improved system for making only certain approved parts of a database available to particular users, groups of users, or applications. It is a further object of the present invention to provide an improved system for protecting a data provider's commercial interest in the data in a database, in a situation in which authorized users are billed for answers to questions that they put to the database.
Specifically, the present invention relates to a novel way of securing the contents of a database and of making those contents available only to authorized individuals, groups of individuals, or programs. Authorization is given by making known a collection of keys or key numbers and/or permutations. This invention can be used in isolation and it can also be used to complement the prior art. The present invention is based on a principle that is different from the principles underlying the prior art that includes access control and encryption. While access control and encryption-based methods can make unauthorized access to data difficult, the present semantic encoding system can make such access impossible.
In accordance with the present invention, a method and system are described to allow the encoding and compression of one or more tables of data by splitting each table into two or more sub-tables, and to allow the splitting to be done using a collection of permutations and keys or key numbers, such that the original tables cannot be reconstructed from the sub-tables without knowledge of the permutations and keys or key numbers. A table is split along its columns into two or more sub-tables. The numbering of the rows in the sub-tables is permuted according to an equation containing permutations and keys or key numbers. For certain kinds of tables, an interconnection table containing permuted row numbers is formed. The sub-tables and the interconnection table are optionally padded with misleading rows. The process of splitting, permuting, forming an interconnection table, and padding may optionally be repeated on the sub-tables, and so on. An authorized user or program that knows the permutations, the keys or key numbers, and how they are combined in equations, can efficiently and correctly query and update the sub-tables and the interconnection table(s), and can efficiently and correctly reconstruct the original table. An unauthorized user or program that does not know the permutations, keys or key numbers or equations, can optionally be prevented from obtaining any rows of the original table. If the encoding is configured to allow an unauthorized user or program to obtain, amongst many others, some of the rows of the original table, that user or program still cannot tell which are the correct rows and which are not. An unauthorized user or program cannot know what effect any updates he or it makes will have on the data seen by authorized users or programs; that is, an unauthorized user or program cannot reliably insert misleading data and cannot selectively delete chosen data.
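The general idea of splitting a table along its columns and permuting the row numbering of each sub-table can be illustrated with the following minimal Python sketch. It is not the encoding of the present invention: the placement rule `perm[(i + key) % n]`, the function names, and the sample data are all assumptions chosen for illustration, and the sketch omits the interconnection table, the padding with misleading rows, and the optional repetition of the process.

```python
import random

def make_permutation(n, seed):
    """Hypothetical helper: derive a permutation of 0..n-1 from a secret seed."""
    perm = list(range(n))
    random.Random(seed).shuffle(perm)
    return perm

def split_table(rows, left_cols, right_cols, perm_l, perm_r, key_l, key_r):
    """Split rows along columns; store row i at a permuted position in each sub-table.

    Assumed placement rule (illustrative only): position = perm[(i + key) % n].
    """
    n = len(rows)
    left, right = [None] * n, [None] * n
    for i, row in enumerate(rows):
        left[perm_l[(i + key_l) % n]] = tuple(row[c] for c in left_cols)
        right[perm_r[(i + key_r) % n]] = tuple(row[c] for c in right_cols)
    return left, right

def reconstruct(left, right, left_cols, right_cols, perm_l, perm_r, key_l, key_r):
    """Rebuild the original rows; correct only if the permutations and keys match."""
    n = len(left)
    rows = []
    for i in range(n):
        a = left[perm_l[(i + key_l) % n]]
        b = right[perm_r[(i + key_r) % n]]
        rows.append({**dict(zip(left_cols, a)), **dict(zip(right_cols, b))})
    return rows

# Sample data (assumed); every card number is distinct.
original = [
    {"name": "Ann", "card": "1111"},
    {"name": "Bob", "card": "2222"},
    {"name": "Cy", "card": "3333"},
    {"name": "Dee", "card": "4444"},
]
perm_l = make_permutation(len(original), seed=7)
perm_r = make_permutation(len(original), seed=11)
left, right = split_table(original, ["name"], ["card"], perm_l, perm_r, 3, 5)

# With the right permutations and keys, the original table is recovered.
recovered = reconstruct(left, right, ["name"], ["card"], perm_l, perm_r, 3, 5)
# With a wrong key, the output is a well-formed table whose rows are mismatched.
mismatched = reconstruct(left, right, ["name"], ["card"], perm_l, perm_r, 3, 6)
```

In this sketch the mismatched output has the same shape as the correct one, which gives a rough sense of why an output table alone does not reveal whether a reconstruction attempt succeeded.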
The present invention, both as to its construction and its method of operation, together with additional objects and advantages thereof, will best be understood from the following description of specific embodiments when read in connection with the accompanying drawings.