1. Field of the Invention
The present invention generally relates to data processing and more particularly to protecting sensitive data in underlying databases.
2. Description of the Related Art
Databases are computerized information storage and retrieval systems. A relational database management system is a computer database management system (DBMS) that uses relational techniques for storing and retrieving data. The most prevalent type of database is the relational database, a tabular database in which data is defined so that it can be reorganized and accessed in a number of different ways.
Regardless of the particular architecture, in a DBMS, a requesting entity (e.g., an application or the operating system) demands access to a specified database by issuing a database access request. Such requests may include, for instance, simple catalog lookup requests or transactions and combinations of transactions that operate to read, change and add specified records in the database. These requests are made using high-level query languages such as the Structured Query Language (SQL). Illustratively, SQL is used to make interactive queries for getting information from and updating a database such as International Business Machines' (IBM) DB2, Microsoft's SQL Server, and database products from Oracle, Sybase, and Computer Associates. The term “query” denominates a set of commands for retrieving data from a stored database. Queries take the form of a command language that lets programmers and programs select, insert, update, find out the location of data, and so forth.
One significant issue in the context of databases is security. Databases often contain confidential or otherwise sensitive data which requires some degree of security to be protected from inappropriate and unauthorized access. For example, medical records contained in a medical data warehouse are generally considered highly personal and confidential. As such, access to medical records is typically restricted using various security mechanisms.
One of the mechanisms used to restrict access to data is the application of an output formatter. In general, an output formatter is a kind of filter that operates to selectively remove data from a result set. For example, an output formatter may be configured to remove a particular column from the result set. Alternatively, the output formatter may be configured to remove a value from selected cells in the result set.
Regarding the latter configuration, there may be several reasons for wanting to remove the values of selected cells from a given result set. For example, consider a researcher querying data in a medical data warehouse of a medical institution having sensitive data with respect to patients of the institution. The sensitive data might include HIV test results. While a column level filter could be applied for the HIV Test field, this may prevent the researcher from performing his/her study. Accordingly, a more granular value level filtering technique is required. For example, the basis for restriction may be the celebrity status of a given individual. Thus, while the researcher may be given access to HIV test results of the non-celebrity general public, his/her access to celebrity test results may be restricted because of the high degree of liability facing the institution should such information inadvertently be leaked. Accordingly, for a query that includes HIV Test as a result field, a given result set my appear as follows:
Patient IDNameHIV Test2Paula Superstarnull4JoeNegative8JaneNegative9Joe Celebritynull12AlexPositive
Assume that Joe, Jane and Alex are non-celebrity individuals, and that Paula Superstar and Joe Celebrity are well-known celebrity figures recognizable by the researcher. Accordingly, the output formatter has removed the actual value for the cells in the HIV Test fields for Paula Superstar and Joe Celebrity and replaced it with null values. In the result set above the patients are arranged according to their patient ID. Therefore, the researcher can make no conclusion about the actual value of the HIV Tests for Paula Superstar and Joe Celebrity. However, if the researcher runs the same query again but adds a command to sort the results based on the HIV Test field and to further sort based on the alphabetical order of the names, the arrangement of the results may be as follows:
Patient IDNameHIV Test8JaneNegative4JoeNegative2Paula Superstarnull12AlexPositive9Joe Celebritynull
Now, the researcher can conclude that the HIV test value for Joe Celebrity is positive and that the HIV test value for Paula Superstar is negative. This conclusion is based on the observation that Paula Superstar's record occurs before Alex's record and that Joe Celebrity's record occurs after Alex's record and Alex's HIV test value is positive.
Therefore, there is a need for improved security mechanisms for protecting sensitive data from inappropriate and unauthorized access.