1. Field of the Invention
The present invention is generally related to data processing, and more specifically to processing data retrieved from a database.
2. Description of the Related Art
Databases are computerized information storage and retrieval systems. A relational database management system is a computer database management system (DBMS) that uses relational techniques for storing and retrieving data. The most prevalent type of database is the relational database, a tabular database in which data is defined so that it can be reorganized and accessed in a number of different ways. A distributed database is one that can be dispersed or replicated among different points in a network. An object-oriented programming database is one that is congruent with the data defined in object classes and subclasses.
Regardless of the particular architecture, in a DBMS, a requesting entity (e.g., an application or the operating system) demands access to a specified database by issuing a database access request. Such requests may include, for instance, simple catalog lookup requests or transactions and combinations of transactions that operate to read, change, and add specified records in the database. These requests are made using high-level query languages such as the Structured Query Language (SQL) and application programming interfaces (APIs) such as Java® Database Connectivity (JDBC). The term “query” denotes a set of commands for retrieving data from a stored database. Queries take the form of a command language, such as SQL, that lets programmers and programs select, insert, update, and locate data, among other operations.
Any requesting entity, including applications, operating systems and, at the highest level, users, can issue queries against data in a database. Queries may be predefined (i.e., hard coded as part of an application) or may be generated in response to input (e.g., user input). Upon execution of a query against a database, a query result is returned to the requesting entity.
For example, a medical researcher may issue queries against a database to retrieve data to support research efforts. The data may include, for example, patient records that may be used to determine the pathology for particular disorders. Patient records may include, for example, a patient's demographic data, values for administered tests, testing conditions, patient response to tests, doctor's notes, and the like. Studying the data related to a particular disorder stored in a database may allow researchers to devise adequate measures to improve prevention, diagnosis, and management of the disorder.
One problem with retrieving data for medical research is that not all data retrieved by a query may be desirable. For example, a researcher may collect data for his research from a number of sources, for example, from one or more hospitals. If a hospital does not have reliable procedures for data collection, the data may be unreliable, and therefore undesirable for inclusion in the research. For example, a hospital may use outdated equipment for conducting tests on a patient, thereby making that hospital's data unreliable and undesirable for research purposes.
Any given database may also contain invalid data that can be returned in a given query result, such as negative age values. Invalid data can be introduced into a given database for various reasons, such as typographical errors, architectural problems with data replication and timing, mistakes in original data acquisition, and the like. Because of the invalid data, the given query result can be useless to a corresponding requesting entity that wants to further process the query result. For instance, if the researcher wants to determine the average age of patients in a hospital for whom a specific treatment is suitable and the query result includes negative age values, an incorrect average value is obtained. Accordingly, some level of data cleansing is needed to ensure data consistency, accuracy, and reliability in a given database.
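The effect of a single invalid value on an aggregate can be illustrated with a small sketch. The example below is an assumption-laden illustration only: the `patients` table, its columns, and the age values are hypothetical and do not come from any particular database. It uses Python's built-in `sqlite3` module to compare an average computed over all rows with one computed over non-negative ages.

```python
import sqlite3

# Hypothetical in-memory table and values, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (id INTEGER PRIMARY KEY, age INTEGER)")
conn.executemany(
    "INSERT INTO patients (age) VALUES (?)",
    [(30,), (40,), (50,), (-40,)],  # -40 stands in for an invalid entry
)

# Average over every row, including the invalid negative age.
avg_all = conn.execute("SELECT AVG(age) FROM patients").fetchone()[0]

# Average over rows that pass a simple validity check (age >= 0).
avg_valid = conn.execute(
    "SELECT AVG(age) FROM patients WHERE age >= 0"
).fetchone()[0]

print(avg_all)    # 20.0
print(avg_valid)  # 40.0
conn.close()
```

In this sketch, one invalid row out of four halves the reported average, which is the kind of distortion described above.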
However, in large databases, data cleansing is an expensive and time-consuming process that may require a large amount of processor resources and an even larger amount of manpower. Accordingly, data cleansing is not automatically implemented and/or frequently performed in database environments and, as a result, corresponding databases may include undesirable or invalid data. Thus, a user needs to perform a manual clean operation on each query result obtained from such a database in order to identify invalid data included therein prior to further processing of the query result. More specifically, the user must either exhaustively examine any data returned from the database to verify that the data is valid, or execute suitable database queries that are configured to identify whether the database includes invalid data.
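As a rough illustration of the second option, a query configured to identify invalid data might look like the following sketch. The `patients` table, the helper name `find_invalid_ages`, and the plausible age range of 0 to 130 are all assumptions chosen for this example. A real validity rule would depend on the data in question.

```python
import sqlite3


def find_invalid_ages(conn):
    """Return (id, age) rows whose age is missing or outside an assumed 0-130 range."""
    return conn.execute(
        "SELECT id, age FROM patients "
        "WHERE age IS NULL OR age < 0 OR age > 130 "
        "ORDER BY id"
    ).fetchall()


# Hypothetical in-memory table and values, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (id INTEGER PRIMARY KEY, age INTEGER)")
conn.executemany(
    "INSERT INTO patients (id, age) VALUES (?, ?)",
    [(1, 30), (2, -40), (3, 250), (4, 61)],
)

invalid_rows = find_invalid_ages(conn)
print(invalid_rows)  # [(2, -40), (3, 250)]
conn.close()
```

A check of this kind has to be written and run separately for each field and each validity rule, which is part of why the manual approach scales poorly.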
Accordingly, what is needed are methods, systems, and articles of manufacture for retrieving data based on a quality of the data.