Computing devices and distributed computer systems are widely used to store and maintain large amounts of data. Computing devices that are primarily responsible for managing data are usually referred to as databases. Databases may also be implemented as software components running on non-specialized computing devices. Additionally, software databases sometimes interact with devices equipped with large amounts of physical memory or may be implemented in a distributed fashion by storing data in a plurality of physical devices. Data is usually added to, modified, or removed from a database by human operators or by software components. Database designers and administrators, software developers, and other database users usually organize data into a finite set of categories based on the characteristics of data stored in the database. In particular, relational databases provide the methodology for organizing data into logical formations, or tables, which consist of records. Records may be further broken down into fields, some of which may additionally serve as logical connections between database tables.
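As an illustration only, the relational organization described above (tables made of records, records made of fields, and a field that links one table to another) can be sketched with Python's built-in `sqlite3` module. The table names `users` and `awards`, their columns, and the sample rows are hypothetical and are not taken from any system described here.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two related tables (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE users (
            user_id INTEGER PRIMARY KEY,
            name    TEXT NOT NULL
        );
        -- awards.user_id is the field that logically connects the two tables.
        CREATE TABLE awards (
            award_id INTEGER PRIMARY KEY,
            user_id  INTEGER NOT NULL REFERENCES users(user_id),
            points   INTEGER NOT NULL
        );
        """
    )
    conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])
    conn.executemany(
        "INSERT INTO awards VALUES (?, ?, ?)",
        [(1, 1, 50), (2, 1, 25), (3, 2, 10)],
    )
    return conn


def points_per_user(conn):
    """Join the tables on the shared user_id field and total the points per user."""
    rows = conn.execute(
        """
        SELECT u.name, SUM(a.points)
        FROM users AS u
        JOIN awards AS a ON a.user_id = u.user_id
        GROUP BY u.user_id
        ORDER BY u.name
        """
    ).fetchall()
    return dict(rows)
```

Calling `points_per_user(build_demo_db())` follows the `user_id` field from each award record back to its user record, which is the kind of logical connection between tables that the paragraph describes.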
With the increase in computing power, many industries and organizations have come to rely on computerized databases not only for data storage but also for statistical analysis of the stored data and for other automated operations related to the patterns identified in the stored data. As a result, the importance of obtaining data which is accurate and up to date has further increased. Moreover, the growing number of database interfaces has provided more methods of populating and updating databases. Meanwhile, these additional sources of data invariably produce additional sources of errors in the data. With the advent of the World Wide Web, the sources of data that may be stored in the database have become almost limitless, making the issue of data accuracy more relevant than ever before.
Users of the World Wide Web distributed computing environment may freely send and retrieve data across long distances and between remote computing devices. The Web, implemented on the Internet, presents users with documents called “web pages” that may contain information as well as “hyperlinks” which allow the users to select and connect to related web sites. The web pages may be stored on remote computing devices, or servers, as hypertext-encoded files. The servers use Hypertext Transfer Protocol (HTTP) or other protocols to transfer the encoded files to client users. Many users may remotely access the web sites stored on network-connected computing devices from a personal computer (PC) through a browser application running on the PC.
The browser application may act as an interface between user PCs and remote computing devices and may allow the user to view or access data that may reside on any remote computing device connected to the PC through the World Wide Web and browser interface. Typically, the local user PC and the remote computing device may represent a client and a server, respectively. Further, the local user PC or client may access Web data without knowing the source of the data or its physical location, and publication of Web data may be accomplished by simply assigning to the data a Uniform Resource Locator (URL) that refers to the local file. To a local client, the Web may appear as a single, coherent data delivery and publishing system in which individual differences between other clients or servers may be hidden.
A system that may provide web site proprietors with web site user demographics information is generally described in U.S. application Ser. No. 09/080,946, “DEMOGRAPHIC INFORMATION GATHERING AND INCENTIVE AWARD SYSTEM AND METHOD” to Bistriceanu et al., the entire disclosure of which is hereby incorporated by reference. Generally, the system may include users, web site proprietors, and an enterprise system hosting a central web site. The users may register with the central web site and may earn “points” for performing specific on- or off-line tasks in exchange for disclosing their demographic information during registration. The users may then redeem their earned points at participating proprietors for merchandise or services. Generally, the central web site manages the system by performing a number of tasks including: maintaining all user demographic information, tracking user point totals, and awarding points according to specific, proprietor-defined rules.
The system described in the above-referenced application may employ either a centralized or a distributed database to maintain the information related to users, tasks, proprietors, and services. Data may arrive at the database from a variety of sources, such as users of the system, system administrators, automated processes servicing the system, web site proprietors, and others. Moreover, some data, such as real time updates, may arrive as individual entries, and some data may arrive in bulk, in large files delivered as part of periodic updates. Clearly, while an individual entry containing erroneous or corrupted data may have some negative impact on the operation of the system, storing large amounts of faulty data in the database may have very dire consequences. For example, large amounts of faulty data may cause the system to award too many points to a significant number of users, leading to large financial losses, or, conversely, to award an insufficient number of points to a significant number of users, resulting in low customer satisfaction. Moreover, statistical analysis may be further impaired by taking into account inaccurate data.
Several methods of assessing the quality of data in a database are known in the art. For example, credit card companies typically detect patterns in credit card usage by a particular card holder. In accordance with a method referred to as “velocity checking,” credit card companies attempt to detect excessive quantities, such as rates or amounts, for a particular card holder and flag the associated transactions. However, this method of detecting abnormal behavior or erroneous reporting can only be applied to individual members.
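The velocity checking idea described above can be illustrated with a minimal sketch. It assumes each transaction is a `(card_id, timestamp, amount)` tuple supplied in time order, and it flags a transaction when the same card holder exceeds a count or a total amount within a sliding time window. The function name `velocity_flags` and the thresholds are hypothetical choices for illustration, not the method any particular credit card company uses.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def velocity_flags(transactions, window=timedelta(hours=1),
                   max_count=3, max_amount=1000.0):
    """Return indices of transactions that exceed a per-card rate or amount limit.

    transactions: sequence of (card_id, timestamp, amount), sorted by timestamp.
    All limits are illustrative assumptions.
    """
    recent = defaultdict(deque)  # card_id -> (timestamp, amount) pairs in the window
    flagged = []
    for index, (card_id, ts, amount) in enumerate(transactions):
        history = recent[card_id]
        # Discard this card's entries that are older than the window.
        while history and ts - history[0][0] > window:
            history.popleft()
        history.append((ts, amount))
        total = sum(a for _, a in history)
        # Flag when the card's recent count or recent total is excessive.
        if len(history) > max_count or total > max_amount:
            flagged.append(index)
    return flagged
```

Note that the state is kept per `card_id`, so every decision is made against a single card holder's own history. This reflects the limitation stated above: the check looks at individual members rather than at the data set as a whole.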
On the other hand, a method known as “straightlining” is used in the market research industry to detect fraudulent behavior. Because some survey respondents fill out questionnaires with made-up or otherwise false data, for example by picking the first answer to every question on the survey, the method of straightlining may be used to detect and discard those surveys that appear dishonest. However, this method is similarly limited because it can be applied only on a survey-by-survey basis.
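A straightlining check of the kind described above might be sketched as follows, under the simplifying assumption that a survey is a list of selected option indices and that a survey is suspect when every question received the same answer. The names `is_straightlined` and `filter_surveys` and the `min_questions` cutoff are hypothetical; real market research tools may use more nuanced criteria.

```python
def is_straightlined(answers, min_questions=5):
    """Return True if every answer in a sufficiently long survey is identical.

    answers: list of selected option indices for one survey.
    min_questions: illustrative cutoff so very short surveys are not penalized.
    """
    if len(answers) < min_questions:
        return False
    return len(set(answers)) == 1


def filter_surveys(surveys, min_questions=5):
    """Split a {survey_id: answers} mapping into kept and discarded survey ids."""
    kept, discarded = [], []
    for survey_id, answers in surveys.items():
        if is_straightlined(answers, min_questions):
            discarded.append(survey_id)
        else:
            kept.append(survey_id)
    return kept, discarded
```

Each survey is judged in isolation here, which mirrors the survey-by-survey limitation noted in the text.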