The invention concerns a method and apparatus for checking whether a new record to be added to a database is a duplicate of an existing record.
FIG. 1 illustrates a simple table which exists in a hypothetical database of a bank. The table lists four types of information, arranged in columns: (1) the cities in which the bank branches are located, (2) the total assets, or deposits, of each branch, (3) the customers who maintain accounts at each branch, and (4) the balance of each account.
During operation of the bank, entries within a row will change. (A row is also sometimes called a "record.") For example, if fifty dollars is deposited to the WILSON account, the ACCOUNT BALANCE will be changed to $150.
An entire row may change, as when it is deleted. For example, the row "ANTIOCH, 1000, WILSON, 100" may be deleted when the Wilson account closes. Conversely, a row may be added when a new customer opens an account.
Some types of databases do not allow a new row to be added if the new row contains information which is identical to that contained in an existing row. For example, if a new customer named UNSER wishes to open an account at the ANTIOCH branch by depositing 75 dollars, and an existing row already holds identical data for another UNSER, a duplicate row would be created. However, such a situation is illegal, as indicated in FIG. 2.
The duplicate row can create several problems. For example, if the first UNSER wishes to close the account, the question arises, Which row should be deleted? As another example, an uninformed observer may view the duplicate row as a mistake, and presume it to be a duplicate of the first UNSER's data, when, in reality, it represents the account of a second UNSER.
Several approaches are available to prevent this duplication. In one approach, when a new row is to be added, all rows of the database are examined, and compared with the new row. If the examination finds that the new row matches no existing row, the new row is added.
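The exhaustive comparison described above can be sketched as follows. This is a minimal illustration only, assuming that rows are represented as tuples held in a Python list; the function name `add_row_if_new` is hypothetical, and the sample rows are illustrative values rather than the actual contents of FIG. 1.

```python
def add_row_if_new(table, new_row):
    """Append new_row to table unless an identical row already exists.

    Returns True if the row was added, False if it was a duplicate.
    """
    for existing in table:          # compare against every existing row
        if existing == new_row:
            return False            # match found: reject the new row
    table.append(new_row)           # no match: the new row is added
    return True


# Illustrative rows: (city, branch assets, customer, account balance)
table = [("ANTIOCH", 1000, "WILSON", 100)]
added_first = add_row_if_new(table, ("ANTIOCH", 1000, "UNSER", 75))
added_again = add_row_if_new(table, ("ANTIOCH", 1000, "UNSER", 75))
```

In this sketch the first UNSER row is accepted and the identical second row is rejected, at the cost of one comparison per existing row.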
However, this approach is time-consuming. For example, assume that a fresh database is created, and contains a single row. When a second row is added later, a single comparison is required, between the second and first rows. Addition of the third row requires two comparisons. In general, the number of comparisons required to add a row is proportional to the number of existing rows, as indicated in FIG. 3.
However, the total number of comparisons performed since creation of the database is a square-law function of the number of rows, as indicated in FIG. 4. Viewed graphically, the total number of comparisons, past and present, equals the area of the hatched triangle. The area of the triangle equals (½)×(no. of rows)**2. If one million rows are present today, then a total of about 5×10**11 comparisons will have been made by the time a new row is added today.
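The square-law growth can be checked with a short calculation. The sketch below assumes, as in the example above, that adding the k-th row costs k - 1 comparisons; the function name `total_comparisons` is hypothetical. The exact total is n(n - 1)/2, which the triangle area (½)n**2 approximates closely when n is large.

```python
def total_comparisons(num_rows):
    """Total comparisons made while building a table of num_rows rows,
    when each new row is compared with every row already present."""
    # Adding row k (k = 2..num_rows) costs k - 1 comparisons.
    return sum(k - 1 for k in range(2, num_rows + 1))


n = 1000
exact = total_comparisons(n)    # equals n * (n - 1) / 2
approx = n ** 2 / 2             # triangle-area estimate, (1/2) * n**2
```

For one million rows the exact expression n(n - 1)/2 gives 499,999,500,000, which agrees with the estimate of about 5×10**11.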
These comparisons are time-consuming.
An object of the invention is to provide an improved database management system.
A further object of the invention is to provide an improved system for preventing duplication of rows in a database.
A Bloom Filter is generated, based on the database. When a new row is to be added, the Bloom Filter is consulted to determine whether the new row duplicates an existing row. If duplication is not found, the new row is added.
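A minimal sketch of such a filter is given below, for illustration only and not as the claimed implementation. It assumes rows are tuples, uses salted SHA-256 digests as the hash functions, and stores one byte per bit for simplicity; the names `BloomFilter`, `might_contain`, and `add_row_with_filter`, as well as the filter sizes, are hypothetical. Because a Bloom Filter can report a match for a row that was never stored (a false positive), the sketch also assumes that a reported match is confirmed against the table before the new row is rejected.

```python
import hashlib


class BloomFilter:
    """Bit array plus several hash functions; may report false
    positives, but never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)   # one byte per bit, for simplicity

    def _positions(self, row):
        # Derive num_hashes positions by salting SHA-256 with an index.
        data = repr(row).encode("utf-8")
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, row):
        for pos in self._positions(row):
            self.bits[pos] = 1

    def might_contain(self, row):
        # False means the row is definitely absent; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(row))


def add_row_with_filter(table, bloom, new_row):
    """Add new_row unless it duplicates an existing row."""
    # Assumption: a positive answer from the filter is confirmed by a scan.
    if bloom.might_contain(new_row) and new_row in table:
        return False                      # confirmed duplicate
    table.append(new_row)
    bloom.add(new_row)
    return True


table = []
bloom = BloomFilter()
first = add_row_with_filter(table, bloom, ("ANTIOCH", 1000, "UNSER", 75))
second = add_row_with_filter(table, bloom, ("ANTIOCH", 1000, "UNSER", 75))
```

When the filter reports that a row is absent, the row can be added without examining any existing row, which is the source of the time saving over the exhaustive comparison.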