1. Field of the Invention
The present invention relates generally to a computer system and method of data analysis including pattern matching systems, and more particularly, to a neural network pattern matching system used to determine data similarities.
2. Description of the Related Art
The process of examining a repository of electronic information, such as a database, to ensure that its contents are accurate has been used by audit and industry-specific consulting firms over the past decade. It has been extensively applied in the area of litigation support databases, where the information contained in an electronic database is used to support expert testimony or arguments that pertain to a case. The need to have the contents of a litigation support database accurate and complete is crucial, since inaccurate data could lead to imprecise conclusions that in turn result in either false expert testimony or unfavorable litigation results.
Likewise, most databases used in commercial environments, either for internal purposes (such as inventory databases) or for external sale (such as mailing lists, Dun & Bradstreet lists, phone lists, etc.), are generally acknowledged to be less than 100% accurate. Knowing the accuracy of a database is crucial, since it enables the user of the database to determine the extent to which it can be relied upon for critical decisions.
The accuracy of a database is defined as the number of correct entries expressed as a percentage of the population of entries in the database. For example, if a database has 10 records, each record consisting of 5 fields, the population of entries in the database is 50. If 45 entries are correct, then the accuracy of the database is 90% (45/50*100%=90%). The accuracy of a database is often referred to as the accuracy rate. The complement of the accuracy rate is the error rate, which is simply 100% minus the accuracy rate.
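The arithmetic of this example can be sketched as follows. The sketch is illustrative only: the function name accuracy_rate is hypothetical, and the figure of five fields per record is an assumption inferred from the stated 10 records and population of 50 entries.

```python
def accuracy_rate(correct_entries: int, total_entries: int) -> float:
    """Return correct entries as a percentage of all entries."""
    if total_entries <= 0:
        raise ValueError("total_entries must be positive")
    return 100.0 * correct_entries / total_entries


# Assumed layout: 10 records of 5 fields each gives a population of 50 entries.
population = 10 * 5
accuracy = accuracy_rate(45, population)  # accuracy rate, in percent
error = 100.0 - accuracy                  # error rate, in percent
print(f"accuracy rate = {accuracy:.0f}%, error rate = {error:.0f}%")
```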
The most precise way to determine the accuracy of a database is to review each and every field within each and every record in the database. However, in virtually every real-life situation the cost and time to review an entire database is prohibitive. Instead, a conventional technique is to request a professional trained in information audits to determine the accuracy of a database. The professional selects a subset (called a sample) representative of the database, reviews the sample to determine its accuracy, and extrapolates the results to the entire database. The procedure followed by the professional essentially comprises the following four steps.
In a first step, the professional applies conventional statistical formulae to determine an appropriate size of the sample. The size is determined so that, with a specified level of accuracy, the sample will exhibit all the characteristics and peculiarities of the entire database, including its accuracy rate.
Specifically, the sample size is derived by specifying three variables--target accuracy, minimum accuracy and confidence level--and applying these variables to standard statistical formulae. The target accuracy is the accuracy level desired to be achieved. The minimum accuracy is the accuracy level acceptable as adequate. The confidence level is the level of precision of the process of determining the accuracy.
In addition to deriving the sample size, the standard formulae also derive the reject level, and P0 and P1 values. The reject level is the threshold number of errors that the sample can have before it is deemed as not having met the target accuracy requirement. P0 is the probability that the sample happens to be particularly inaccurate when in fact the database as a whole happens to be more accurate than the sample might indicate. P1 is the probability that the sample happens to be particularly accurate when in fact the database happens to be less accurate than the sample might indicate.
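The text does not identify which statistical formulae are used. As one possible reading, the following sketch assumes a simple binomial acceptance-sampling model, in which each sampled item is independently in error with a fixed probability. Under that assumption it searches for the smallest sample size and reject level for which both P0 and P1 stay within the risk allowed by the confidence level. The function names, the search limit max_n, and the interpretation of the confidence level as one minus the allowed risk are all assumptions made for illustration.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """Probability of at most k errors among n items with error probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def sampling_plan(target_acc, min_acc, confidence, max_n=2000):
    """Return (sample_size, reject_level, p0, p1) under the assumed binomial model."""
    risk = 1.0 - confidence
    err_target = 1.0 - target_acc   # error probability if the target is met
    err_minimum = 1.0 - min_acc     # error probability at the minimum accuracy
    for n in range(1, max_n + 1):
        for c in range(n + 1):
            # P1: sample passes (errors <= c) although accuracy is only the minimum.
            p1 = binom_cdf(c, n, err_minimum)
            if p1 > risk:
                break  # a larger reject level only increases P1
            # P0: sample fails (errors > c) although accuracy meets the target.
            p0 = 1.0 - binom_cdf(c, n, err_target)
            if p0 <= risk:
                return n, c, p0, p1
    raise ValueError("no plan found within max_n items")


n, c, p0, p1 = sampling_plan(target_acc=0.99, min_acc=0.90, confidence=0.90)
print(f"sample size={n}, reject level={c}, P0={p0:.3f}, P1={p1:.3f}")
```

In this sketch the sample is deemed to meet the target accuracy when it contains no more than the reject level of errors; other conventions for the threshold are equally possible.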
In a second step, the professional randomly selects items from the database until the number selected equals the sample size determined. The items are commonly fields of the database, selected in groups of entire records.
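A minimal sketch of this selection step is given below, assuming the database is held in memory as a list of records, each record being a list of field values. The function name select_sample and the seed parameter are hypothetical. Consistent with the conventional approach, every field of a selected record counts toward the sample, whether or not it is filled.

```python
import random


def select_sample(records, sample_size, seed=None):
    """Pick whole records at random until enough fields have been collected."""
    rng = random.Random(seed)
    order = list(range(len(records)))
    rng.shuffle(order)                 # random order, no record drawn twice
    chosen, items = [], 0
    for index in order:
        if items >= sample_size:
            break
        chosen.append(index)
        items += len(records[index])   # every field counts, filled or empty
    if items < sample_size:
        raise ValueError("database too small for the requested sample size")
    return chosen, items


# Hypothetical database: 20 records of 5 fields each.
database = [[f"r{r}f{f}" for f in range(5)] for r in range(20)]
chosen, items = select_sample(database, sample_size=12, seed=0)
print(f"{len(chosen)} records selected, {items} items in the sample")
```

Because fields are selected in groups of entire records, the number of items collected may slightly exceed the required sample size.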
In a third step, the professional determines the number of inaccuracies in the sample. This is done by comparing the contents of the records selected for the sample with the original material that was used to fill the records in the first place.
For example, if the database contains information that represents invoices, then the professional compares the contents of the database records of the sample with the actual information on the invoices. If the database is a mailing list, and it consists of the name, address and telephone number fields, one way to verify the contents of the database record is to try to call the party represented by the record and see if the name, address and phone number are accurate. In general, any data entered into a database must come from an original source, and validating the contents of a database means going to that source and double-checking the contents of the records selected for the sample with the original material.
In a fourth step, the professional determines the accuracy of the sample, and then extrapolates the results to the entire database as follows. When all the records in the sample have been checked against the original source material, the total number of errors is tabulated.
This tally is a number between 0 and the sample size. By dividing the number of errors by the sample size, the professional derives the error rate. By dividing the number of accurate data items by the sample size (or by subtracting the error rate from 100%) the professional derives the accuracy rate.
Because the sample records are assumed to represent the entire database, the accuracy of the database is assumed to be the accuracy of the sample, and hence the accuracy of the database is derived.
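The third and fourth steps can be sketched together as follows, assuming that both the sample records and the original source material are available as dictionaries keyed by a record identifier. The function name tally_errors, the field names and the invoice data are hypothetical and serve only to illustrate the tabulation.

```python
def tally_errors(sample, source):
    """Count sampled fields whose value differs from the original source."""
    errors = items = 0
    for key, record in sample.items():
        original = source[key]
        for field, value in record.items():
            items += 1
            if value != original.get(field):
                errors += 1
    return errors, items


# Hypothetical sample records and the source material they were entered from.
sample = {
    "INV-1": {"vendor": "Acme", "amount": "100.00"},
    "INV-2": {"vendor": "Acmee", "amount": "250.00"},  # vendor mistyped
}
source = {
    "INV-1": {"vendor": "Acme", "amount": "100.00"},
    "INV-2": {"vendor": "Acme", "amount": "250.00"},
}

errors, items = tally_errors(sample, source)
error_rate = 100.0 * errors / items      # percent of sampled items in error
accuracy_rate = 100.0 - error_rate       # extrapolated to the whole database
print(f"errors={errors} of {items}, accuracy rate={accuracy_rate:.1f}%")
```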
General Deficiencies
It has been discovered that a substantial deficiency with the current method is that it relies too heavily on the skill and experience of select professionals who know how each step is implemented, the sequence of steps, as well as how the results of one step must be properly applied to initiate the next step. The cost of employing such professionals effectively renders this service unavailable to the majority of database users. Further, even for skilled professionals in this area, it has been discovered that there is no tool to adequately manage database audits. Accordingly, the process is inefficient and non-standardized across various database audits, and hence "unscientific".
Additionally, there are various deficiencies with the approach, method and assumptions used by the professional in conducting database audits according to the conventional technique. Some of these deficiencies can potentially result in inaccurate audits or inaccurate conclusions based on an audit. Other deficiencies highlight the inefficiency of the current method for conducting database audits. Still other deficiencies limit the usefulness of the audit results. These deficiencies will be explained in connection with the four-step outline described above.
Deficiencies with Second Step
A conventional approach for carrying out the second step of selecting samples regards empty fields as equal items to sample as filled fields (i.e., fields with values). However, it has been recognized that the value of a database is generally the information that it does contain, rather than the information that it is lacking. Accordingly, it has been discovered that an audit which includes empty fields is in many instances incorrect, since what users generally want to know is the accuracy rate of the information being used, i.e., of filled fields.
Additionally, it has been realized that errors are less likely to occur in empty fields than in filled fields. Since errors in a database are generally the result of human oversight during the data entry process, it has been discovered that it is more common to enter an incorrect entry than to inadvertently leave a field completely empty. Also, many fields in a database may intentionally be left empty for future consideration.
When included in an audit, such fields are guaranteed to be correct. Therefore, it has been discovered that the results of audits that include empty fields generally overstate the accuracy of the database.
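The overstatement can be illustrated with a small sketch. The data below are invented for illustration: each field is represented as a pair of its value and whether it is correct, and an empty field (None) is treated as correct, as described above.

```python
# Each entry is (value, is_correct); None marks an empty field.
fields = [
    ("Smith", True), ("12 Main St", False), ("555-0100", True),
    ("Jones", True), ("9 Elm Rd", True), ("555-0199", False),
    (None, True), (None, True), (None, True), (None, True),
]

filled = [(value, ok) for value, ok in fields if value is not None]

# Accuracy over all fields, empty ones included.
accuracy_all = 100.0 * sum(ok for _, ok in fields) / len(fields)
# Accuracy over filled fields only.
accuracy_filled = 100.0 * sum(ok for _, ok in filled) / len(filled)

print(f"all fields: {accuracy_all:.1f}%, filled fields only: {accuracy_filled:.1f}%")
```

With these invented figures, counting the four empty fields raises the reported accuracy above the accuracy of the information actually present.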
Another deficiency with the conventional second step (selecting the sample) is the inability to adequately handle focus groups of fields. Focus groups are different sets of fields that have different degrees of importance within the database. For example, in a mailing list database, the name, address, and telephone number fields often need to be very accurate so that mailings arrive at their intended destination. However, it has been discovered that fields such as the extended industry code (full SIC code) or secondary business interest fields are often of less significance. Since the target accuracy rate may be different for different groups of fields in the database, and the sample size is directly related to the target accuracy rate, it has been discovered that the appropriate sample sizes for these different focus groups will be different.
In order to conduct an audit on multiple focus groups of fields, often separate audits are conducted, one for each focus group. This is extremely time-consuming and very inefficient. To make such an audit efficient, some professionals select the focus group with the largest sample size as a basis for selecting the sample for all focus groups. The assumption is that using the focus group with the largest sample guarantees an adequate sample for the remaining focus groups.
However, it has been discovered that this assumption is not always correct. For example, it has been discovered that the sample selected may not be adequate if it contains many empty fields. Also, it has been discovered that the assumption may not be correct if the focus group with the largest sample happens to include fields that are heavily populated, while the remaining focus groups have fields that are sparsely populated. In that case, it has been discovered that in selecting enough records to complete the group with the largest sample there will not be enough filled fields collected to complete the samples for the remaining focus groups.
For example, suppose there are two focus groups: Group A, which requires 10 items to sample, and Group B, which requires 5 items to sample. Further suppose that Group A and Group B each represent five fields of a 10-field database. The conventional method of selecting one sample that will meet both groups' requirements is to take a sample based on Group A, which is larger, and assume that Group B's requirement of fewer fields to sample will automatically be met. Suppose all of the fields of Group A are filled, and only one of the five fields per record is filled in Group B. Further suppose that empty fields should not be selected for the sample. Based on Group A's sample requirement, only two records will be necessary to complete the sample (2 records*5 filled fields/record=10 items to sample). However, Group B, which has a lower sample size requirement, will need 5 records to complete its sample in this case (5 records*1 filled field/record=5 items to sample). Accordingly, it has been discovered that the assumption of selecting a single sample for all focus groups based on the group with the largest sample is not valid.
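The record counts in this example follow from dividing the number of items required by the number of filled fields each record contributes, rounded up. A minimal sketch follows; the function name records_needed is hypothetical, and the sketch assumes every record contributes the same number of filled fields.

```python
from math import ceil


def records_needed(items_required: int, filled_fields_per_record: int) -> int:
    """Records required when only filled fields count toward the sample."""
    if filled_fields_per_record <= 0:
        raise ValueError("each record must contribute at least one filled field")
    return ceil(items_required / filled_fields_per_record)


group_a = records_needed(items_required=10, filled_fields_per_record=5)
group_b = records_needed(items_required=5, filled_fields_per_record=1)
print(f"Group A needs {group_a} records, Group B needs {group_b} records")
```

Although Group B has the smaller sample size requirement, it needs more records than Group A, so a sample sized for Group A alone falls short.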
It has also been discovered that another problem with the conventional method for selecting the sample is the general disregard of, or incapability of selecting, filtered or skewed samples. A filtered sample is one that ignores certain field values or only includes certain field values in the audit. A skewed sample is one that emphasizes certain field values, but does not limit the audit to just those values.
A filter is used to ignore fields or records for the audit that are not going to be used as a basis for a decision. A "clean" audit considers only the information in the database that is of interest.
A skew is used if all the information in a focus group is needed as a basis for a decision, though some information within the focus group is of greater value than other information. Since the information contained in the entire focus group will be used as a basis for a decision, it has been discovered that a typical filter is not appropriate. (A filter would eliminate some of the information from the audit.) A skew will bias the audit toward including records that meet the skew criteria, though records having fields of any value could be selected to audit. A skew is typically used in auditing financial data, where records representing large transactions are of greater interest to financial auditors than those representing small transactions, though since a financial audit must be comprehensive, all transaction records should have at least the potential of being reviewed.
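The difference between a filter and a skew can be sketched as follows. The sketch assumes a skew can be modeled as weighted random selection without replacement, in which every record keeps a nonzero weight; this is one possible model and is not stated in the text. The function names, the transaction data, the threshold of 1000 and the weight of 10 are all hypothetical.

```python
import random


def apply_filter(records, keep):
    """A filter removes records that fail the criterion from the audit entirely."""
    return [r for r in records if keep(r)]


def skewed_sample(records, weight, k, seed=None):
    """A skew biases selection by weight, yet any record may still be drawn."""
    rng = random.Random(seed)
    pool = list(records)
    chosen = []
    while pool and len(chosen) < k:
        weights = [weight(r) for r in pool]
        i = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(pool.pop(i))   # without replacement
    return chosen


amounts = [50, 75, 120, 9000, 15000, 40, 22000, 60]
transactions = [{"id": n, "amount": a} for n, a in enumerate(amounts)]


def skew(record):
    """Assumed skew: large transactions are ten times as likely to be drawn."""
    return 10.0 if record["amount"] >= 1000 else 1.0


large_only = apply_filter(transactions, lambda r: r["amount"] >= 1000)
picked = skewed_sample(transactions, skew, k=4, seed=1)
print(len(large_only), [r["id"] for r in picked])
```

The filtered list can never contain a small transaction, whereas the skewed sample merely favors large ones.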
Because it has been discovered that the conventional database audit technique generally does not support a variable within an audit (i.e., a focus group, filter or skew), let alone any combination of variables, its result often reflects the accuracy of more information than what will be used as a basis for a decision. As a result, it has been discovered that extraneous information that is not of specific interest at the time of the audit may render the result misleading or inaccurate.
Deficiencies with Third Step
A conventional approach for carrying out the third step of reviewing the sample is for the professional to create a report of the data contained in the records that comprise the sample. The source material that the sample records represent is retrieved separately. The professional then compares the contents of the records with that source material, and notes any discrepancies.
The primary deficiency with this approach is one of inefficiency caused by a lack of standards. There is no set method, and no automated system, that can take the sample records selected in the prior step and print them to standardized reports that are usable for any audit. Therefore, this step is typically done by separately programming a series of reports for each audit of a different database.
Deficiencies with Fourth Step
The conventional approach for carrying out the fourth step of calculating the result of an audit only determines: (1) whether the audit met a target accuracy specification, and (2) the actual accuracy of the database. While this information is useful when deciding the extent to which the information contained in the database can be relied upon for critical decisions, it has been discovered that this information does not guide the user of the database as to how the accuracy rate of the database can be improved.
Therefore, what is needed is an approach to auditing databases wherein a user can conduct an audit by focusing specifically on the contents of a database that will be the basis of a decision. The approach must be able to handle empty fields, focus groups, filters and skews correctly. Further, the approach must be standardized so that audits can be conducted in a uniform way across databases, with only the specific focus and statistics varied as needed for each audit. Finally, an apparatus is needed to enable a typical database user, untrained in the skill of database audits, to independently conduct a database audit and to manage various audits.
In addition to the above need to audit a database using an approach that can handle empty fields, focus groups, filters and skews correctly to indicate the overall probability of database accuracy, there is also a need to determine which data stored in the database is in fact inaccurate. That is, there is the need to correct database inaccuracies. For example, most information that is entered into computer systems today is entered by humans, i.e., data entry operators. Even the best data entry operators make data entry errors. One category of error that is often very costly to organizations and almost impossible to effectively find and correct is what is referred to as "duplicate data". The "duplicate data" error occurs when a single piece of information is inadvertently entered into a computer system two or more times in exact or varied form.
A costly example of this type of error can be found in most corporate accounting systems. As invoices are entered into a corporate accounting system, the same invoice can accidentally be entered two or more times. This can occur either because of a data entry error, i.e. a data entry operator accidentally enters the same data twice, or because the company has been "double-invoiced", that is, invoiced twice for the same product or service. The second invoice is then entered by a data entry operator as if it were the only invoice for the particular product or service. If these "duplicate" invoices are not found, the company pays two or more times for the same product or service. In large companies this problem can cost millions of dollars.
To find "duplicate data" in computer information systems, Management Information System (MIS) departments have relied on a combination of algorithm-based and application-specific solutions. To find "duplicate data", an MIS department will learn the style and parameters of the data, and then devise an algorithm to search for specific patterns to find duplicates.
An example of the conventional method of finding "duplicate data" is the way MIS departments typically deal with "duplicate" invoices. Invoices that are from the same company typically follow a certain pattern, such as "ABC100", "ABC101", etc. To find duplicate invoices a special program is created to search for invoices that match on the first several letters. This will produce a listing of all invoices that start with the same set of letters and vary on the remaining letters. A human then reviews the listing and determines which invoices are in fact "duplicates". The primary goal of this method is to find actual duplicates, i.e., invoices with the identical invoice number.
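A minimal sketch of such a special-purpose program is shown below. It assumes invoice numbers are plain strings and that "the first several letters" means a fixed-length prefix; the function name group_by_prefix, the prefix length of three and the invoice numbers are hypothetical.

```python
from collections import defaultdict


def group_by_prefix(invoice_numbers, prefix_len=3):
    """List invoices that share the same leading letters, for human review."""
    groups = defaultdict(list)
    for number in invoice_numbers:
        groups[number[:prefix_len]].append(number)
    # Only prefixes shared by two or more invoices are candidate duplicates.
    return {prefix: nums for prefix, nums in groups.items() if len(nums) > 1}


invoices = ["ABC100", "ABC101", "ABC100", "XYZ200", "QRS300"]
candidates = group_by_prefix(invoices)
for prefix, numbers in candidates.items():
    print(prefix, numbers)
```

The listing still requires a human reviewer to decide which of the grouped invoices, such as the two entries numbered "ABC100", are in fact duplicates.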
This method of finding "duplicate data" is basically useful in finding exact duplicates. However, it has been discovered that "duplicate data" can be found in a system in a variety of forms that are not identical. FIGS. 17a-17e illustrate, in accordance with the discoveries of the invention described above, the various ways the same data can be entered into a system and still be considered duplicate data, i.e., data which has been entered two or more times identically or in varied form. FIG. 17a illustrates an example of original data. In addition to exact duplicates, the same data can be entered two or more times with any combination of the following types of variations:
Misspelled Letters--a letter is entered incorrectly (FIG. 17b)
Additional Letters--an extra letter is accidentally inserted (FIG. 17c)
Missing Letters--a letter is accidentally left out (FIG. 17d)
Transposed Letters--one letter is accidentally exchanged with another letter (FIG. 17e)
In each of these cases, if the error occurs on the first letter then it has been discovered that the method of finding duplicates by matching on the first several letters will fail. Further, there is a near-infinite number of combinations when the data entry errors described above are combined, so that no matter what algorithm is devised for searching for duplicates, it is guaranteed not to find all errors.
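The failure of prefix matching can be illustrated with a short sketch. The invoice number "ABC100" and the four variants below are invented for illustration and are not the contents of FIGS. 17a-17e; each variant applies one of the four variation types to the first letter. The function name same_prefix and the prefix length of three are likewise hypothetical.

```python
def same_prefix(a: str, b: str, prefix_len: int = 3) -> bool:
    """Conventional test: two entries match if their leading letters agree."""
    return a[:prefix_len] == b[:prefix_len]


original = "ABC100"
variants = {
    "misspelled letter": "XBC100",   # first letter entered incorrectly
    "additional letter": "XABC100",  # extra letter inserted at the front
    "missing letter": "BC100",       # first letter left out
    "transposed letters": "BAC100",  # first two letters exchanged
}

# Variants the prefix test fails to associate with the original entry.
missed = [kind for kind, value in variants.items()
          if not same_prefix(original, value)]
print(f"prefix matching missed {len(missed)} of {len(variants)} variants")
```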
Additionally, for every case where an MIS department suspects that "duplicate data" is a costly problem that must be corrected, a new program must be devised to find these duplicates. This is because the conventional method uses traditional algorithm-based solutions for finding "duplicate data" that are application specific. The pattern that is searched for in one system and application may not be relevant to another system or application.
In summary, the conventional method is not only inefficient, because it must be rewritten for each instance where "duplicate data" needs to be eliminated, but also ineffective at actually finding most "duplicate data".