A business enterprise tracking its sales will easily acquire large amounts of repetitive-type data. An example of such data is represented in the table shown in FIG. 1. In this representation, the fields “TRANSACTION NO.,” “TRANSACTION DATE,” “CUSTOMER,” “SALESPERSON,” and “SALES PRICE” of three records are visible (those of transaction numbers 1-3), but there are many more fields and records.
The data in FIG. 1 show that two sales of an automobile part (a transmission bracket) on October 31 are for close to the same price, but a sale on the next day had a price that was significantly lower. On one hand, many factors could explain this: perhaps the prices offered to the public are generally lower in November, or perhaps that particular customer was granted a discount due to a special relationship with the vendor.
On the other hand, an auditor unfamiliar with a policy of offering lower prices in November or a special relationship with a particular customer may suspect that the discounted sales price was a result of an impropriety. For example, the associated salesperson may have improperly granted the discount accidentally, as a result of misinformation about the customer's status or about a promotional sale, or even on purpose. Auditors, forensic investigators, and data-quality managers familiar with the business may with experience develop a sense for recognizing potentially fraudulent data or data indicative of a mistake resulting from incorrect data entry, staff members misunderstanding operating guidelines, etc.
As the amount of data grows, though, it becomes more difficult to catch fraud and error by human inspection, and accordingly automated tools have been developed to recognize suspicious data. One way to recognize potential frauds and errors begins by generating a report of filtered and sorted data. An example report could be based on a rule to list all transactions in which the discount in a sales price is above a certain threshold, such as 40 percent. When an auditor sees transactions flagged as having discounts greater than 40 percent, the auditor can decide whether to investigate the propriety of those transactions.
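The threshold-based report described above can be sketched as follows. This is only an illustrative sketch: the `Transaction` record, its `list_price` field (from which a discount is computed), the function names, and the sample values are assumptions made for the example and are not taken from FIG. 1.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    number: int
    salesperson: str
    list_price: float   # assumed field; needed to compute a discount
    sales_price: float


def discount_fraction(t: Transaction) -> float:
    """Discount as a fraction of the (assumed) list price."""
    return (t.list_price - t.sales_price) / t.list_price


def flag_high_discounts(transactions, threshold=0.40):
    """Filter transactions whose discount exceeds the threshold,
    sorted with the largest discount first."""
    flagged = [t for t in transactions if discount_fraction(t) > threshold]
    return sorted(flagged, key=discount_fraction, reverse=True)


# Made-up sample data for illustration only.
sample = [
    Transaction(1, "V. McCall", 100.0, 95.0),   # 5 percent discount
    Transaction(2, "R. Cohen", 100.0, 55.0),    # 45 percent discount
    Transaction(3, "T. Nguyen", 100.0, 70.0),   # 30 percent discount
]
report = flag_high_discounts(sample)
```

In this sketch only the 45 percent discount appears in the report; the 30 percent discount passes unnoticed because no rule addresses it.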
However, a limitation of such a technique to find frauds and errors is that only pre-determined types of fraud or error can be found. That is, in the previous example, an improper discount of greater than 40 percent was the subject of investigation only because a human decided to establish a rule to check for such a high discount. Consider the more complex situation in which discounts up to 40 percent are acceptable in November, but only if the salesperson V. McCall (FIG. 1) granted the discount. More specifically, salespersons R. Cohen and T. Nguyen are not authorized to grant discounts in November that are greater than 20 percent.
Accordingly, in the example, the rule to flag discounts greater than 40 percent will be ineffective for catching a November transaction by salesperson R. Cohen, if the discount for one of her transactions was 30 percent. That is, until a human decides to establish another rule that will flag November transactions of R. Cohen having discounts greater than 20 percent (as opposed to 40 percent), her transactions having a 30 percent discount will not be flagged for suspicion of an impropriety.
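The gap between the single 40 percent rule and the salesperson-specific November limits can be illustrated with a short sketch. The table of limits restates the example in the text; the function names, the representation of months as integers, and the use of 40 percent as the limit outside November are assumptions made for illustration.

```python
NOVEMBER = 11
GENERAL_LIMIT = 0.40  # the single hand-written rule from the earlier example

# November limits from the example: only V. McCall may grant up to 40 percent.
NOVEMBER_LIMITS = {
    "V. McCall": 0.40,
    "R. Cohen": 0.20,
    "T. Nguyen": 0.20,
}


def flagged_by_general_rule(discount: float) -> bool:
    """The original rule: flag any discount above 40 percent."""
    return discount > GENERAL_LIMIT


def flagged_by_specific_rule(salesperson: str, month: int, discount: float) -> bool:
    """A second hand-written rule that accounts for the November limits.
    Outside November, the general limit is assumed to apply."""
    if month == NOVEMBER:
        limit = NOVEMBER_LIMITS.get(salesperson, GENERAL_LIMIT)
    else:
        limit = GENERAL_LIMIT
    return discount > limit


# R. Cohen grants a 30 percent discount in November.
missed = flagged_by_general_rule(0.30)                       # False
caught = flagged_by_specific_rule("R. Cohen", NOVEMBER, 0.30)  # True
```

Each additional condition of this kind requires a person to anticipate it and to write the corresponding rule by hand.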
To ease the burden of having to think of and to establish so many rules to flag as much suspicious data as possible, data-mining algorithms have been developed to find patterns in identified improper transactions and then to establish rules based on those patterns. That is, patterns found to be associated with known improper transactions are used to find improper transactions in new data. With reference to the examples above, data regarding the improper transaction of R. Cohen may be input into a data-mining algorithm (along with other data of that type), and then the algorithm can establish a rule to flag similar transactions. With such algorithms, transactions need only be indicated as improper, and then the algorithm can determine the appropriate rule to find similar improper transactions. (Although the examples described here relate to simple rules that a human might establish with minimal difficulty, the algorithm can determine effective rules that are not so simple to discern by humans responsible for large amounts of data.)
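As a rough illustration of deriving a rule from transactions already marked improper, the following toy sketch learns one discount threshold per salesperson and month. It is a deliberately simplified stand-in and not any particular data-mining algorithm; the tuple layout, the function names, and the sample data are all assumptions made for the example.

```python
from collections import defaultdict


def learn_rules(labeled):
    """labeled: iterable of (salesperson, month, discount, is_improper).

    For each (salesperson, month) group that has at least one improper
    example, place a threshold midway between the lowest improper discount
    and the highest proper discount below it.
    """
    proper = defaultdict(list)
    improper = defaultdict(list)
    for salesperson, month, discount, is_improper in labeled:
        target = improper if is_improper else proper
        target[(salesperson, month)].append(discount)

    rules = {}
    for key, bad in improper.items():
        lowest_bad = min(bad)
        good_below = [d for d in proper.get(key, []) if d < lowest_bad]
        floor = max(good_below) if good_below else lowest_bad
        rules[key] = (floor + lowest_bad) / 2
    return rules


def is_flagged(rules, salesperson, month, discount):
    """Flag only when a learned rule exists for the group."""
    threshold = rules.get((salesperson, month))
    return threshold is not None and discount >= threshold


# Made-up labeled history for illustration only.
history = [
    ("R. Cohen", 11, 0.10, False),
    ("R. Cohen", 11, 0.18, False),
    ("R. Cohen", 11, 0.30, True),    # the known improper transaction
    ("V. McCall", 11, 0.35, False),
]
rules = learn_rules(history)
```

Here a threshold near 24 percent is learned for R. Cohen's November transactions, so a new 28 percent discount by R. Cohen would be flagged. No rule is produced for any group that lacks a labeled improper example.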
Although such data-mining algorithms ease the human burden by developing analysis rules given improper transaction data as input, their effectiveness is limited to only establishing rules based on transactions that are identified as improper. That is, the data-mining algorithms do not establish rules for flagging suspicious transactions that have no examples identified previously. There still needs to be a way to analyze data for irregular patterns, even patterns that have not in the past already been associated with improper transactions.
To a limited extent, data-mining algorithms have been developed to analyze transaction data for patterns, to establish rules based on those patterns, and then to flag as suspicious those transactions that do not conform to the rules. Such algorithms, though, are inherently based on the assumption that an impropriety, whether based on fraud or unintentional error, is an exception to a rule and does not conform to the established pattern of the majority. For example, if 97 percent of the November transactions were discounted by no more than 25 percent, a single transaction made by salesperson V. McCall with a discount of 30 percent would likely be flagged as suspicious.
However, if instead 25 percent of the November transactions of V. McCall were discounted by 30 percent, those transactions would be less likely to be flagged as suspicious. That is because, although the transactions may very well be improper, there are so many of them that they do not appear to the algorithm as suspicious. In fact, the algorithm may instead even establish a rule based on transactions by salesperson R. Cohen that considers the transaction discounted by 30 percent to be an acceptable transaction. The present inventors are unaware of any conventional way to flag suspicious patterns of data, even if the suspicious activity is prevalent.
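The difference between an isolated deviation and a prevalent one can be illustrated with a toy majority-based detector that flags any discount level accounting for less than a small share of the transactions. The 5 percent rarity cutoff, the function name, and the sample data are assumptions chosen for illustration; the proportions only approximate those in the example above.

```python
from collections import Counter


def flag_rare_discounts(discounts, min_share=0.05):
    """Return the discount levels (in whole percent) whose share of all
    transactions falls below min_share; everything else is treated as
    conforming to the majority pattern."""
    counts = Counter(discounts)
    total = len(discounts)
    return {level for level, n in counts.items() if n / total < min_share}


# Made-up data: a single 30 percent discount among 100 transactions.
isolated = [10] * 60 + [25] * 39 + [30]
# Made-up data: 25 of 100 transactions discounted by 30 percent.
prevalent = [10] * 50 + [25] * 25 + [30] * 25

flagged_isolated = flag_rare_discounts(isolated)
flagged_prevalent = flag_rare_discounts(prevalent)
```

In the first data set the 30 percent discount is rare and is flagged. In the second, the same discount is common enough that the detector treats it as part of the normal pattern and flags nothing.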