This description relates to fuzzy data operations in the field of data management.
Data operations, such as clustering, join, search, rollup, and sort, are employed in data management to handle data. Clustering is an operation that classifies data into different groups. Join combines two pieces of data together. Search by a key finds data entries that match that key. Rollup is an operation that calculates one or more levels of subtotals (or other combinations) across a group of data. Sort is an operation that orders data.
Data quality is important in data management. Mistakes or inaccuracies resulting from data operations degrade data quality. For example, classifying an employee of Corporation ABC, John Smith, as a temporary worker or a permanent worker entitles John Smith to a different level of benefits. An erroneous classification of John Smith's employment status, e.g., a mistake in the clustering operation, degrades the quality of Corporation ABC's human resource data.
Some implementations of data operations rely on exact comparison of field values (“keys”) to identify matching records, to define groups of related records or to link records. When data is ambiguous, imperfect, incomplete, or uncertain, methods based on exact comparison of field values may break down.
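As an illustration of how exact comparison can break down, the following minimal Python sketch groups records by an exact key. The records, the field names, and the helper `group_by_exact_key` are hypothetical and chosen only for this example.

```python
from collections import defaultdict

# Hypothetical records: three recorded forms of what is assumed to be one company.
records = [
    {"id": 1, "company": "ACME Services Ltd"},
    {"id": 2, "company": "ACME Services Limited"},
    {"id": 3, "company": "Acme Services Ltd."},
]


def group_by_exact_key(rows, key):
    """Group record ids by the exact value of the given field."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row["id"])
    return dict(groups)


exact_groups = group_by_exact_key(records, "company")

# Each variant spelling becomes its own group, so no two records are linked.
for name, ids in exact_groups.items():
    print(name, ids)
```

Because the three key values differ character by character, exact comparison yields three single-record groups even though a reader would likely treat them as one company.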
When there is an inherent ambiguity associated with a data operation, for example, clustering, one approach to resolving the ambiguity is simply to ignore it and to force a piece of data into a particular group. For example, suppose that John Smith, the employee of Corporation ABC, works for both the marketing department and the R&D department. In Corporation ABC's human resource database, John Smith may be associated with either the marketing department or the R&D department, but often is associated with just one department. The forced classification of the piece of data into a particular group may mask the inherent ambiguity and adversely affect data quality.
When there is an uncertainty associated with a data operation, for example, clustering, because the outcome of an event is pending, forcing a piece of data into a particular group may not be the best approach to address the fluidity of the situation. For example, consider a legal dispute between entity A and entity B over the ownership of an asset. Prior to the adjudication, the ownership of the asset is uncertain. Assigning the asset to either A or B may turn out to be inaccurate.
When there is an uncertainty associated with a data operation, for example, rollup, because of an ambiguous identification of group membership, assigning membership to one group among several alternatives to preserve accounting integrity may give a misleading picture. For example, a bank may be interested in determining its exposure on loans to counterparties for risk assessment and regulatory purposes. Identification of a counterparty is often made by company name, which may lead to ambiguous identifications because of wide variability in the recorded form of a company's name. In turn, this means that the assignment of loan exposures to counterparties is ambiguous. It may happen that loans properly associated with one company become divided among several apparently distinct companies, which are actually just variant forms of that one company's name. This results in understating the exposure of the bank to any single counterparty. Alternatively, if an arbitrary selection among alternatives is made, an exposure may be falsely assigned to one counterparty when it properly belongs to another, perhaps overstating the exposure to the first and understating it to the second.
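The understatement described above can be sketched in a few lines of Python. The loan amounts, the company names, and the `canonical` mapping are all invented for illustration; in particular, the mapping assumes the true identity of each name variant is already known, which is exactly what is uncertain in practice.

```python
from collections import defaultdict

# Hypothetical loans: (recorded counterparty name, exposure amount).
loans = [
    ("ACME Services Ltd", 40.0),
    ("ACME Services Limited", 35.0),
    ("Acme Services Ltd.", 25.0),
    ("Bolt Holdings", 60.0),
]


def rollup_exposure(rows):
    """Sum exposure per counterparty, keyed on the exact recorded name."""
    totals = defaultdict(float)
    for name, amount in rows:
        totals[name] += amount
    return dict(totals)


# Rollup on exact names: the ACME exposure is divided among three "companies".
exact_totals = rollup_exposure(loans)
largest_exact = max(exact_totals.values())

# Assumed-known mapping of variants to one name (illustrative only).
canonical = {
    "ACME Services Limited": "ACME Services Ltd",
    "Acme Services Ltd.": "ACME Services Ltd",
}
true_totals = rollup_exposure(
    [(canonical.get(name, name), amount) for name, amount in loans]
)
largest_true = max(true_totals.values())

print(largest_exact, largest_true)
```

With these invented figures, the largest single-counterparty exposure appears to be 60.0 under exact-name rollup, while combining the variants gives 100.0 for the one company.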
When there is an uncertainty associated with a data operation, for example, join, because of incorrect or missing information, forcing a piece of data into a particular group or ignoring the piece of data may result in either a false association or a loss of information. For example, when attempting to join tables from two different databases, there is often no common key shared by the database tables. To overcome this, data within the tables, e.g., customer address, is used to infer a relation between records in the two databases. Address information, however, may be incorrect or incomplete. Suppose address validation against a definitive reference set, such as a Postal Address File, shows that the house number on a record in table A is invalid (no house exists with that house number), while table B contains multiple addresses that might be valid alternative completions of the address. Arbitrarily choosing a completion of the address in the record in table A may lead to a false association, while ignoring the record leads to a loss of information.
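The dilemma can be made concrete with a small Python sketch. The tables, field names, addresses, and the helpers `exact_join` and `candidate_completions` are hypothetical; the `valid_houses` dictionary is only a stand-in for a reference set such as a Postal Address File.

```python
# Hypothetical table A record whose house number fails validation.
table_a = [
    {"cust": "A-17", "house": "12", "street": "Lower Street", "postcode": "ZZ1 1AA"},
]
# Hypothetical table B records.
table_b = [
    {"acct": "B-301", "house": "2", "street": "Lower Street", "postcode": "ZZ1 1AA"},
    {"acct": "B-302", "house": "21", "street": "Lower Street", "postcode": "ZZ1 1AA"},
    {"acct": "B-303", "house": "7", "street": "Upper Street", "postcode": "ZZ1 1AB"},
]
# Stand-in for a reference address set: valid house numbers per street.
valid_houses = {
    ("Lower Street", "ZZ1 1AA"): {"2", "21"},
    ("Upper Street", "ZZ1 1AB"): {"7"},
}


def address_key(rec):
    return (rec["house"], rec["street"], rec["postcode"])


def exact_join(a_rows, b_rows):
    """Join on the full address; records without an exact match are dropped."""
    index = {}
    for rb in b_rows:
        index.setdefault(address_key(rb), []).append(rb)
    return [(ra["cust"], rb["acct"])
            for ra in a_rows for rb in index.get(address_key(ra), [])]


def candidate_completions(rec, b_rows):
    """Table B accounts on the same street and postcode as the record."""
    return [rb["acct"] for rb in b_rows
            if (rb["street"], rb["postcode"]) == (rec["street"], rec["postcode"])]


record = table_a[0]
is_valid = record["house"] in valid_houses[(record["street"], record["postcode"])]
joined = exact_join(table_a, table_b)
candidates = candidate_completions(record, table_b)
print(is_valid, joined, candidates)
```

Here the exact join returns no pairs for record A-17, so the record is lost, while the candidate list contains two accounts, so picking either one arbitrarily risks a false association.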
When there is an ambiguity associated with a data operation, e.g., search, because of inaccurate data entry, one approach is to propose a single alternative or a simple list of alternative corrections. If this is part of the validation process for data being entered into a database by an operator, proposing a single alternative when multiple alternatives exist may give the operator a false sense of security in accepting the correction. If a simple list of alternatives is provided, the operator may have no rational basis for choosing among the alternatives. If a single choice is required and some degradation of data quality is accepted for a wrong choice, then minimizing and quantifying the possible loss of data quality becomes the objective.
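One way to picture the difference between a simple list of alternatives and a quantified one is the Python sketch below. It uses the standard library's `difflib.SequenceMatcher` purely as a stand-in similarity measure; the reference entries, the misspelled query, the cutoff value, and the helper `scored_alternatives` are assumptions made for illustration, not a method taken from this description.

```python
from difflib import SequenceMatcher

# Hypothetical reference entries and a hypothetical mistyped query.
reference_entries = ["London", "Lincoln", "Lyndon", "Leeds"]
query = "Lndon"


def scored_alternatives(text, entries, cutoff=0.6):
    """Return (entry, similarity) pairs at or above the cutoff, best first."""
    scored = [(entry, SequenceMatcher(None, text, entry).ratio())
              for entry in entries]
    kept = [pair for pair in scored if pair[1] >= cutoff]
    # Sort by descending score, then by name so ties are ordered predictably.
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


alternatives = scored_alternatives(query, reference_entries)
for entry, score in alternatives:
    print(f"{entry}: {score:.3f}")
```

In this example the two best candidates receive the same score, which shows the operator that proposing either one alone as the correction would hide a real ambiguity.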