In the domain of information integration, data quality is important for master data management, data warehousing, system consolidation projects, etc. Information integration identifies duplicate data records, either within a source or across multiple data sources. Also, some data may need to be stored and/or processed securely.
Various data masking and encryption techniques may be applied to secure the data. For example, alteration techniques shuffle the order of values within a column, but leave the original values untouched. This retains data quality issues, such as typos in the individual values. As another example, Secure Hash Algorithm 2 (SHA-2) is a family of cryptographic hash functions that may be used to obscure data values. SHA-2 is currently considered collision-resistant, which means that, in practice, two different input values are mapped to two different output values. As a consequence, a data quality metric such as uniqueness can still be checked on SHA-2 hashed values: if the source value set was unique, the hashed value set will, due to collision resistance, be unique as well. However, because similar input values produce unrelated hash values, information related to typos and other data quality issues in the data may be lost on the hashed data.
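The following is a minimal sketch of the uniqueness observation above, assuming SHA-256 (one member of the SHA-2 family) and Python's standard hashlib module. The helper names sha256_hex and is_unique and the sample values are hypothetical and serve only as illustration.

```python
import hashlib


def sha256_hex(value: str) -> str:
    """Return the SHA-256 digest of a value as a hex string."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def is_unique(values) -> bool:
    """True if no value occurs more than once."""
    values = list(values)
    return len(set(values)) == len(values)


# Hypothetical sample column: "Smyth" could be a typo of "Smith".
names = ["Smith", "Smyth", "Jones"]
hashed = [sha256_hex(n) for n in names]

# Uniqueness survives hashing (barring a collision).
print(is_unique(names), is_unique(hashed))

# Similarity does not: the two near-identical names yield unrelated digests,
# so a typo check on the hashed column has nothing to work with.
print(hashed[0])
print(hashed[1])
```

Equal inputs always produce equal digests, so duplicates remain detectable, whereas the closeness of "Smith" and "Smyth" is not visible in the digests.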
Some systems use probabilistic matching procedures. For probabilistic matching, scores for two or more records are computed, measuring how similar they are. The matching process usually considers multiple attributes, and each of the attributes has a different weight regarding the outcome of the overall score (e.g., a date of birth field may have more weight than a middle name field due to its higher significance if the same/similar value is found across two records). The weight assignment for each attribute is part of the configuration of the matching procedure. For each comparison of an attribute across two or more records, rules can be specified such as:
    - Ignore x number of typos (e.g., Labt instead of LbaT is treated as the same if one typo is permitted, and, thus, would yield an exact match for this attribute)
    - Compare the value of a field on its UPPERCASE/lowercase representation only
    - For date fields with US and European date formats of MM-DD-YEAR versus DD-MM-YEAR, consider values in date attributes as being the same if switching from US to European date format (or vice versa) would make the date fields look the same, which means exchanging the order of DD and MM in the overall value.
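The weighted scoring and the three comparison rules above can be sketched as follows. This is a simplified illustration, not the scoring model of any particular matching system: the field names, the integer weights, and the function names are hypothetical, a typo is assumed to be one insertion, deletion, substitution, or swap of adjacent characters, and the typo rule is assumed to ignore case.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance counting insert, delete, substitute, adjacent swap."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            # Adjacent transposition, e.g. "ab" versus "ba".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def match_with_typos(a: str, b: str, max_typos: int = 1) -> bool:
    """Rule 1: ignore up to max_typos typos (case ignored by assumption)."""
    return osa_distance(a.casefold(), b.casefold()) <= max_typos


def match_case_insensitive(a: str, b: str) -> bool:
    """Rule 2: compare on the UPPERCASE representation only."""
    return a.upper() == b.upper()


def match_date_swapped(a: str, b: str) -> bool:
    """Rule 3: equal as written, or equal after exchanging DD and MM."""
    if a == b:
        return True
    pa, pb = a.split("-"), b.split("-")
    if len(pa) != 3 or len(pb) != 3:
        return False
    return [pa[1], pa[0], pa[2]] == pb


# Hypothetical configuration: field -> (weight, comparison rule).
RULES = {
    "last_name": (3, match_with_typos),
    "date_of_birth": (5, match_date_swapped),
    "middle_name": (2, match_case_insensitive),
}


def match_score(rec_a: dict, rec_b: dict, rules=RULES) -> float:
    """Share of the total weight carried by the attributes that match."""
    total = sum(weight for weight, _ in rules.values())
    matched = sum(
        weight
        for field, (weight, rule) in rules.items()
        if rule(rec_a[field], rec_b[field])
    )
    return matched / total


rec_a = {"last_name": "Labt", "date_of_birth": "03-04-1980", "middle_name": "ann"}
rec_b = {"last_name": "LbaT", "date_of_birth": "04-03-1980", "middle_name": "ANN"}
print(match_score(rec_a, rec_b))  # prints 1.0
```

All three rules operate on the clear-text characters of the values, which is why they cannot be applied directly to hashed or encrypted attributes.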
Some systems address the fuzzy matching logic problem on encrypted keyword data by implementing a limited wildcard character syntax for a given keyword value, where each conceivable wildcard permutation of a given keyword is encrypted and placed in an index (e.g., cat, c%t, ca%, %at). This approach requires that each keyword and all of its wildcard permutations be generated, encrypted, and indexed before any search can be performed.
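A minimal sketch of such a permutation index is given below. It assumes a single wildcard character per pattern, as in the cat example, and uses a keyed hash (HMAC-SHA-256 from Python's standard library) as a stand-in for the deterministic encryption step; the function names, the key, and the sample keywords are hypothetical.

```python
import hashlib
import hmac
from collections import defaultdict


def wildcard_permutations(keyword: str) -> list:
    """The keyword itself plus every variant with one character set to '%'."""
    perms = [keyword]
    for i in range(len(keyword)):
        perms.append(keyword[:i] + "%" + keyword[i + 1:])
    return perms


def token(key: bytes, text: str) -> str:
    """Deterministic keyed token standing in for the encrypted value."""
    return hmac.new(key, text.encode("utf-8"), hashlib.sha256).hexdigest()


def build_index(key: bytes, keywords) -> dict:
    """Map each protected permutation to the protected keywords it covers."""
    index = defaultdict(set)
    for kw in keywords:
        for perm in wildcard_permutations(kw):
            index[token(key, perm)].add(token(key, kw))
    return index


def search(index: dict, key: bytes, pattern: str) -> set:
    """Look up a pattern; only patterns indexed in advance can be found."""
    return index.get(token(key, pattern), set())


key = b"hypothetical-demo-key"
index = build_index(key, ["cat", "cot", "dog"])

print(wildcard_permutations("cat"))  # prints ['cat', '%at', 'c%t', 'ca%']
print(len(search(index, key, "c%t")))  # prints 2 (cat and cot)
print(len(search(index, key, "c%%")))  # prints 0 (pattern was never indexed)
```

Under these assumptions, a keyword of n characters contributes n + 1 index entries, and a pattern that was not generated ahead of time, such as one with two wildcards, returns nothing.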