Maintaining and managing software data quality has always been a major issue for enterprises. Data and business environments change very rapidly, and now include big data and unstructured data. Data governance has also been quite challenging. Businesses look to data as the foundation for key strategic initiatives within the organization. To perform meaningful activity, organizations need to ensure that the database is designed correctly and that the data is accurate. Without proper data management, a business cannot run effectively. A number of data quality issues exist in the ecosystem, such as the correctness of data entered into the system, technical issues in data processing and data analytics, and the mapping of data to business requirements. Therefore, a system needs to be designed to interpret and address these issues.
The quality and accuracy of data become more critical when the data relate to names and addresses. While it might seem easy to check names and addresses against a list, there are a number of practical problems in accurately finding a particular person by name and given address. Firstly, the lists may contain spelling errors, abbreviations, common address types (e.g. Madrid road, Valencia stop, etc.) and other anomalies which make matching against the list extremely difficult. These lists may also mix business names, individual names, common addresses with or without PIN codes, and aliases. In addition, many names may originate from foreign countries, which adds even more complexity to the name matching process. For all of these reasons, it is quite difficult to determine the data mapping and matching for a given set of names and addresses.
Recognition and matching of names and addresses are therefore extremely difficult tasks. Exact string matching also has very limited utility, as a match will not be recognized if there is any discrepancy between two or more names and addresses. Many relational database systems now include a “soundex” function for comparing two slightly dissimilar strings. These functions are mainly based on the “Soundex” system, which was originally developed as an index filing system for grouping similar sounding names. The initial version was patented by Robert C. Russell in 1918 as U.S. Pat. No. 1,261,167. Russell's system, which is also known as “soundex” or “soundexing”, used a simple phonetic algorithm for reducing a name to a four character alphanumeric code. The first character of the code is the first letter of the last name. The remaining three characters of the code are numerals derived from the remaining letters of the name.
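The four character code described above can be illustrated with a minimal sketch. It assumes the letter-to-digit table of the commonly described American Soundex variant, which is not necessarily Russell's original table nor the implementation of any particular database vendor. The function name `soundex` and the accent-stripping step are assumptions added for illustration.

```python
import unicodedata

# Assumed letter groups of the commonly described American Soundex variant;
# the text above does not specify the exact mapping.
_CODES = {}
for digit, letters in enumerate(("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"), start=1):
    for letter in letters:
        _CODES[letter] = str(digit)


def soundex(name: str) -> str:
    """Reduce a name to its first letter followed by three digits."""
    # Decompose accented letters (e.g. "Á" becomes "A" plus a combining mark)
    # and keep only the plain letters A-Z.
    decomposed = unicodedata.normalize("NFKD", name.upper())
    letters = [ch for ch in decomposed if "A" <= ch <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = _CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            # H and W are skipped and do not separate letters with equal codes.
            continue
        code = _CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        # Vowels have no code, so they reset prev and allow a repeated digit.
        prev = code
    # Pad with zeros and truncate to four characters.
    return (first + "".join(digits) + "000")[:4]
```

Under these assumptions, `soundex("Robert")` and `soundex("Rupert")` both yield `R163`, so the two names would be grouped together even though their spellings differ.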
In Spanish, a person's name consists of a given name (nombre), which may be simple or composite, followed by two family names or surnames (apellidos). A composite given name comprises two (not more) single names; for example, Juan Pablo is considered not a first and a second forename, but a single composite forename. Traditionally, a person's first surname is the father's first surname (apellido paterno), and the second one is the mother's first surname (apellido materno). For example, the name Eduardo Fernández Garrido has ForeName: Eduardo, Surname1: Fernández, and Surname2: Garrido. Further, each surname can also be composite, with the parts usually linked by the conjunction y or e (and), by the preposition de (of), or by a hyphen. For example, the name Juan Pablo Fernández de Calderón García-Iglesias has ForeName: Juan Pablo, Surname1: Fernández de Calderón, and Surname2: García-Iglesias.
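The naming structure above can be sketched as a simple right-to-left split. This is a heuristic sketch only: the names `SpanishName`, `CONNECTORS` and `split_spanish_name` are hypothetical, and it assumes that the name always carries exactly two surnames and that only the connectors y, e and de link surname parts. Names with multi-word particles, or with a connector inside the given name, are ambiguous and are not handled.

```python
from typing import List, NamedTuple

# Connectors that, per the description above, link the parts of a composite surname.
CONNECTORS = {"de", "y", "e"}


class SpanishName(NamedTuple):
    forename: str
    surname1: str
    surname2: str


def _pop_surname(tokens: List[str]) -> str:
    """Remove one (possibly composite) surname from the end of tokens."""
    parts = [tokens.pop()]
    # While the preceding token is a connector, absorb it together with the
    # surname part in front of it. Hyphenated parts are already one token.
    while len(tokens) >= 2 and tokens[-1].lower() in CONNECTORS:
        connector = tokens.pop()
        parts = [tokens.pop(), connector] + parts
    return " ".join(parts)


def split_spanish_name(full_name: str) -> SpanishName:
    """Split a full name into forename, first surname and second surname."""
    tokens = full_name.split()
    if len(tokens) < 3:
        raise ValueError("expected a forename and two surnames")
    surname2 = _pop_surname(tokens)
    if len(tokens) < 2:
        raise ValueError("no tokens left for the first surname and forename")
    surname1 = _pop_surname(tokens)
    if not tokens:
        raise ValueError("no tokens left for the forename")
    # Whatever remains is treated as the (simple or composite) forename.
    return SpanishName(" ".join(tokens), surname1, surname2)
```

Applied to the two examples above, the sketch returns Eduardo / Fernández / Garrido and Juan Pablo / Fernández de Calderón / García-Iglesias respectively.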
Many database experts have implemented numerous variations of the Soundex function in their database systems as a means of comparing slightly dissimilar strings. Although Soundex functions allow users to find information based on phonetic similarity, they are well known to be too coarse for reliable name matching. In addition, the implementation may differ between database vendors.