This invention in general relates to processing data and specifically relates to deduplication and grouping of similar supplier names in a database.
When an organization procures a product or service, an entry is typically made in a table called a spend table. The spend table contains the product or service description, the spend amount, the date, the supplier name, and similar fields. For the analysis of spend data, there is a need to identify a particular supplier or to find the list of the top ‘n’ suppliers. Since entries in the table are normally made manually, the same supplier may inadvertently be entered in a different manner at different places in the table. Such duplicated entries create multiple accounts for the same supplier, so that the spend for that supplier is split across redundant records instead of being aggregated under one name. The proportion of duplicative data, and the problems associated with it, grows as the number of entries in the spend table grows. Manual grouping is labor intensive and time consuming. Furthermore, the grouping becomes outdated as soon as it is finished, because fresh entries added to the table are not grouped during entry. Hence, even after all the effort put into manual grouping, the problems of duplicative data still prevail.
To ensure that duplication of a supplier's name does not occur, it is necessary to check the supplier's name against all the name entries in the spend table. If there are ‘n’ supplier names, then to ensure non-duplication of the ‘n’ names it may be necessary to check each supplier's name against all the remaining supplier names. This checking technique requires considerable time because the number of comparisons grows on the order of n squared.
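The cost of the exhaustive check described above can be illustrated with a minimal sketch. The function names, the sample supplier names, and the simple normalization rule (lowercasing and dropping periods and commas) are assumptions made for illustration only; they are not taken from the invention.

```python
from itertools import combinations


def normalize(name):
    """Illustrative normalization: lowercase, drop periods/commas, collapse spaces."""
    cleaned = name.lower().replace(".", " ").replace(",", " ")
    return " ".join(cleaned.split())


def pair_count(n):
    """Number of distinct pairs among n names: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def naive_duplicate_pairs(names):
    """Compare every name with every other name.

    Returns the list of matching pairs and the number of comparisons made.
    """
    comparisons = 0
    matches = []
    for a, b in combinations(names, 2):
        comparisons += 1
        if normalize(a) == normalize(b):
            matches.append((a, b))
    return matches, comparisons


# Hypothetical spend-table supplier names.
names = ["Acme Corp.", "ACME CORP", "Globex Inc", "Initech", "acme corp"]
matches, comparisons = naive_duplicate_pairs(names)
print(matches)
print(comparisons, pair_count(len(names)))
```

For five names the loop performs ten comparisons; for one thousand names the same approach would need 499,500, which is the growth the paragraph above refers to.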
Hence, there is a need for a computerized system that automatically handles the above stated problems and effectively groups similar supplier names. The system has to address inadvertent mistakes that are due to typing or spelling. The system should also not group supplier names that differ only slightly in spelling but are dissimilar in reality, for example, KBC TV and NBC TV. The system has to improve deduplication and grouping performance while also reducing the time taken for both.
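The difficulty with names such as KBC TV and NBC TV can be seen in a short sketch. It uses the generic string similarity ratio from Python's standard `difflib` module purely as a stand-in; the measure, the `similarity` helper, and the misspelled sample name are assumptions for illustration and are not the grouping method of the invention.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Generic character-level similarity in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# A genuine typing mistake: the two names refer to the same (hypothetical) supplier.
typo_score = similarity("Acme Corp", "Acme Crop")

# Two different suppliers whose names differ by a single character.
distinct_score = similarity("KBC TV", "NBC TV")

print(round(typo_score, 3), round(distinct_score, 3))
```

Both pairs score above 0.8 under this measure, so a single similarity threshold would either merge the two distinct suppliers or miss the typing mistake. This is why character-level similarity alone does not satisfy the need stated above.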