This invention relates generally to computer-implemented methods and systems for automatically standardizing data items that refer to the same thing but appear in distinct non-standard forms and formats in data collections, and more particularly to the automated standardization of unstructured non-standard names in big data databases.
Frequently, data items that refer to a common thing or to related entities of a group appear in many distinct non-standard forms in one or more collections of data. This occurs in many different areas and kinds of data, including, for example, transaction-related data in a relational database. Non-standard unstructured data items cause problems for automated data processing operations that run analytics on the database seeking to make associations between data items or to derive information from the data. These problems are particularly acute with regard to proper names that are used to identify entities, such as persons or businesses, because the names are unstructured and are not standardized, even within a single business entity. Names referring to the same entity may be spelled differently, may contain truncated or shortened words, and may contain special or alphanumeric characters. For instance, a business named ACMEMART may appear in a database in many different forms, such as “ACME MART”, “ACMEMART, INC.” or “Acme-Market”, and if the business has stores in different locations or has separate departments in stores that are separate cost centers, each store or department may be designated separately, e.g., “ACMEMART #0267”.
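By way of illustration only, the following minimal Python sketch applies simple rule-based cleanup to the example names above. It is not the method of the invention. The function name `naive_key` and the suffix list `_SUFFIXES` are hypothetical and were chosen for this example. The sketch assumes that case, punctuation, corporate suffixes, and store designators of the form “#0267” are the only sources of variation.

```python
import re

# Hypothetical list of corporate suffixes to drop (an assumption, not from the source).
_SUFFIXES = {"INC", "LLC", "CORP", "CO", "LTD"}


def naive_key(name: str) -> str:
    """Collapse a raw name to a crude comparison key using fixed rules."""
    upper = name.upper()
    # Drop store or department designators such as "#0267".
    upper = re.sub(r"#\s*\d+", " ", upper)
    # Treat every run of non-alphanumeric characters as a separator.
    tokens = re.sub(r"[^A-Z0-9]+", " ", upper).split()
    # Remove corporate suffixes, then join the remaining tokens.
    return "".join(t for t in tokens if t not in _SUFFIXES)


variants = ["ACME MART", "ACMEMART, INC.", "Acme-Market", "ACMEMART #0267"]
for v in variants:
    print(f"{v!r} -> {naive_key(v)!r}")
```

Under these assumptions, three of the four variants collapse to the key ACMEMART, while “Acme-Market” yields ACMEMARKET and remains unmatched. Fixed rules of this kind do not resolve truncated or shortened words, which is one reason such names have typically required manual review.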
For analytics processing that requires a global view of an entire business, individual-level labeling (non-standard naming) is limiting and requires pre-processing of the data to identify and aggregate the various names into a standardized format for processing. Typically, pre-processing is a manual operation to verify name assignment and correct for obscure or outlier differences. For small and conventionally sized data sets, and for data that is not rapidly changing, this may be practical. However, for “big data” sets, and particularly for transactional data, such pre-processing is burdensome and may be impractical if not impossible. “Big data” refers to large, complex collections of data sets having a volume, velocity, and variety that exceed an organization's traditional storage or computing capacity for accurate and timely decision making. For some organizations, big data may be data exceeding hundreds of gigabytes. For others, it may be tens or hundreds of terabytes. Big data is difficult to work with using most relational database management systems and statistics and visualization packages. Instead, it may require massively parallel software running on tens, hundreds, or even thousands of servers.
It is desirable to provide systems and methods that can pre-process and automatically standardize data items, such as names, in a database to associate data items having distinct non-standard forms and formats with a common standard format so that the data can be aggregated, queried, and analyzed to determine relationships and characteristics among standardized groups of data items. More particularly, it is desirable to afford an automated name standardization system and process that may be applied to big data, and it is to these ends that the present invention is directed.