1) Field of the Disclosure
The disclosure relates generally to machine learning data classification systems and methods, and more particularly, to a computer implemented data driven classification and data quality checking system and method that uses associative memory.
2) Description of Related Art
Data that is high quality, accurate and correctly entered and received into a data driven classification or management system is of great value to numerous industries, since low quality, incorrectly input data having errors can be expensive to fix and difficult to use in making industry or technical data decisions. Data quality errors may occur in data entry systems that require assigning categories to the data, if there are too many categories to choose from. In addition, data quality errors may occur if there is inconsistent training, inconsistent technical background, and inconsistent levels of experience among data entry personnel. Moreover, data quality errors may occur from simple human error due to tiredness or lack of concentration.
Known systems and methods exist for improving data quality and reducing data errors. For example, rule-based data standardization systems with coded rules for handling patterns or for handling predetermined lists exist. However, such rule-based systems may not be feasible for complex, machine learned data driven classification and management systems because the rules may be expensive to derive manually and expensive to maintain. In addition, handling patterns in pattern-based systems may be laborious and time consuming, particularly for highly specialized data sets.
Thus, it would be advantageous to have a data driven classification and data quality checking system and method that is not a rule-based system or method and that improves the quality, accuracy, and correctness of data entry, such as data entry that involves users manually labeling or categorizing information.
In addition, data driven classification or management systems, such as machine learning systems, may typically use predictive models or classifiers to predict future behavior or outcomes. A business or industry process may require that a predictive model or classifier reach a minimum level of accuracy. However, most predictive models or classifiers are evaluated only on their overall level of accuracy.
Known systems and methods exist that calculate only the overall accuracy of the predictive model or classifier and assume that all the decisions made by the predictive model or classifier are equally difficult. Thus, if a business or industry process requires a high level of accuracy, and if the required overall level of accuracy is not met, the predictive model or classifier is unusable until that level is achieved. Attempting to reach the required overall level of accuracy may be expensive and difficult, and the level may sometimes be unattainable.
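By way of illustration only, the following minimal Python sketch shows how a single overall accuracy figure may mask differences in the difficulty of individual decisions. It is not drawn from any known system described above. The decision data, the confidence scores, the 0.8 threshold, and the function names `overall_accuracy` and `accuracy_at_or_above` are all hypothetical and are assumed solely for this example.

```python
from typing import List, Tuple

# Hypothetical (confidence, was_correct) pairs for ten classifier decisions.
# These values are invented for illustration and are not measured results.
DECISIONS: List[Tuple[float, bool]] = [
    (0.95, True), (0.92, True), (0.90, True), (0.88, True), (0.85, True),
    (0.60, True), (0.55, False), (0.50, False), (0.45, True), (0.40, False),
]


def overall_accuracy(decisions: List[Tuple[float, bool]]) -> float:
    """Fraction of all decisions that were correct, ignoring confidence."""
    return sum(1 for _, correct in decisions if correct) / len(decisions)


def accuracy_at_or_above(
    decisions: List[Tuple[float, bool]], threshold: float
) -> Tuple[float, int]:
    """Accuracy over only the decisions whose confidence meets the threshold.

    Returns the accuracy and the number of decisions retained.
    """
    kept = [correct for confidence, correct in decisions if confidence >= threshold]
    if not kept:
        return 0.0, 0
    return sum(kept) / len(kept), len(kept)


if __name__ == "__main__":
    print("overall accuracy:", overall_accuracy(DECISIONS))
    accuracy, count = accuracy_at_or_above(DECISIONS, 0.8)
    print("accuracy at confidence >= 0.8:", accuracy, "over", count, "decisions")
```

In this hypothetical data, the overall accuracy is 0.7, while the five decisions made with a confidence of at least 0.8 are all correct. A process that requires a high level of accuracy would reject the classifier on the overall figure alone, even though a subset of its decisions may meet the requirement.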
Thus, it would be advantageous to have a data driven classification and data quality checking system and method that does not assume that all of the decisions made by the predictive model or classifier are equally difficult and that solves the issue that arises when a business or industry process requires that the predictive model or classifier achieve a high level of accuracy.
Further, data driven classification or management systems, such as machine learning systems, typically require classification or scoring of records to be able to present the classified or scored records to downstream systems in a consistent manner. Known systems and methods exist for classification and scoring of records, for example, spreadsheet-based solutions. In such solutions, analysts and other users review entries record-by-record. Such record review approaches may be tedious since every record must be reviewed and scored individually. In addition, it may be difficult to easily understand how others have previously scored similar records, it may be difficult for an individual to remain consistent in his or her own scoring decisions for similar records, and it may be difficult to group similar records, particularly where no consistent or normalized way exists to identify all similar records in a data set together for scoring. With these drawbacks, where many records require scoring, a user or analyst may spend most of the workday performing the scoring tasks themselves. The analyst may therefore be able to spend only a small percentage of the workday performing the in-depth analysis that generates deep understanding of the underlying issues in order to provide a more complete resolution to a given class of problems.
Thus, it would be advantageous to have a data driven classification and data quality checking system and method that provides the capability to group similar records together to facilitate batch classifying or scoring of the similar records to be able to present the classified or scored records to downstream systems with greater consistency.
Further, data driven classification or management systems, such as machine learning systems, often use free text data as sources of data for input into the system. However, such free text data may need to be “cleaned” and/or normalized into a domain vocabulary that enables downstream systems that utilize free text data to generate more effective results.
Known systems and methods exist for “cleaning” and normalizing free text data, for example, systems and methods that identify terms and phrases, such as city names, geographic place names, aircraft model identifiers, and other terms and phrases, and that also recognize parts of speech, such as nouns, verbs, adjectives, adverbs, conjunctions, and articles. However, such known systems and methods do not recognize abbreviations, domain-specific phrases, or regional terms and phrases without pre-identifying them or applying rules or other data extraction, transformation, and loading techniques to identify these text patterns.
Thus, it would be advantageous to have a data driven classification and data quality checking system and method that provides a simple approach for developing a domain vocabulary from free text data for use in downstream systems.
Moreover, data driven classification or management systems, such as machine learning systems, may use an associative memory system using artificial intelligence, a neural network, fuzzy logic, and/or other suitable technologies capable of forming associations between pieces of data and then retrieving different pieces of data based on the associations. The different pieces of data in the associative memory system may come from various sources of data.
Where industries employ associative memory approaches to perform data classification, it is desirable to develop a control set for the associative memory records or instances the industry uses. Known systems and methods exist for developing a control set for use with an associative memory, for example, developing a control set consisting of a random selection of a specified percent of records. However, such a control set does not typically cover the diversity of records necessary for the associative memory to perform well over an unscored data set. In addition, many of the selected records may be very similar to one another, which may result in the associative memory inaccurately scoring records that have few or no neighbors or that have low similarity to other records.
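By way of illustration only, the random selection approach and its coverage limitation may be sketched in Python as follows. The record layout, the `category` field, the category names, the 10 percent sampling rate, and the function names `random_control_set` and `category_coverage` are hypothetical and are assumed solely for this example; they do not describe any particular known system.

```python
import random
from typing import Dict, List

# Hypothetical data set: one very common category and two rare ones.
RECORDS: List[Dict[str, object]] = (
    [{"id": i, "category": "common"} for i in range(90)]
    + [{"id": 90 + i, "category": "uncommon"} for i in range(8)]
    + [{"id": 98 + i, "category": "rare"} for i in range(2)]
)


def random_control_set(
    records: List[Dict[str, object]], percent: float, seed: int = 0
) -> List[Dict[str, object]]:
    """Select a specified percent of records uniformly at random."""
    rng = random.Random(seed)  # seeded so that the example is repeatable
    size = max(1, round(len(records) * percent / 100))
    return rng.sample(records, size)


def category_coverage(
    records: List[Dict[str, object]], control_set: List[Dict[str, object]]
) -> float:
    """Fraction of the distinct categories that appear in the control set."""
    all_categories = {record["category"] for record in records}
    covered = {record["category"] for record in control_set}
    return len(covered) / len(all_categories)


if __name__ == "__main__":
    control = random_control_set(RECORDS, percent=10)
    print("control set size:", len(control))
    print("category coverage:", category_coverage(RECORDS, control))
```

Because the selection is uniform, most of the selected records are likely to come from the most common category, and records from the rare categories may be absent from the control set altogether.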
Another known method for developing a control set for an associative memory involves searching for keywords and adding a sampling of those records to the control set in an interactive fashion to educate the associative memory on one group of components at a time. However, such a known method may be slow and inaccurate, and the total effort it requires may be difficult to calculate.
Thus, it would be advantageous to have a data driven classification and data quality checking system and method with a control set for an associative memory that has a desired and required diversity, accuracy, and size, and that assists the associative memory in accurately scoring additional and future records.
Accordingly, there is a need in the art for a data driven classification and data quality checking system and method that have improved accuracy and quality, are reliable and efficient, and that provide advantages over known systems and methods.