Data is abundant in many fields, including science, engineering, medicine, insurance, information systems and the like. Labeling the data is a common task that precedes further use of the data. For instance, data may be input from a sensor in an analog format and converted into a digital format, or input directly in a digital format. Alternatively, the data may be input in a raw format from a database. The data is then analyzed and labeled by a recognizer. By “labeled” it is meant that the recognizer applies cognitive or substance-based identifiers to the data, for instance, to identify peaks, troughs, patterns and trends of particular significance.
A non-exhaustive and non-limiting list of recognizers includes fluorescent dye detection software for deoxyribonucleic acid (DNA) sequencer assays, fingerprint detection and identification software, voice recognition and identification software, speech recognition software, facial recognition and identification software, optical character recognition software, part-of-speech taggers in natural language processing, document relevance determination in an information retrieval setting, and quantitative analysis software for investing, finance and the like.
The characteristics and attributes of the data set (e.g., a length of the data set and an observed frequency of each label in the data set) are estimated based on the output of the recognizer. For instance, the output of the recognizer can be tallied and this crude count forms the estimate of the attributes of the data set. However, it is nearly impossible to tell whether or not these characteristics and attributes are actually correct. Even if a human manually reviews the labeled data set, there is no way to know whether the correct label has been applied. Additionally, there are many issues with manually reviewing the labeled data set, including the time and cost of performing the review.
It is for these reasons that computers are often employed to label the data in the first place: a computer operates quickly, at reduced expense, and precisely (i.e., consistently, whether right or wrong). Additionally, the criteria of comparison may be difficult to present for human review, but more readily coded for computer-based analysis (e.g., if the difference between two or more labels is not readily perceivable by a human judge).
If multiple recognizers are available, the outputs of each recognizer can be averaged to better approximate the characteristics and attributes of the data set. For example, if numerous recognizers apply the same label to a data point of a data set and/or the number of recognizers is increased, then the certainty that the label is correct increases. Unfortunately, the number of recognizers may be limited. For instance, there may not be a sufficient number of recognizers available to correct the data to within a desired level of accuracy (e.g., below a 1% error rate). Alternatively, the cost of using additional recognizers may be prohibitive, thereby constructively limiting the number of recognizers that are available for use.
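By way of illustration only, the combination of multiple recognizer outputs described above can be sketched as a simple majority vote. The function name `majority_vote`, the labels "A" and "B", and the sample outputs are hypothetical and are not drawn from any particular recognizer; the sketch assumes every recognizer emits exactly one label per data point.

```python
from collections import Counter

def majority_vote(votes):
    """Return (label, agreement) for one data point.

    `votes` holds one label per recognizer. `agreement` is the fraction of
    recognizers that chose the winning label. Ties go to the label that
    appears first in `votes`.
    """
    counts = Counter(votes)
    label, top = counts.most_common(1)[0]
    return label, top / len(votes)

# Hypothetical outputs of three recognizers over four data points.
outputs = [
    ["A", "A", "A"],
    ["A", "B", "A"],
    ["B", "B", "A"],
    ["B", "B", "B"],
]

consensus = [majority_vote(v) for v in outputs]

# Crude estimate of how often each label occurs in the data set.
label_frequency = Counter(label for label, _ in consensus)
```

The agreement fraction rises when more recognizers apply the same label, which mirrors the increase in certainty noted above. It is not a calibrated probability that the label is correct.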
If the accuracy of each recognizer is known, the correct characteristics and attributes of the data can be better approximated using a weighted average. Unfortunately, the accuracy of each recognizer may be unknown or prohibitive to determine. Known techniques for determining the accuracy of a recognizer involve manually reviewing the output from the recognizer or comparing the output to a test set, which is time consuming and costly.
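As a minimal sketch of the weighted combination mentioned above, each recognizer's vote can be weighted by the log-odds of its known accuracy. This weighting is one common choice and is an assumption made here for illustration; it presumes that the recognizers err independently and that each accuracy lies strictly between 0 and 1. The name `weighted_vote` and the sample accuracies are hypothetical.

```python
import math
from collections import defaultdict

def weighted_vote(votes, accuracies):
    """Pick the label with the largest accuracy-weighted score.

    `votes[i]` is the label from recognizer i and `accuracies[i]` is its
    assumed-known accuracy, strictly between 0 and 1.
    """
    scores = defaultdict(float)
    for label, acc in zip(votes, accuracies):
        # Log-odds weight: a more accurate recognizer counts for more.
        scores[label] += math.log(acc / (1.0 - acc))
    return max(scores, key=scores.get)

# One highly accurate recognizer outweighs two weak ones that disagree.
winner = weighted_vote(["A", "B", "B"], [0.95, 0.60, 0.60])
```

In this example an unweighted majority vote would select "B", whereas the weighted vote selects "A", which shows why an incorrect statement of a recognizer's accuracy can change the result.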
Further, the accuracy of each recognizer may vary across different subsets of data. This variability skews the weighted average by relying upon an incorrect or overly simplified statement of the accuracy of the recognizer. If the circumstance of the subset of data were known, then it would be possible to account for this variability. However, assessing the circumstance of the subset of data suffers from the same problems as manually reviewing the output.
Other known techniques for determining the accuracy of a recognizer involve automatic comparisons performed using test sets in which correct values of the test set are presumed or known to be correct (i.e., staged data having labels defined as correct). These techniques fail when no test set is available.
In U.S. Patent Publication 2009/0080731, entitled “System and Method for Multiple Instance Learning For Computer Aided Diagnosis,” a system and method determine the maximum likely inference of the accuracy of medical labels utilized in cancer stage cells of an image. However, the technique does not rely on the frequency of label voting patterns, and a system that does so would be commercially advantageous. As a result, there is a need for techniques to automatically assess the attributes of the data set and/or the recognizer used thereon where the correct label of each data point of the data set is unknown (i.e., in an unsupervised context).
Knowledge of the correct label enables calculations to be made concerning relevant statistics of the data and/or the performance of the recognizers, such as the prevalence of correct labels in the data set and the accuracy of the recognizer. Knowledge of the accuracy of the recognizer, in turn, enables the prevalence of correct labels to be calculated.
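To illustrate how a known recognizer accuracy enables the prevalence to be calculated, the following sketch treats the simplest case of a single binary label. It assumes the recognizer's sensitivity (rate of labeling true positives as positive) and specificity (rate of labeling true negatives as negative) are known; the observed positive rate q then satisfies q = p·sensitivity + (1 − p)·(1 − specificity), which can be solved for the true prevalence p. The function name and the numeric values are hypothetical.

```python
def corrected_prevalence(observed_rate, sensitivity, specificity):
    """Solve q = p*se + (1 - p)*(1 - sp) for the true prevalence p.

    `observed_rate` is the fraction of data points the recognizer labeled
    positive. The result is clipped to [0, 1] because sampling noise can
    push the raw estimate slightly outside that range.
    """
    denom = sensitivity + specificity - 1.0
    if denom == 0.0:
        raise ValueError("recognizer is uninformative (se + sp == 1)")
    p = (observed_rate + specificity - 1.0) / denom
    return min(1.0, max(0.0, p))

# Hypothetical case: true prevalence 0.2, sensitivity 0.9, specificity 0.8.
# The recognizer then labels 0.2*0.9 + 0.8*0.2 = 0.34 of the data positive.
estimate = corrected_prevalence(0.34, 0.9, 0.8)
```

In this example the raw tally (0.34) overstates the prevalence, and the correction recovers a value of approximately 0.2. The calculation depends entirely on the accuracy figures being known in advance.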
However, there are no known techniques to calculate or infer the prevalence of the labels and the accuracy of the recognizers when the correct labels of the data points of the data set are unknown. A system to infer the prevalence of labels and the accuracy of the recognizers would be commercially advantageous, would satisfy a long-felt need, and would have widespread application in diverse fields.
In the distinct field of automated decision making, such as, for example, in the field known as ensemble methods for decision, computers make decisions based on data sets as well as the outputs of recognizers applied thereto. The known techniques of automated decision making are designed such that ambiguities or imperfections in the data set or the output of the recognizers are incorporated within an acceptable margin of error, approximated (i.e., rounded using thresholds), overlooked or otherwise ignored, which enables operation in a best-efforts manner given the inherent deficiencies of the data set and the recognizers in a non-ideal context (i.e., real world application).
However, the known techniques of automated decision making do not improve or expand upon the known information concerning the data set or the recognizers. Additionally, the field of automated decision making is not concerned with determining the attributes of the data set and the recognizers in any way.
The object of the present invention is, therefore, to infer attributes of the data set and the recognizer used thereon, which, among other desirable attributes, significantly reduces or overcomes the above-mentioned deficiencies of previous techniques.