Data analysis and feature recognition systems, such as that described by Brinson et al. in U.S. patent application 2007/0244844, which is incorporated by reference in its entirety herein, or any user-specified, preset, or automatically determined application or engine engaged for use in the same or a similar manner, afford users the capability to perform data analysis of, feature recognition within, and/or data transformation of the representative data of a given data set selection(s), regardless of the medium or data type, modality, submodality, etc. The aim is to extract useful or meaningful information from what otherwise may be hollow or enigmatic data, to provide visualization alternatives, and/or to aid in the formulation of hypotheses or conclusions relating to or derived from the original data. The processes used by such a system vary by application use, genre pertinence, user specifications, system requirements, and the like, but generally direct the system to recognize the unique data values and patterns characteristic of a particular target feature(s) of interest (hereafter “known feature(s)”) within a given data set and subsequently to distinguish these associated data values and patterns upon evaluation of succeeding data sets containing the previously identified known feature(s).
For example, suppose a user desires to identify areas of deforestation in imagery of a South American rain forest. Using evaluation algorithms and a sampling area of data (hereafter “target data area” or “TDA”), i.e., any collection of data elements surrounding or associated with one or pluralities of centralized data elements, a data analysis and feature recognition system is employed to analyze a selection of data from the original data set that is known by the user to contain the known feature of interest “Deforestation.” In an alternate embodiment, pluralities of evaluation algorithms and the TDA are used to evaluate the original data set in its entirety. Ultimately, the algorithmically ascertained data values and patterns, which are crucial to the definitive recognition of the inherent characteristics of “Deforestation,” are stored, or trained, as the newly identified known feature “Deforestation” into a previously established data storage structure, such as an algorithm datastore or any user-specified, preset, or automatically determined storage device (e.g., a data array, database, data output overlay, value cache, datastore) capable of at least temporarily storing data.
In the aforementioned example, “Deforestation” is uniquely identified from other features in the data set selection by the distinct combinations of calculated algorithm values and patterns obtained from analysis of the original data set selection using the intentionally selected and ordered pluralities of evaluation algorithms and/or the specifically chosen TDA. Accordingly, future attempts to positively identify “Deforestation” within subsequent imagery of the South American rain forest using this previously trained algorithm datastore require the use of the same pluralities of evaluation algorithms and the same TDA as were used to train the algorithm datastore in the first place. This ensures proper comparison and correlation of the uniquely identified data values and patterns, which are associated with “Deforestation” and stored in the algorithm datastore, with the newly procured and analyzed data of subsequent imagery.
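The train-then-recognize cycle described above can be illustrated with a minimal sketch. It is not the implementation of the referenced system; the cross-shaped TDA, the two evaluation algorithms (`mean_value`, `value_spread`), the dictionary-based datastore, and the tiny example images are all assumptions chosen only to show why the same algorithms and the same TDA must be used for both training and recognition.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple

Image = List[List[int]]
Offset = Tuple[int, int]
Algorithm = Callable[[Sequence[int]], int]
Signature = Tuple[int, ...]

# Hypothetical TDA: the central data element plus its four direct neighbours.
CROSS_TDA: List[Offset] = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]


def tda_values(image: Image, row: int, col: int, tda: List[Offset]) -> List[int]:
    """Collect the data elements covered by the TDA, skipping out-of-bounds ones."""
    values = []
    for d_row, d_col in tda:
        r, c = row + d_row, col + d_col
        if 0 <= r < len(image) and 0 <= c < len(image[0]):
            values.append(image[r][c])
    return values


# Hypothetical evaluation algorithms, applied in a fixed order.
def mean_value(values: Sequence[int]) -> int:
    return sum(values) // len(values)


def value_spread(values: Sequence[int]) -> int:
    return max(values) - min(values)


ALGORITHMS: List[Algorithm] = [mean_value, value_spread]


def signature(image: Image, row: int, col: int,
              algorithms: List[Algorithm], tda: List[Offset]) -> Signature:
    """Ordered tuple of algorithm values calculated over the TDA at one element."""
    values = tda_values(image, row, col, tda)
    return tuple(algorithm(values) for algorithm in algorithms)


def train(datastore: Dict[Signature, Set[str]], image: Image,
          selection: List[Tuple[int, int]], feature: str,
          algorithms: List[Algorithm], tda: List[Offset]) -> None:
    """Store the signature of every selected element under the known feature."""
    for row, col in selection:
        key = signature(image, row, col, algorithms, tda)
        datastore.setdefault(key, set()).add(feature)


def recognize(datastore: Dict[Signature, Set[str]], image: Image,
              algorithms: List[Algorithm],
              tda: List[Offset]) -> Dict[Tuple[int, int], Set[str]]:
    """Return the elements whose signature matches a previously trained one."""
    hits = {}
    for row in range(len(image)):
        for col in range(len(image[0])):
            features = datastore.get(signature(image, row, col, algorithms, tda))
            if features:
                hits[(row, col)] = set(features)
    return hits


# Illustrative data only: 90 stands in for cleared land, 10 for forest canopy.
training_image = [[10, 10, 10, 10],
                  [10, 10, 10, 10],
                  [10, 10, 90, 90],
                  [10, 10, 90, 90]]
subsequent_image = [[90, 90, 10, 10],
                    [90, 90, 10, 10],
                    [10, 10, 10, 10],
                    [10, 10, 10, 10]]

datastore: Dict[Signature, Set[str]] = {}
train(datastore, training_image, [(3, 3)], "Deforestation", ALGORITHMS, CROSS_TDA)
hits = recognize(datastore, subsequent_image, ALGORITHMS, CROSS_TDA)
```

Because recognition is a lookup on the calculated values, the datastore is only meaningful together with the exact algorithm list and TDA that produced its keys.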
Such a data analysis and feature recognition system as described previously is often plagued by system, data, ground truth, and/or algorithm inadequacies that can make successful user interaction with the system problematic. The inherent deficiencies of such a system are examined forthwith.
Assume a user has trained a data analysis and feature recognition system to recognize the data values and patterns of the known feature “Deforestation” using specific pluralities of evaluation algorithms and a given TDA. Afterwards, the user discovers that slightly or even heavily modified pluralities of evaluation algorithms, which can include several additional algorithms not originally included or previously available, are actually better suited for delimiting “Deforestation” in subsequent imagery. If the user processes the subsequent imagery using these modified, yet more pertinent, pluralities of evaluation algorithms with the intent of identifying “Deforestation” within the imagery, all historical training becomes inapplicable, resulting in recognition of few or potentially no data values and patterns associated with “Deforestation.” This is because a change in the algorithm set results in the calculation of different data values and patterns, so that even when “Deforestation” exists in subsequent imagery, the system is unable to recognize or reveal it. Because all historical training of “Deforestation” is irrelevant once the pluralities of evaluation algorithms are altered, the user is left with two options: either the known feature “Deforestation” is retrained using the original data set and the new pluralities of evaluation algorithms, which can be time or cost prohibitive, or the subsequent imagery continues to be evaluated using the original pluralities of evaluation algorithms, which can yield inaccurate or flawed results.
Similarly, in an alternate instance, the user realizes that the TDA used to evaluate a given data set is ineffective or deficient and chooses to alter the TDA in order to achieve improved analysis results. As described previously, all historical training of the known feature “Deforestation” is lost if the TDA used to evaluate subsequent imagery is different from the TDA used in the assessment of the original data set. This is because a change in the TDA results in the calculation of algorithm values on different data and patterns, thereby preventing correlation of the previously trained data with the newly analyzed data.
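The loss of historical training in both cases (a changed algorithm set and a changed TDA) can be sketched as follows. This is an illustration under stated assumptions, not the referenced system: the data is a one-dimensional sequence for brevity, and the algorithms (`mean_value`, `value_spread`, `value_sum`), the offset-based TDAs, and the sample values are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Algorithm = Callable[[Sequence[int]], int]


def mean_value(values: Sequence[int]) -> int:
    return sum(values) // len(values)


def value_spread(values: Sequence[int]) -> int:
    return max(values) - min(values)


def value_sum(values: Sequence[int]) -> int:
    return sum(values)


def signature(data: List[int], index: int,
              algorithms: List[Algorithm], tda: List[int]) -> Tuple[int, ...]:
    """Algorithm values calculated over the TDA (index offsets) around one element."""
    values = [data[index + off] for off in tda if 0 <= index + off < len(data)]
    return tuple(algorithm(values) for algorithm in algorithms)


# Hypothetical original and modified configurations.
original_algorithms: List[Algorithm] = [mean_value, value_spread]
modified_algorithms: List[Algorithm] = [mean_value, value_spread, value_sum]
original_tda = [-1, 0, 1]
modified_tda = [-2, -1, 0, 1, 2]

# Train "Deforestation" on the centre of the bright run in the training data.
training_data = [10, 10, 90, 90, 90, 10, 10]
datastore: Dict[Tuple[int, ...], str] = {
    signature(training_data, 3, original_algorithms, original_tda): "Deforestation"
}

# The same kind of bright run appears in subsequent data, centred on index 2.
subsequent_data = [10, 90, 90, 90, 10, 10, 10]

same_setup = datastore.get(
    signature(subsequent_data, 2, original_algorithms, original_tda))
changed_algorithms = datastore.get(
    signature(subsequent_data, 2, modified_algorithms, original_tda))
changed_tda = datastore.get(
    signature(subsequent_data, 2, original_algorithms, modified_tda))
```

In this sketch only `same_setup` finds the trained feature. The other two lookups calculate values that were never stored, so the feature goes unrecognized even though it is present in the data.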
A second difficulty, which tends to be prevalent in many data analysis and feature recognition systems on the market today, is a situation most aptly described as “algorithm confusion.” Simply, algorithm confusion is the inability of the evaluation algorithms, which operate on a given TDA, to correctly distinguish between different sets of data values and patterns that have been previously trained and stored in a datastore as different known features. This feature misidentification is compounded when the algorithm datastore is queried to identify the previously trained, unique known feature in a subsequent data set. In such a situation, the evaluation algorithms are unable to definitively distinguish between different known features that exist within the same data set but possess distinct data values and patterns.
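A small sketch may help make algorithm confusion concrete. It assumes two hypothetical evaluation algorithms (`mean_value`, `value_spread`) and two made-up sets of TDA values; the point is only that distinct data can collapse to the same calculated values when the chosen algorithms are too weak to separate them.

```python
from typing import Callable, List, Sequence, Tuple

Algorithm = Callable[[Sequence[int]], int]


def mean_value(values: Sequence[int]) -> int:
    return sum(values) // len(values)


def value_spread(values: Sequence[int]) -> int:
    return max(values) - min(values)


def signature(values: Sequence[int], algorithms: List[Algorithm]) -> Tuple[int, ...]:
    """Ordered tuple of algorithm values for one set of TDA values."""
    return tuple(algorithm(values) for algorithm in algorithms)


# Two distinct sets of TDA values (illustrative only) for two different features.
cleared_land = [50, 50, 50]
river_bank = [40, 50, 60]

weak_algorithms: List[Algorithm] = [mean_value]
stronger_algorithms: List[Algorithm] = [mean_value, value_spread]

# With only the mean, both features yield the same value and cannot be told apart.
weak_collision = (signature(cleared_land, weak_algorithms)
                  == signature(river_bank, weak_algorithms))

# Adding the spread separates them, because the underlying data really differs.
stronger_collision = (signature(cleared_land, stronger_algorithms)
                      == signature(river_bank, stronger_algorithms))
```

Here the data itself is distinct; the confusion comes entirely from the algorithm set, which is what separates this case from data confusion.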
Thirdly, a typical data analysis and feature recognition system may be unable to correctly distinguish between individually trained known features as a result of “data confusion,” which describes a scenario wherein identical data values and patterns within a given data set are trained as more than one different known feature. Data confusion can result from two primary causes: (1) during training, the user makes a mistake, such as overlapping a portion(s) of the training selection areas for the affected known features; or (2) the resultant data values of the given TDA are identical to those at a different location(s) due to a lack of sensor resolution or sensitivity in the data recording mechanism. This is especially problematic because, in the case of data confusion, erroneous information is introduced at the training stage of data analysis, and this misinformation can persist as long as the training is being used. Once the system is in use, there is no functionality available to allow the user or the system to go back and review the training selections made during earlier stages of algorithm datastore construction.
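Data confusion, as described above, amounts to identical calculated values being trained as more than one known feature. The following sketch shows what detecting that condition could look like; it assumes a hypothetical datastore layout (a dictionary from a tuple of algorithm values to the set of feature names trained on it) and made-up entries, and is not a feature of the systems discussed here.

```python
from typing import Dict, Set, Tuple

Signature = Tuple[int, ...]


def find_data_confusion(
        datastore: Dict[Signature, Set[str]]) -> Dict[Signature, Set[str]]:
    """Return every stored signature that was trained as more than one known feature."""
    return {sig: features for sig, features in datastore.items() if len(features) > 1}


# Illustrative datastore contents: (90, 0) was trained as two different features,
# for example through overlapping training selections or limited sensor resolution.
datastore: Dict[Signature, Set[str]] = {
    (90, 0): {"Deforestation", "Road"},
    (10, 0): {"Forest"},
    (70, 80): {"Deforestation"},
}

confused = find_data_confusion(datastore)
```

Such a check flags the conflicting entries but cannot by itself say which of the two causes produced them.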
Complicating matters further, the standard methods of algorithm datastore training for the purpose of feature identification do not commonly inform the user whether conflicting or erroneous datastore training is due to ambiguous source data (i.e., data confusion), faulty data selections during training (i.e., data confusion), inadequacy of the TDA selected to solve the problem at hand (i.e., data confusion), insufficiency of the evaluation algorithms employed to distinguish between the data values and patterns representing the different known features (i.e., algorithm confusion), or another unidentified problem.