Embodiments described herein relate generally to information analysis, and more particularly to methods and apparatus for concept-based analysis of structured and unstructured data.
Organizations often utilize sophisticated computer systems and databases to inform and automate portions of the decision-making process. Many such systems organize relevant data into a structured format (such as a relational database), making it accessible to a broad array of query, analysis, and reporting applications. Some of these systems programmatically calculate business decisions and make assessments based on available data and program logic. However, much of the information relevant to these calculations is often stored in a variety of unstructured formats, such as handwritten notes, word processor documents, e-mails, saved web pages, printed forms, photographic prints, and the like.
Because typical systems are incapable of organizing and searching the content of such documents, their decision outputs are generally based on only the subset of pertinent information that exists in structured form, rendering these outputs incomplete and at times inaccurate. Those systems that do incorporate unstructured data into their decision-making algorithms often convert text information into a coded form that can be stored in a structured format (such as a relational database field). This approach is undesirable, however, because much context and meaning can be lost when a complex idea conveyed in language is shoehorned into a simple, coded form.
Further, traditional techniques for logically combining such coded data are susceptible to producing false positives, as correlations between factors that contribute to a given decision output are not accounted for in such models. More specifically, in a given scenario in which multiple factors contribute to a particular outcome or determination, many systems generate a determination based on the number of those factors present in a given data set—defining rules that assume an increased likelihood of a given output for each additional factor present in the data set. This approach is flawed, however, because two or more of these factors may not occur independently in the data. For example, two or more such factors could be positively correlated, such that the presence of a first factor always implies the presence of the second. In such a scenario, if the first factor is present, the presence of the second factor does not increase the likelihood of the particular output under consideration. This flaw can result in the generation of a false positive, as the system inappropriately includes the presence of the second factor as an additional weight in its decision calculus.
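By way of illustration only, the double-counting problem described above can be sketched with a small, hypothetical data set. The records, the factor names (factor_a, factor_b), and the helper functions below are assumptions introduced for this example and are not drawn from any particular system. In the sketch, factor_b is present in every record in which factor_a is present, so the observed outcome rate is the same whether one conditions on factor_a alone or on both factors, yet a rule that simply counts factors assigns the two-factor case a higher score.

```python
from fractions import Fraction

# Hypothetical records: (factor_a, factor_b, outcome).
# The data is made up so that factor_b is present in every record
# where factor_a is present (factor_a implies factor_b).
RECORDS = [
    (True, True, True),
    (True, True, False),
    (True, True, True),
    (True, True, False),
    (False, True, False),
    (False, False, False),
    (False, False, True),
    (False, False, False),
]


def outcome_rate(records, predicate):
    """Return the fraction of matching records whose outcome is True."""
    matching = [r for r in records if predicate(r)]
    if not matching:
        return None  # no matching records, so the rate is undefined
    positives = sum(1 for r in matching if r[2])
    return Fraction(positives, len(matching))


def naive_score(factor_a, factor_b):
    """Counting rule: each factor present adds one point to the score."""
    return int(factor_a) + int(factor_b)


# Outcome rate when factor_a is present, ignoring factor_b.
rate_given_a = outcome_rate(RECORDS, lambda r: r[0])

# Outcome rate when both factors are present.
rate_given_a_and_b = outcome_rate(RECORDS, lambda r: r[0] and r[1])

print("rate given factor_a:", rate_given_a)
print("rate given factor_a and factor_b:", rate_given_a_and_b)
print("naive score, factor_a only:", naive_score(True, False))
print("naive score, both factors:", naive_score(True, True))
```

Because the two observed rates are equal in this data set, the higher naive score for the two-factor case reflects no additional evidence; a counting rule with a threshold of two would flag such records solely because the same underlying signal was counted twice.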
Additionally, the inability of a system to properly incorporate unstructured data into its calculations forces individuals to consider the relevant unstructured documents separately—without the significant aid of computer processing power. This laborious task not only greatly increases the time and cost of the decision-making process, but also introduces additional imprecision, as individuals are unlikely to analyze data with the consistency and speed of a computerized solution. Finally, individuals are unlikely to optimally combine their own intuitions regarding a set of unstructured data with computer-generated analysis of structured data to reach an accurate final conclusion.
Thus, a need exists for methods and apparatus that programmatically organize and analyze structured and unstructured data together, and apply business logic to make accurate determinations based on that data. A need further exists for methods and apparatus that analyze and make a determination about a set of data, using techniques that avoid the false positives that often result when contributing factors and concepts are positively correlated within the data.