A database may be thought of, at least in theory, as an organized collection of data, very often defined in connection with schemas, tables, queries, reports, views, and/or other objects, and very often organized in a logical, object-oriented, relational, and/or other manner. Databases have become fundamental components for many modern computer systems and, in this regard, database management systems (DBMSs) typically include computer software applications that interact with one or more users, other applications, and the database itself, e.g., to facilitate the definition, creation, querying, updating, administration, etc., of the databases and/or the data associated therewith.
Databases, directly or indirectly, support a wide variety of applications. For instance, databases underlie computerized library catalog systems, flight reservation systems, computerized parts inventory systems, etc. Some databases support lead tracking and sales-related metrics. Other databases support organizations' human resources functions including, for example, maintaining employees' personal information, vacation time, performance, and/or the like. Other databases support accounting functions, are involved in economic data analysis, and/or the like. So-called business-to-business (B2B), business-to-consumer (B2C), and other patterns of purchasing also are typically enabled by databases.
The advent of so-called Big Data has placed a number of challenges on modern computerized database technologies. Although there are a number of different definitions of Big Data, those skilled in the art understand that it generally refers to datasets so large and/or complex that traditional data processing applications are inadequate. Challenges also arise because Big Data oftentimes is not structured, which makes it difficult and sometimes even impossible to process using conventional database systems. Challenges arise in areas including data analysis, capturing, curation, searching, sharing, storage, transferring, visualization, privacy, and/or the like. Indeed, with so many different information sources, so many non-standard input source types, the ability to store so much information, and the desire to critically analyze it, challenges associated with how best to manage such data are growing.
Certain example embodiments address the above and/or other concerns. For instance, certain example embodiments help manage “bad” or “imperfect” data. By way of example, the industry standard for databases used in procurement involves only 16% clean and current data. Although organizations oftentimes are concerned about their “bottom lines,” procurement in healthcare-related contexts can have unfortunate complications. Certain example embodiments provide a lifecycle technology solution that helps receive data from a variety of different data sources of a variety of known and/or unknown formats, standardize it, fit it to a known taxonomy through model-assisted classification, store it to a database in a manner that is consistent with the taxonomy, and allow it to be queried for a variety of different usages. Thus, although it typically is technologically infeasible to create “perfect data” (especially, for example, in Big Data contexts), certain example embodiments help manage imperfect and/or bad data, e.g., promoting data integrity and/or consistency, in a manner that self-learns and/or evolves over time.
One aspect of certain example embodiments thus relates to transforming unstructured textual and/or other data to enriched, cleansed, and well-formed data. Another aspect of certain example embodiments relates to classification to a taxonomy, which can in at least some instances advantageously provide an indication regarding what a given record or data-point in question is. This may in turn allow inferences about the associated entry to be made, e.g., such that the attributes that are important or useful to know can be identified. Furthermore, enrichment of the type described herein can be used to “fill in the blanks” in terms of the missing attribute information.
In certain example embodiments, a data classification system is provided. An input interface is configured to receive documents comprising data entries, with at least some of the data entries having associated features represented directly in the documents. A data warehouse is backed by a non-transitory computer readable storage medium and configured to store curated and classified data elements. A model repository stores a plurality of different classification model stacks, with each classification model stack including at least one classification model. Processing resources, including at least one processor and a memory, are configured to at least: inspect documents received via the input interface to identify, as input data, data entries and their associated features, if any, located in the inspected documents; and segment the input data into different processing groups. For each different processing group: one or more model stacks from the model repository to be executed on the respective processing group is/are identified; each identified model stack is executed on the respective processing group; results from the execution of each identified model stack are ensembled to arrive at a classification result for each data entry in the respective processing group; the classification results are grouped into one of first and second classification types, with the first classification type corresponding to a confirmed classification and the second classification type corresponding to an unconfirmed classification; for the first classification type, each data entry in this group is moved to a result set; for the second classification type, a determination is made as to the processing group from among those processing groups not yet processed that is most closely related to each data entry in this group, and each data entry in this group is moved to its determined most closely related processing group; each data entry in the result set is stored, with or without additional processing, to the data warehouse, in accordance with its arrived at classification result; and records in the data warehouse are able to be queried from a computer terminal.
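The per-group flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation of any embodiment: the names `Entry`, `ensemble`, and `process_groups` are hypothetical, each "model" is assumed to be a plain callable returning a label, ensembling is assumed to be a simple majority vote, a classification is assumed to be "confirmed" when the winning vote share meets an arbitrary `min_agreement` value, and `relatedness` is a caller-supplied scoring function. The handling of entries that remain unconfirmed after the last group is also an assumption, since the text above does not address it.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Assumption: a model is any callable mapping entry text to a class label.
Model = Callable[[str], str]
ModelStack = List[Model]


@dataclass
class Entry:
    text: str
    label: Optional[str] = None


def ensemble(votes: List[str], min_agreement: float) -> Tuple[str, bool]:
    """Majority vote; 'confirmed' when the winning share meets min_agreement."""
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes) >= min_agreement


def process_groups(
    groups: Dict[str, List[Entry]],
    stacks: Dict[str, List[ModelStack]],
    relatedness: Callable[[Entry, str], float],
    min_agreement: float = 0.75,
) -> Tuple[List[Entry], List[Entry]]:
    """Process groups in insertion order; return (result_set, unresolved)."""
    result_set: List[Entry] = []
    unresolved: List[Entry] = []
    pending = list(groups)  # names of processing groups not yet processed
    while pending:
        name = pending.pop(0)
        for entry in groups[name]:
            # Execute every model of every stack identified for this group.
            votes = [m(entry.text) for stack in stacks[name] for m in stack]
            entry.label, confirmed = ensemble(votes, min_agreement)
            if confirmed:
                result_set.append(entry)  # first type: confirmed
            elif pending:
                # Second type: move to the most closely related unprocessed group.
                target = max(pending, key=lambda g: relatedness(entry, g))
                groups[target].append(entry)
            else:
                unresolved.append(entry)  # assumption: no groups left to try
    return result_set, unresolved
```

In this sketch, the entries in `result_set` would then be written to the data warehouse under their assigned labels, while `unresolved` entries could be routed to manual review or another fallback.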
In certain example embodiments, a data classification system is provided. An input interface is configured to receive documents comprising line-item data entries, with at least some of the line-item data entries having associated attributes represented directly in the documents. A data warehouse is backed by a non-transitory computer readable storage medium and configured to store curated and classified data elements. A classification model stack includes (a) a plurality of classification models, (b) a plurality of confidence models, and (c) a related multi-level taxonomy of classifications applicable to line-item data entries included in documents received via the input interface. Processing resources, including at least one processor and a memory, are configured to at least: execute classification models from the classification model stack to associate the line-item data entries included in the documents received via the input interface with potential classifications at each level in the related taxonomy; execute confidence models from the classification model stack to assign probabilities of correctness for each potential classification generated by execution of the classification models; determine, for each of the line-item data entries included in the documents received via the input interface, a most granular level of potential classification that meets or exceeds a threshold value; designate a classification result corresponding to the determined most granular level of potential classification for each of the line-item data entries included in the documents received via the input interface; store each line-item data entry, with or without additional processing, to the data warehouse, along with an indication of its associated classification result; and enable records in the data warehouse to be queried from a computer terminal.
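The selection of the most granular classification that meets or exceeds a threshold can be illustrated with a short sketch. The function name `most_granular_classification`, the representation of candidates as (label, probability) pairs ordered from the coarsest taxonomy level to the finest, and the sample labels and probabilities are all assumptions made for illustration. The sketch also assumes that only the selected level needs to pass the threshold; an embodiment might additionally require every ancestor level to pass.

```python
from typing import List, Optional, Tuple


def most_granular_classification(
    candidates: List[Tuple[str, float]],
    threshold: float,
) -> Optional[Tuple[int, str]]:
    """Return (level_index, label) for the deepest level whose probability
    of correctness meets or exceeds the threshold, or None if no level does.

    candidates[0] is the coarsest taxonomy level; candidates[-1] is the finest.
    """
    # Walk from the finest level back toward the coarsest.
    for level in range(len(candidates) - 1, -1, -1):
        label, probability = candidates[level]
        if probability >= threshold:
            return level, label
    return None


# Hypothetical line item classified at three taxonomy levels.
line_item_candidates = [
    ("Medical Supplies", 0.97),     # level 0 (coarsest)
    ("Gloves", 0.91),               # level 1
    ("Nitrile Exam Gloves", 0.62),  # level 2 (finest)
]
selected = most_granular_classification(line_item_candidates, threshold=0.8)
```

With these made-up values, the finest level falls below the 0.8 threshold, so the line item would be designated at the intermediate "Gloves" level rather than being forced into a low-confidence leaf classification.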
Corresponding methods, computer readable storage mediums tangibly storing instructions for executing such methods, and/or the like also are contemplated.
The features, aspects, advantages, and example embodiments may be used separately and/or applied in various combinations to achieve yet further embodiments of this invention.