1. Field of the Invention
Embodiments of the invention described herein pertain to the field of computer systems and software. More particularly, but not by way of limitation, one or more embodiments of the invention are directed to computer systems and methods for validating data object classification and consolidation using external references.
2. Description of the Related Art
Corporate data typically resides in multiple disparate systems, databases, spreadsheets, email and other applications across an enterprise. Where an item of data exists in multiple locations and versions, data consistency problems arise. For instance, where only the most current version is needed but two different versions of the same basic data exist, business decisions and business processes that utilize alternate versions of this data are error prone and can lead to costly mistakes. Errors resulting from data inconsistency eventually lead to a decreased ability to make effective decisions. There are various approaches to resolving this data integrity problem, one of which involves the use of a locally controlled and maintained database. Such an approach, however, is impractical because effective enterprise systems for implementing business processes must generally make use of both internally maintained data and externally generated data, and the externally generated data varies more than data that is locally controlled, formatted and maintained within an organization. This internal and external data is provided to systems for implementing one or more business processes in the form of data objects.
An example of an external data object is any document or data presented to an organization from an outside entity such as a vendor or contractor. These external data objects are generally formatted inconsistently and differ from vendor to vendor. For example, invoices from multiple vendors may contain an item categorized as “printer paper” in one invoice, while in another invoice the same item may be categorized as “office supplies” or “stationery.” The same item may also appear in invoices under differing part numbers unique to each vendor. It is for these and other reasons that the purchase of a particular item from different sources across an enterprise requires significant additional manual labor. The inability to group purchases for items belonging to a category, for example, reduces the likelihood of negotiating a better price for the item from a single vendor. This example is not unique to transactional processing. All business areas and business processes that utilize duplicative data suffer higher costs and higher failure rates for a variety of reasons. For example, publication of out-of-sync information to a catalog, website or data pool can magnify duplicative, old or erroneous data and prove costly to other areas of a business that must then deal with the consequences of the problem data.
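The grouping problem described above can be illustrated with a minimal Python sketch. The vendor names, part numbers, amounts and the `CANONICAL_ITEM` lookup table are hypothetical and introduced only for illustration; the text does not prescribe any particular mapping mechanism.

```python
from collections import defaultdict

# Hypothetical invoice lines: (vendor, vendor part number, vendor category, amount).
INVOICE_LINES = [
    ("Vendor A", "A-1001", "printer paper", 120.0),
    ("Vendor B", "PP-77", "office supplies", 80.0),
    ("Vendor C", "ZX9", "stationery", 50.0),
]

# Hypothetical lookup from (vendor, part number) to one canonical item.
CANONICAL_ITEM = {
    ("Vendor A", "A-1001"): "printer paper",
    ("Vendor B", "PP-77"): "printer paper",
    ("Vendor C", "ZX9"): "printer paper",
}


def spend_by(lines, key_fn):
    """Sum invoice amounts grouped by the key that key_fn returns."""
    totals = defaultdict(float)
    for line in lines:
        totals[key_fn(line)] += line[3]
    return dict(totals)


# Grouping by the vendor-supplied category splits one item three ways.
by_vendor_category = spend_by(INVOICE_LINES, lambda line: line[2])

# Grouping by canonical item shows the total spend on the same item.
by_canonical_item = spend_by(
    INVOICE_LINES, lambda line: CANONICAL_ITEM[(line[0], line[1])]
)
```

Under these assumed inputs, the vendor-supplied categories hide the fact that all three purchases concern the same item, whereas the canonical grouping exposes the combined spend that could be negotiated with a single vendor.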
It is commonplace for any relatively sophisticated business process to also make use of internal data objects. An example of an internal data object is any document or data generated within an organization for internal and sometimes external use. To make effective business decisions, competitive businesses generally utilize internal business data as part of the decision-making process. When the data in these systems is not made consistent and shared with key decision makers throughout the organization, inefficiencies occur. Achieving consistent data across multiple distributed heterogeneous systems within an organization is difficult. Establishing effective communication links between disparate systems is a prerequisite to making the data consistent, but does not alone solve the problem. Even when internal data is effectively shared throughout an organization, problems still arise because, over time, the data may come to exist in different forms and models. Since achieving data consistency is difficult, it is common for companies to maintain internal data in independent realms. For example, because of the difficulties associated with merging internal data, some companies independently maintain data for each of their different corporate divisions and only utilize such data for business decisions relevant to a particular corporate division. The maintenance of independent systems often occurs after mergers and acquisitions, where company systems are almost certainly heterogeneous and typically utilize radically different structures and data models.
Regardless of the origin of data, whether internal or external, organizations typically seek to coordinate interaction between heterogeneous systems to minimize redundant data. Current business systems begin by classifying and/or identifying similar and overlapping data and then coordinating the integration of such data in a way that ensures the data stays consistent across different systems. One approach some organizations use is to maintain what is known as master data. Master data may be thought of as the definitive version of a data object and may include customer, product, supplier and employee data, for example. Known solutions for coordinating the master data, i.e., classifying, storing, augmenting and consolidating it, are generally insufficient. Moreover, the fact that master data may exist does little to provide information technology personnel with insight into the process used to determine whether an object matches another object or belongs to an existing classification, or into how these decisions are validated. Failing to successfully coordinate master data objects yields data object redundancies and inconsistencies that disrupt the business decision-making process and increase the overall cost of doing business. In cases where customer data is included in the master data and becomes out of sync, customer service suffers from incomplete data, requiring customers to call multiple places within the same company to obtain the required information. In some cases the failure to efficiently service customers causes enough frustration that it results in decreased customer loyalty and ultimately leads to a loss of customers. By utilizing master data, a business entity may consolidate, synchronize, distribute, centrally manage and publish any type of master data across an enterprise and with trading partners.
Utilizing master data enables improved customer acquisition and retention, cross-sell and up-sell, global spend analysis, workforce management, new product introduction, cataloging and publishing, sourcing, procurement, inventory management, shipping and invoicing.
Information about internal and external data objects is exchanged amongst different parts of the system using what is called transactional data. For instance, when new data is submitted or when data objects are updated, modified or deleted, transactional data messages are sent to the components of the system that require such information. Transactional data presents challenges to companies that interact with each other when electronically exchanging information. It is possible to transfer transactional data in a multitude of different formats (e.g. EDI, Excel, XML, PDF, Text) and to send this data over numerous networks and network topologies (e.g. AS2, SWIFT) and standards (e.g. EDI-HL7). One of the complexities that exists with respect to the transfer of transactional data is that the data is sometimes incomplete. Attributes and identifiers may, for instance, be abbreviated, incomplete or even missing. In addition, the quantities of transactional data objects, and hence the daily updates, may be very large and therefore require significant system resources. Since the format and content of the transactional data is non-uniform, every batch received may yield minimal reuse of data and/or process decisions. For example, when transactional data such as an invoice is presented to a company, the invoice may be normalized and/or transformed into XML. Each line item may then be classified and/or consolidated with other items in the invoice or within the master data. Generally, applications exist that allow for the classification of items using rules or artificial intelligence. However, effective systems for performing automated validation of classification or consolidation decisions made with respect to the transactional data are lacking. Data objects with missing or abbreviated attributes, for example, may fail classification or, even worse, may be classified incorrectly if there are chance associations or relationships in the data.
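The invoice example above (normalization to XML followed by rule-based classification of each line item) might be sketched as follows. This is only an illustration under stated assumptions: the XML element names, the keyword rules in `RULES` and the sample invoice are hypothetical, and simple keyword matching stands in for whatever rules or artificial intelligence a real classification application would use.

```python
import xml.etree.ElementTree as ET

# Hypothetical invoice already normalized to XML; element names are assumptions.
INVOICE_XML = """
<invoice vendor="Vendor A">
  <item><description>A4 printer paper, 500 sheets</description><partNumber>A-1001</partNumber></item>
  <item><description>ppr A4</description><partNumber></partNumber></item>
  <item><description>Toner cartridge, black</description><partNumber>T-20</partNumber></item>
</invoice>
"""

# Hypothetical keyword rules: category -> keywords, any of which selects it.
RULES = {
    "printer paper": ("paper",),
    "toner": ("toner",),
}


def classify(description):
    """Return the first category with a keyword in the description, else None."""
    text = description.lower()
    for category, keywords in RULES.items():
        if any(keyword in text for keyword in keywords):
            return category
    return None


def classify_invoice(xml_text):
    """Classify each line item and record which items need validation."""
    results = []
    for item in ET.fromstring(xml_text).iter("item"):
        description = item.findtext("description", default="")
        part_number = (item.findtext("partNumber") or "").strip()
        category = classify(description)
        results.append({
            "description": description,
            "category": category,
            # Unclassified items or items missing a part number need review.
            "needs_validation": category is None or not part_number,
        })
    return results


results = classify_invoice(INVOICE_XML)
```

In this sketch the abbreviated item "ppr A4" fails classification and lacks a part number, so it is the only item flagged for validation, which mirrors the difficulty with abbreviated or missing attributes described above.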
Current methods for validating the classification and consolidation of business objects rely on a form of “know it when I see it” manual processing that is labor intensive and error prone. The list of companies that perform classifying, cleansing and normalizing of transactional data objects is large. However, existing solutions do not scale to the transactional quantity required and do not provide the confidence level required to make effective decisions. The reason for this is that existing systems typically use manual work and rely, for example, on two employees separately making classification or consolidation decisions. Although this is in some cases an effective form of validation, it lacks the scalability, automation and confidence needed to be truly effective in an enterprise context. Confidence intervals are a common form of interval estimation, and a confidence level is the probability value associated with a confidence interval. If U and V are statistics, i.e., “observable” random variables, whose probability distribution depends on some unobservable parameter theta, and Pr(U < theta < V) = x, where x is a number between 0 and 1, then the random interval (U, V) is a “(100·x)% confidence interval for theta”. The number x, or 100·x expressed as a percentage, is then called the confidence coefficient or confidence level.
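As a worked illustration of this definition (a standard textbook case, not an example taken from the present disclosure), assume n independent observations drawn from a normal distribution with unknown mean theta and known standard deviation sigma, and let \bar{X} denote their sample mean. The statistics U and V below then bound theta with probability of approximately 0.95:

```latex
\[
U = \bar{X} - 1.96\,\frac{\sigma}{\sqrt{n}}, \qquad
V = \bar{X} + 1.96\,\frac{\sigma}{\sqrt{n}}, \qquad
\Pr(U < \theta < V) \approx 0.95 .
\]
```

Here x = 0.95, so (U, V) is a 95% confidence interval for theta and 95% is the confidence level.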
Manual validation of a classification or consolidation requires time- and labor-intensive inquiries, generally via email or telephone. This process is inefficient when the number of records becomes large. If external references were used to corroborate, for example, the missing, hidden or abbreviated attributes of records, then automated classification and consolidation validation could occur. Because of the limitations described above, there is a need for a system and method for validating data object classification and consolidation using external references.
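The idea of corroborating a classification against an external reference can be sketched in Python as follows. This is not the claimed method, only a minimal illustration: the `EXTERNAL_REFERENCE` table, the `identifier` field, the sample record and the three outcome labels are all hypothetical assumptions.

```python
# Hypothetical external reference, e.g. a catalog keyed by a shared identifier.
EXTERNAL_REFERENCE = {
    "0012345678905": {
        "name": "A4 printer paper, 500 sheets",
        "category": "printer paper",
    },
}


def validate_classification(record, proposed_category, reference):
    """Return 'confirmed', 'contradicted' or 'unverified' for a proposed category."""
    entry = reference.get(record.get("identifier"))
    if entry is None:
        return "unverified"  # no external evidence either way
    if entry["category"] == proposed_category:
        return "confirmed"
    return "contradicted"


# Hypothetical record whose description is abbreviated but whose identifier survives.
record = {"identifier": "0012345678905", "description": "ppr A4"}
```

For instance, `validate_classification(record, "printer paper", EXTERNAL_REFERENCE)` returns "confirmed" even though the record's own description is abbreviated, because the external entry supplies the missing attribute; a record with no matching identifier is reported as "unverified" rather than silently accepted.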