As enterprises grow, the infrastructure that supports them usually grows at an exponential pace. Moreover, the data collected and utilized by these enterprises may grow even faster than the infrastructure itself. This data may come from a wide variety of sources, including improved instrumentation, automated enterprise business processes, individual productivity software, and analytics. Improved instrumentation that captures digital rather than analog data has driven the growth of scientific, engineering, and production data. In business, data growth has come from the implementation of information technology (IT) systems that automate enterprise-level business processes, such as enterprise resource planning and customer relationship management, and from individual productivity applications such as email and word processing. Additionally, after enterprises capture data, they generally want to use it to improve their business processes and outcomes. However, transforming this transactional data into a format suitable for analytics generates even more data and is itself a major source of data growth.
Compounding the problem of data growth is the way this data is stored. In many instances, the data resides in a wide assortment of environments, subsystems, and databases. Because of differences between these data repositories, coalescing the gathered data into meaningful sets of related data can be a daunting process, and presenting meaningful sets of data to a user may be nearly impossible.
These difficulties stem not so much from the presence of a large number of data repositories as from the varying formats in which these repositories store data. In general, to correlate sets of data between data repositories of different formats, the data in each repository must be analyzed and manually correlated.
For example, suppose sales data resides in an Oracle database that stores a customer name and associated sales data, and address data resides in an SAP database that stores a customer name and an address associated with that customer name. Now suppose that the Oracle database refers to the customer name as “CUSNAME,” while the field name for the customer name in the SAP database is “CUSTOMERNAME.” In this case there would be no way to automatically relate the sales data and the address for a given customer, even if the exact same customer name were stored in the “CUSNAME” field of the Oracle database and the “CUSTOMERNAME” field of the SAP database. This is because the formats and fields used to store the same data vary between data repositories.
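The mismatch described above can be illustrated with a minimal sketch. The two records, their values, the “SALES” and “ADDRESS” field names, and the helper shared_fields are hypothetical stand-ins for rows from the two repositories; only the field names “CUSNAME” and “CUSTOMERNAME” come from the example.

```python
# Hypothetical rows standing in for the two repositories; only the field
# names CUSNAME and CUSTOMERNAME come from the example in the text.
oracle_row = {"CUSNAME": "Acme Corp", "SALES": 1200.0}
sap_row = {"CUSTOMERNAME": "Acme Corp", "ADDRESS": "1 Main St"}


def shared_fields(left, right):
    """Return the field names that appear in both records."""
    return set(left) & set(right)


# An automatic join needs a field name that both sides agree on. None exists
# here, even though both rows carry the same customer name value.
common = shared_fields(oracle_row, sap_row)
same_customer = oracle_row["CUSNAME"] == sap_row["CUSTOMERNAME"]
```

In this sketch the set of shared field names is empty while the stored customer names are equal, which is the situation in which a name-based join has nothing to key on.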
Typically, relating data between data repositories is a labor-intensive manual process. To continue with the above example, to obtain the sales data and address for a given customer, a person queries the Oracle database with the customer name in the “CUSNAME” field to obtain sales data for the customer, and then queries the SAP database with the customer name in the “CUSTOMERNAME” field to obtain the address of the customer. Even in this small example, obtaining data from a wide variety of data repositories is a time-consuming task. Coalescing and analyzing this data is more difficult, and displaying the results of these data mining efforts is more difficult still.
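The two-query procedure described above might look like the following sketch. It is illustrative only: in-memory SQLite databases stand in for the Oracle and SAP repositories, and the table names, the “AMOUNT” and “ADDRESS” columns, the sample values, and the function lookup_customer are all assumptions introduced for the example.

```python
import sqlite3

# In-memory SQLite databases stand in for the two repositories (assumption).
sales_db = sqlite3.connect(":memory:")
sales_db.execute("CREATE TABLE sales (CUSNAME TEXT, AMOUNT REAL)")
sales_db.execute("INSERT INTO sales VALUES ('Acme Corp', 1200.0)")

address_db = sqlite3.connect(":memory:")
address_db.execute("CREATE TABLE addresses (CUSTOMERNAME TEXT, ADDRESS TEXT)")
address_db.execute("INSERT INTO addresses VALUES ('Acme Corp', '1 Main St')")


def lookup_customer(name):
    """Run one query per repository, each written against its own field name."""
    sales = sales_db.execute(
        "SELECT AMOUNT FROM sales WHERE CUSNAME = ?", (name,)
    ).fetchall()
    address = address_db.execute(
        "SELECT ADDRESS FROM addresses WHERE CUSTOMERNAME = ?", (name,)
    ).fetchone()
    # The results are combined by hand because nothing relates the two fields.
    return {
        "customer": name,
        "sales": [row[0] for row in sales],
        "address": address[0] if address else None,
    }


result = lookup_customer("Acme Corp")
```

Each query has to be written with knowledge of the field name its repository uses, and the results are combined by hand. That per-repository effort is repeated for every additional data source.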
Thus, a need exists for methods and systems for mapping between various data repositories and using these mappings to obtain, correlate, analyze and display data from these data repositories.