Business enterprises rely on their ability to access and understand large volumes of heterogeneous data, that is, data of mixed organization drawn from a variety of sources of information and held in a variety of organization formats. As the volume of business data has steadily increased, the difficulty of understanding and interacting with that data has also increased, typically at a greater rate than the data growth itself. A typical business relies on a wide range of heterogeneous data in situations where the data itself may be rapidly evolving. For example, stock items are ordered from a variety of vendors via purchase orders and enter inventory as they are received, with the associated data sometimes having a particular format, sometimes a different format, and sometimes a previously unrecognized format. In addition, customers place sales orders which are fulfilled from inventory, creating shipping waybills, invoices, and account statements with comparable data format variations. Periodically, a company aggregates these individual transactions into reports which may be organized by sales region, by month or quarter, or by product line. Modern companies need the ability to generate these reports quickly, efficiently, and as they are needed. However, significant time and effort can be required to generate useful analysis under conventional approaches.
Some conventional relational database management systems (“RDBMS”) manage such disparate sets of information by consolidating comparable elements into relatively homogeneous tables linked by associations. For example, there may be a table of vendors, each of which is associated with the products it supplies in an inventory table, which in turn is associated with orders in an order table that is also associated with a table of customers and with tables of billing and shipping records. These pre-constructed data connections and layouts are called the database schema. Design of database schemas can profoundly affect both data consistency and database performance. This can be especially true for transaction-oriented database update operations necessary for applications such as inventory management.
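As a minimal sketch of how a schema can bear on data consistency, the following example uses Python's standard sqlite3 module to link a vendor table to an inventory table. The table names, column names, and sample values are hypothetical and chosen only for illustration; no particular RDBMS or schema is implied by the discussion above.

```python
import sqlite3

# Hypothetical two-table schema: each inventory item must reference a vendor.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    CREATE TABLE vendors (
        vendor_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL
    );
    CREATE TABLE inventory (
        item_id   INTEGER PRIMARY KEY,
        vendor_id INTEGER NOT NULL REFERENCES vendors(vendor_id),
        quantity  INTEGER NOT NULL
    );
""")

conn.execute("INSERT INTO vendors (vendor_id, name) VALUES (1, 'Acme Supply')")
conn.execute("INSERT INTO inventory (item_id, vendor_id, quantity) VALUES (100, 1, 25)")

# An item that points at a nonexistent vendor violates the association
# declared in the schema, so the database refuses the insert.
try:
    conn.execute("INSERT INTO inventory (item_id, vendor_id, quantity) VALUES (101, 99, 5)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

In this sketch the consistency rule lives in the schema rather than in application code, which is one reason schema design can matter so much for transactional workloads.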
These highly structured databases are efficient but inflexible, a limitation often revealed when, for example, the database used to maintain transactional sales and inventory information is also used as a source of aggregate information. Attempting to aggregate information into end-of-month or end-of-quarter reports from transactional sales and inventory information can be a significant burden. Report generation requires access to many records per query and many data fields per record. Aggregation and reporting is a usage domain for which conventional RDBMSs are not optimized. Indeed, the update-in-place operations that facilitate transactional efficiency in an RDBMS tend to thwart, for example, long-term trend analysis by overwriting historical data with updated data, requiring coarser-grained time series solutions such as snapshots and external data marts to be applied.
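The effect of update-in-place operations on trend analysis can be illustrated with a small sketch. The item names, quantities, and snapshot labels below are invented for illustration, and the `take_snapshot` helper is a hypothetical stand-in for whatever snapshot mechanism a given system might use.

```python
import copy

# Current state only: each update overwrites the previous value.
inventory = {"widget": 40, "gadget": 12}

# Periodic snapshots: label -> frozen copy of the state at that moment.
snapshots = {}

def take_snapshot(label):
    """Store an independent copy of the current inventory under `label`."""
    snapshots[label] = copy.deepcopy(inventory)

take_snapshot("month-1")
inventory["widget"] = 35   # overwrites 40
inventory["widget"] = 28   # overwrites 35; the value 35 is never captured
take_snapshot("month-2")

# The only trend recoverable is the coarse one recorded by the snapshots.
widget_trend = [snapshots[label]["widget"] for label in ("month-1", "month-2")]
```

Here the intermediate quantity of 35 is lost because it existed only between snapshots, which is the sense in which snapshot-based time series are coarser-grained than the underlying transactions.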
In some systems, programmatic logic can be maintained in application programs. When executed, the logic generates business reports. The reporting logic typically includes carefully crafted SQL requests to the RDBMS to, for example, create a list of active customers for a given month through analysis of all sales for that month. Modifying such reports or adding sources of data to the repository can potentially require changing both the database schema and the business logic within the application program which accesses it.
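A reporting routine of the kind described above might look like the following sketch, which lists the customers having at least one sale in a given month. It assumes a hypothetical `customers` table and `sales` table with ISO 8601 date strings, and uses Python's standard sqlite3 module; the names and sample rows are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE sales (
        sale_id     INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        sale_date   TEXT NOT NULL,   -- ISO 8601 date, e.g. '2024-03-15'
        amount      REAL NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Alpha Co'), (2, 'Beta LLC'), (3, 'Gamma Inc');
    INSERT INTO sales VALUES
        (1, 1, '2024-03-02', 120.0),
        (2, 1, '2024-03-20', 80.0),
        (3, 2, '2024-04-01', 45.0),
        (4, 3, '2024-03-31', 10.0);
""")

def active_customers(conn, month):
    """Return names of customers with at least one sale in `month` ('YYYY-MM')."""
    rows = conn.execute(
        """
        SELECT DISTINCT c.name
        FROM customers AS c
        JOIN sales AS s ON s.customer_id = c.customer_id
        WHERE strftime('%Y-%m', s.sale_date) = ?
        ORDER BY c.name
        """,
        (month,),
    ).fetchall()
    return [name for (name,) in rows]

march = active_customers(conn, "2024-03")
```

Because the query names specific tables and columns, a change to the schema or the addition of a new sales source would require editing this routine as well, which illustrates the coupling between schema and application logic.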
Other approaches have attempted to address some of these issues. In recent years alternative forms of data storage have been developed which are optimized for interactive analysis and report generation. Some approaches forgo the rigid structure and fast transactional processing capabilities of the RDBMS for a more flexible data layout optimized for performance under the read-oriented query load of report generation and analysis. In such a system, heterogeneous records are grouped together rather than being partitioned into distinct tables; the concept of “schema” is thus less applicable to the overall data layout of the entire database, and more to the particular attributes associated with any given data record. Although popularly called “schema-less” databases, such systems are more accurately identified as “self-describing” or “schema per record” systems.
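The “schema per record” idea can be sketched with plain Python dictionaries, where each record carries its own attribute names and heterogeneous records share one collection. The record types, attribute names, and helper functions below are hypothetical and do not correspond to any particular product.

```python
# Heterogeneous records grouped together; no table-wide schema is declared.
records = [
    {"type": "purchase_order", "vendor": "Acme Supply", "item": "widget", "quantity": 25},
    {"type": "invoice", "customer": "Alpha Co", "amount": 120.0, "due": "2024-04-01"},
    {"type": "waybill", "customer": "Alpha Co", "carrier": "FastShip"},
]

def attributes_of(record):
    """Return the attribute names a single record describes for itself."""
    return sorted(record)

def with_attribute(records, name):
    """Select every record that carries the attribute `name`, whatever its type."""
    return [record for record in records if name in record]

customer_docs = with_attribute(records, "customer")
```

In this sketch a query selects on whichever attributes a record happens to carry, so a record with a new set of attributes can be added without altering any declared layout.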
Such database organizations are distinct from, but may also be combined with, physical storage adaptations such as “column based” rather than “row based” data storage architectures. Such data storage architectures can optimize data read access when data for particular subsets of attributes (i.e., columns) must be evaluated across a wide range of records (i.e., rows), as is often the case during report generation or interactive data analysis.
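The difference between the two layouts can be sketched in memory as follows. The sample attributes and the `to_columns` helper are hypothetical, and the example shows only the logical arrangement of the data, not the on-disk format of any actual column store.

```python
# Row-based layout: one complete record per entry.
rows = [
    {"region": "East", "month": "2024-03", "amount": 120.0},
    {"region": "West", "month": "2024-03", "amount": 80.0},
    {"region": "East", "month": "2024-04", "amount": 45.0},
]

def to_columns(rows):
    """Pivot row records into one list per attribute (assumes uniform keys)."""
    return {key: [row[key] for row in rows] for key in rows[0]}

# Column-based layout: one contiguous list per attribute.
columns = to_columns(rows)

# Row layout: every whole record is visited to read a single attribute.
total_from_rows = sum(row["amount"] for row in rows)

# Column layout: only the "amount" column is read.
total_from_columns = sum(columns["amount"])
```

Both totals are the same; the column layout simply lets an aggregate over one attribute skip the other attributes entirely, which is the read pattern typical of report generation.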
Still, some conventional approaches do not address all the needs associated with understanding and interacting with large volumes of rapidly evolving data.