Many large information management entities and cloud providers are looking at Massively Parallel Processing (MPP) appliances as well as derived data sources to offload central processing unit (CPU) intensive workloads. This has several benefits as CPU activity can be offloaded from “expensive” environments onto cheaper solutions (where expense is based on CPU consumption). Queries can also be offloaded to environments that are custom built to handle those types of queries.
The offload of queries is generally managed by a DBMS optimizer or by some application level logic to determine: (1) if the data exists on the target environment; (2) if the syntax of the query is supported by the target environment; (3) if the query is well suited for the environment (e.g., through costing optimization); and (4) if any applicable configuration parameters, settings, or environment variables allow routing to the environment.
In order for these appliances or offload sources to be able to execute such “offloaded” queries, copies of the data must be created and refreshed at the offload source. This can be done through a number of load options that will each lead to varying degrees of data latency in the data copies, in which data latency is defined as the data currency between a source table and a derived source. Each option has a different cost. For instance, data replication is often costlier than doing a nightly bulk load of data. Therefore, most entities will replicate a subset of the data and choose other load options for the remainder of the data. The decision on which option to use is driven by the needs of the application user, and the cost/benefit of maintaining the data at the requested latency.
A significant problem for users wishing to take advantage of accelerators or other secondary data sources is that database administrators (DBAs) have difficulty determining when and how best to create additional copies of data (i.e., derived sources) on those facilities. Today, determining whether a query can be offloaded to a secondary source is based on a number of parameters. One of the most important of those parameters is the data latency that the end user can tolerate. This latency tolerance must be directly communicated to the DBA's. If it is not communicated, the application's queries will access the source data. If the latency goal is communicated to the DBA, the DBAs still have no way to decide which load strategy they should use, other than to cater to that application user's specific need. This is problematic in that one user may require a refresh level or latency of less than one day, whereas another user may require a latency of less than one hour.
Depending on the cost of copying data to the offload sources and the value derived from offloading queries in the application, it may, e.g., be beneficial to maintain: a derived source with less than one hour latency to cater to queries in all applications; maintain a derived source with less than one day latency to cater to queries in a single application, and allow queries from other applications to go against the source data; or allow queries from all applications to go against the source data.
This decision really depends on the cost of the various data copy options, and the benefit from being able to offload each application's queries. The benefit can only be determined by looking at the number of queries that can tolerate a specific latency level, determining how many of those queries are candidates for offload, and then looking at the business benefit (e.g., cost savings, performance improvements, batch window shortening, etc.) from offloading those queries.
Significant challenges DBAs or cloud providers are having include:
I. The current method(s) of determining a data replication/refresh strategy for derived sources typically does not take into account the return on investment (ROI) from creating those derived sources. Much of the cost data (e.g., load process cost) is typically not available. Because of this, creation/maintenance of the derived sources may end up costing the organization rather than creating value. Creating a load/replication strategy should take into account: (1) Cost of each of the load option (including network and target specific apply costs); (2) The minimum latency that can be achieved with each load option; and (3) The number of queries and the CPU/elapsed time that can be offloaded for each load option.
II. DBA's do not have access to the aggregate data source latency requirements of the applications/incoming queries for each data source. The aggregate requirements are important to understand the value that can be generated by creating/maintaining the derived source (e.g., data source, number of queries, aggregate query CPU time, aggregate query response time, latency tolerance threshold, etc.).
III. DBA's do not have a way to analyze individual queries based on latency requirements. This is important to understand which queries in a latency group are capable of being offloaded. Only those that can be offloaded will contribute to the “value” component of an ROI computation. Queries may not be candidates for offload because of low estimated CPU/run cost or the syntax not supported in the offload environment.
Today, the communication of latency goals between the application teams and the DBAs is generally through direct communication (if at all). Thus, assuming a new query is developed that requires a lower latency, this would have to be communicated between the application teams and the DBA, meaning that the application teams have to be knowledgeable about the underlying data. In cases where such communication does not occur, the workload is defaulted to the source data because the application teams are unaware that making latency requirements known could reduce costs. For Materialized Query Tables (MQTs), the latency expectations can be driven using the “current refresh age” and the “current maintained table types for optimization” special registers. However, those only help the optimizer match the query request to already existing system maintained MQTs. It does not help advise the DBA on a table creation and loading strategy for derived sources based on latency goals. Moreover, because the optimizer canning make runtime decisions, DBMS driven MQT/derived source matching is not possible with static SQL. This is a major challenge, i.e., knowing how to refresh derived sources and segment workloads in such a way that applications that need real time data continue to run against the operational data, while applications that can tolerate latency are run against the derived source when possible and appropriate.
By giving the DBAs knowledge of application level latency requirements rolled up by data source, a cloud provider or search optimization provider could make copies of data available on lower cost solutions (cost of storage, cost of processing, cost of maintenance, etc.) and re-route application requests to the lowest cost environment based on latency expectations.