Programs are sequences of logical operations on data. However, there may be many different practical ways to perform the same logical operation. A program comprising a sequence of logical operations can sometimes be written in more than one language, each language having its own syntax, data structures, and high level operations. In large or diverse computing environments there are frequently any number of ways to perform those operations.
In certain scenarios, it is computationally efficient to convert commands and functionality that are written in a first language into another language prior to execution. For example, given a first command in a first language which retrieves data from a relational database (a “relational Source” command) and a second command in a second language which filters data according to some criteria (a “filter” command), it would be computationally efficient to convert the second command into the first language and implement the filter as part of the relational Source command. If the first language is a database query language (such as structured query language—SQL), one way of doing this would be to convert the criteria used in the filter command into parameters inserted as part of a WHERE clause in the relational Source command. The WHERE clause would then be equivalent to the filter command. Additionally, the single relational Source command would implement the same functionality that previously required two different commands and two different languages and the associated resources (two different execution engines, interpreters, etc.).
This type of transformation is referred to herein as a “pushdown” operation. A pushdown operation, which is a type of logic translation, is the process of taking a program written in one language and translating it into an equivalent program written in another language. In this case, the filter command has been pushed down, resulting in saved bandwidth, CPU, and memory, which allows the program to scale much better and run much faster. Relational pushdown optimization is the transformation of certain program logic written in a first language into a second language used with relational databases. Pushdown optimization improves the overall program performance as the source database can sometimes process transformation logic faster than the native environment of the program.
However, the benefits and desirability of pushdown will vary depending on context and the particular command. Using the example of join commands, pushing down a traditional PK(primary key)-FK(foreign key) join (i.e. an inner join with an equality condition on unique columns) is likely to be as beneficial as pushing down a filter command. However, any other kind of join may result in a Cartesian product, which would not be computationally efficient to push down to the relational engine. This highlights an important difficulty: the desirability of a translation may be dependent on its context, which in turn highlights that a translation may be possible without being desirable.
Therefore, a current problem with existing logic translation frameworks (i.e., pushdown systems) is that the particular operations or commands to be pushed down (translated into other languages) must be known in advance, as well as the execution context, and the pushdown system must be tailored to the particular implementation.
A pushdown system could be designed that assumes that all pushdown will grow sequentially out from an adapter boundary and at every point only one kind of translation is available. While the cost of a search for alternative translations of operations in such a pushdown system would be somewhat linear, this type of pushdown system would still have a number of drawbacks.
The first drawback is that components of the system will not be reusable. Translation success will depend not only on the input logic and target domain, but also whether all of the translated pieces can fit together. For example—a Sybase query, which is a type of relational database language query used for Sybase databases, can only contain a maximum number of subqueries, so any translator that wants to include another must have a failure case for when the prior logic is too large. This makes it extremely difficult to use that translator in other contexts—not only is the intricacy of the combination code inseparable from the translation, but the dependency on the prior state also locks the translator to a specific traversal order and direction.
Another drawback is that at any given time, only one kind of translation is possible, even though important context information may not be available. A Data Transformation Machine (DTM), as used herein, refers to a program (a set of modules) that's responsible for actually executing particular transformation logic written in another program (sometimes referred to as an execution DTM or eDTM), analogous to how a Java virtual machine is a program that executes instructions written in Java. The term DTM can also refer to the individual execution of the modules or portions of modules in the program which makes up the larger DTM (sometimes referred to as a standalone DTM or saDTM) and/or the public application programming interface (API) or Data Integration Services that is interfaced with in order to control and monitor execution (sometimes referred to as a logical DTM or LDTM). In Hive, which is a data warehouse built on Apache Hadoop, transformations can be translated for execution on an saDTM, but there's a very high cost to jumping in and out of saDTMs, although consecutive saDTMs can be merged. While other transformations which may be part of the program can also be translated to an saDTM, translating them to native HiveQL (Hive Query Language) has many valuable advantages, so in many cases a translation to HiveQL would be preferable over a translation to an saDTM. However, if a subsequent operation will be translated to use an saDTM, then the cost of jumping in and out of saDTMs outweighs the benefits running a particular transformation in HiveQL. Unfortunately, since each translation of each transformation operation is done in sequence, there is no way of knowing whether a not-yet-translated operation will be translated into an saDTM. This raises problems in determining how to translate particular operations, since the contextual information needed to choose the most efficient option is not available. For example, if the translation of operation X introduces an saDTM, and operation Y can be translated either way, and operation Z might introduce another saDTM, there is no way of knowing at the time of translation of operation Y whether to translate to an saDTM or to HiveQL. The only possible options for translating operation Y in such a situation are: (1) hardcoding the solution, knowing that in some cases it will be wrong, (2) attempting to provide a way to explore both translations, which would be a complex expansion of the framework that wouldn't be useful for any other purpose and would no longer run in linear time, or (3) attempting to delay the decision by capturing all possible translations before moving on to the next, which is far more complicated than simply augmenting the search and is not feasible for most situations.
An additional drawback of such a pushdown system would be that translation can only begin at fixed points. Artificially limiting translation to specific locations can rule out some very desirable execution strategies. However, trying to search for a beginning when it already takes linear time to find the end slows the algorithm significantly.
Furthermore, multiple kinds of translation cannot interoperate in such a pushdown system. Assuming five different engines to choose from for a particular program, for example, a native engine, an SQL engine, a Hive engine, an Informatica Blaze engine, and an Apache Spark engine, four of the engines would require translation to be used. However, if all of the engines have purpose-built translation frameworks that must run to completion before any other execution plan can be considered, there is no feasible way of utilizing more than one engine to execute the program. The only solution to this in such a system is to mandate that only one kind of engine can be in use at a time. However, since each engine supports a different set of operations, it would be necessary to rely on user configuration, heuristics, and even back-out behavior to determine which engine a particular session should use.
As discussed above, there is currently no logic translation system/pushdown system which provides an efficient, extensible, simple, reliable, and composable way to discover and compare alternative implementations.