Some embodiments of the present disclosure are directed to an improved approach for interfacing an R language client with a database engine. More particularly, disclosed herein are methods and systems for providing a transparency layer for interfacing an R language client with a database engine.
Users of the “R” statistics language want models to be authored exclusively in the R language, in a comfortable R language environment. However, legacy R client environments run on small “client” machines with limited memory and data space, which prevents R language-based analysis from being performed on large datasets. Some R client environments implement a client-server model that adds flexibility, yet this does not resolve the problem exhibited by legacy implementations, which require the entire dataset to be resident “in-memory” within the client's memory space. What is needed are techniques for authoring in the R client environment while transparently accessing and manipulating vast data hosted on a large database engine. Moreover, what is needed are techniques for performing the foregoing without the need to rewrite R language code. Prior approaches have introduced new constructs to the language in order to accommodate client-server techniques for accessing and manipulating vast data. In some cases, legacy approaches had to pull the data from the database into the R engine to perform the computation described in the R language and then push the results back to the database. Given the in-memory nature of the R engine, such a solution does not scale, is not secure, and is usually limited to sequential (i.e., non-parallel) execution.
In some situations, such as when the R user is also a database user, the user could write additional scripts in a query language (e.g., SQL) and execute them via a connector (e.g., ODBC), which could shuttle database-resident data. However, R users are typically not database users, and thus the legacy approach has been to perform offline data extractions from the database for each specific R analysis task, often relying on a separate organization such as IT in an enterprise. Such a legacy technique is not only inefficient but also introduces data governance issues when accessing sensitive data, as is often found in an enterprise setting.
Worse, some or all of the following problems exist in the legacy approaches:
    R users need to know SQL, which is not a skill extensively possessed by the R user community at large.
    R users must deal with the transformation of data from a database format into a format that is amenable for processing by the R engine.
    Legacy approaches suffer from a lack of options to exploit parallelism.
    R users must explicitly handle a mixture of output objects from R for storage back into the database.
    Even in situations where an R user and a DBA are co-located or can otherwise cooperate, the hand-off of R code from an R user to a SQL DBA is a non-trivial process.
Therefore, there is a need for an improved approach for interfacing an R language client with a database engine.