The invention relates to the field of data processing, and more particularly to the management of analytic processing against databases to distribute processing tasks to necessary processing resources.
The increase in enterprise software, data warehousing and other strategic data mining resources has increased the demands placed upon the information technology infrastructure of many companies, academic and government agencies, and other organizations. For instance, a retail corporation may capture daily sales data from all retail outlets in one or more regions, countries or on a worldwide basis. The resulting very large data base (VLDB) assets may contain valuable indicators of economic, demographic and other trends.
However, databases and the analytic engines which interact with those databases may have different processing capabilities. For instance, a database itself, which may be contained within a set of hard disk, optical or other storage media connected to associated servers or mainframes, may contain a set of native processing functions which the database may perform. Commercially available database packages, such as Sybase(trademark), Informix(trademark), DB2(trademark) or others may each contain a different set of base functions. Those functions might include, for instance, the standard deviation, mean, average, or other metric that may be calculated on the data or a subset of the data in the database. Conversely, the analytic engines which may communicate with and operate on databases or reports run on databases may contain a different, and typically larger or more sophisticated, set of processing functions and routines.
Thus, conventional statistical packages such as the SPSS Inc. SPSS(trademark) or Wolfram Research Mathematica(trademark) platforms may contain hundreds or more modules, routines, functions and other processing resources to perform advanced computations such as regression analyses, Bayesian analyses, neural net processing, linear optimizations, numerical solutions to differential equations or other techniques. However, when such an engine is coupled to and operates on data from separate databases, particularly but not limited to large databases, the division of the necessary computations between the engine and the database may not always be optimized.
For instance, most available databases may perform averages on sets of data. When running averages on data, it is typically most efficient to compute the average within the database, since this eliminates the need to transmit a quantity of data outside the database, compute the function and return the result. Moreover, in many instances the greatest amount of processing power may be available in the database and its associated server, mainframe or other resources, rather than in a remote client or other machine.
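The trade-off described above can be illustrated with a minimal sketch. It assumes an in-memory SQLite database and a hypothetical `sales` table; the function names are illustrative only and are not part of the described system. Both functions return the same average, but the second must move every row out of the database first.

```python
import sqlite3


def make_sales_db():
    """Build a small in-memory database with a hypothetical sales table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (store TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("north", 10.0), ("north", 20.0), ("south", 30.0), ("south", 40.0)],
    )
    return conn


def average_in_database(conn):
    """Let the database compute the average; only one row leaves it."""
    (avg,) = conn.execute("SELECT AVG(amount) FROM sales").fetchone()
    return avg, 1  # (result, rows transferred out of the database)


def average_outside_database(conn):
    """Fetch every row and compute the average in the client instead."""
    rows = conn.execute("SELECT amount FROM sales").fetchall()
    values = [row[0] for row in rows]
    return sum(values) / len(values), len(rows)


if __name__ == "__main__":
    db = make_sales_db()
    print(average_in_database(db))
    print(average_outside_database(db))
```

With four rows the difference is negligible; with a very large table the second approach transfers the entire column to obtain a single number.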
On the other hand, the analytic engine and the associated advanced functions provided by that engine may only be installed and available on a separate machine. The analytic engine may be capable of processing a superset of the functions of the database and in fact be able to compute all necessary calculations for a given report, but only at the cost of longer computation time and the need to pass data and results back and forth between the engine and database. An efficient design for shared computation is desirable. Other problems exist.
The invention overcoming these and other problems in the art relates to a system and method for multipass cooperative processing which distributes and manages computation tasks between database resources, analytic engines and other resources in a data network. While other systems have been capable of processing part of a SQL request in the database and the other part in an analytical engine or process in a single-direction manner, various embodiments of the present invention provide for iterative, multi-directional processing of an entire report being processed against the relational database system.
The present invention provides a process for handling multiple steps in a calculation iteratively between a controlling module, a database and an analytical engine external to the database. In this processing environment, some of the calculations or functions to be performed on the data may be performed by the database itself and other calculations or functions may be performed by the external analytical engine. The controlling module, which resides outside of the relational database, receives a report request or other non-SQL request. The controlling module monitors each step in the processing of the report, acting as director over the activities to maximize efficiency and to handle complicated multi-sequence calculations so that they do not result in an error.
The controlling module generates the SQL statement needed to be executed against the relational database. Upon generation of the SQL, the controlling module directs an initial query to the database to resolve one step in the multi-step calculation (e.g., fetching, filtering, calculation or aggregate operations). The controlling module then generates a fetch operation to retrieve the data produced by the initial query outside of the database (and the database's control). The controlling module then passes at least some of the data produced by the initial query to the external analytical engine to perform one or more processing steps on the data. The controlling module then receives the processed results from the external analytical engine and transfers data from that result back into the originating database (e.g., in a database table) or some other database instance. Once the data is in the originating database or the other database instance, the controlling module may direct that further processing occur using the originating database, which now includes the data processed by the database and the external analytical engine. To do so, the controlling module may generate another SQL statement. That further processing may be done by the database, or data may be fetched and provided to the external analytical engine, or both. These steps may continue in any order or sequence and as many times as desired until all of the processes are completed, with the controlling module generating SQL to perform various calculations or operations. Thus, the present invention allows for multiple levels of nested calculations including calculations that may be performed by the database and those that may be performed by the external analytical engine.
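The sequence above (query, fetch, external processing, write-back, further query) can be sketched as follows. This is an illustrative sketch only, not the implementation of the invention: it assumes SQLite as the database, a plain Python function standing in for the external analytical engine, and hypothetical table and function names (`sales`, `store_totals`, `store_scores`, `external_engine_zscores`, `run_report`).

```python
import sqlite3
import statistics


def external_engine_zscores(rows):
    """Stand-in for the external engine: z-score of each store's total."""
    totals = [total for _, total in rows]
    mean = statistics.mean(totals)
    stdev = statistics.pstdev(totals)
    return [
        (store, (total - mean) / stdev if stdev else 0.0)
        for store, total in rows
    ]


def run_report(conn):
    """Controlling-module sketch: two database passes around one engine pass."""
    # Pass 1: the database resolves the aggregation step.
    conn.execute(
        "CREATE TEMP TABLE store_totals AS "
        "SELECT store, SUM(amount) AS total FROM sales GROUP BY store"
    )
    # Fetch the intermediate result outside of the database.
    rows = conn.execute(
        "SELECT store, total FROM store_totals ORDER BY store"
    ).fetchall()
    # Hand the data to the external engine for a non-native calculation.
    scored = external_engine_zscores(rows)
    # Transfer the engine's output back into the database as a table.
    conn.execute("CREATE TEMP TABLE store_scores (store TEXT, zscore REAL)")
    conn.executemany("INSERT INTO store_scores VALUES (?, ?)", scored)
    # Pass 2: further SQL that uses both the database and engine results.
    return conn.execute(
        "SELECT t.store, t.total FROM store_totals t "
        "JOIN store_scores s ON s.store = t.store "
        "WHERE s.zscore > 0 ORDER BY t.store"
    ).fetchall()


def make_report_db():
    """Hypothetical sample data for the sketch."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (store TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("a", 10.0), ("a", 20.0), ("b", 5.0), ("c", 40.0), ("c", 20.0)],
    )
    return conn
```

Calling `run_report(make_report_db())` returns the stores whose totals lie above the mean. A fuller controller would repeat the fetch, process and write-back cycle for as many passes as the report requires.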
The controlling module then provides the ability to pass the result back to the requesting system. Also, the controlling module may direct processing to different databases so that various intermediate results are transmitted to other databases for storage or processing. Thus, in one sequence, data could be retrieved from database A, processed by external analytical engine 1, transmitted to database B, processed with data from database B, transmitted back into database A, processed again by external analytical engine 2, and then passed back to the requester.
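The example sequence across database A, engine 1, database B and engine 2 might look like the sketch below. Everything in it is assumed for illustration: two in-memory SQLite databases stand in for databases A and B, and `engine_1` (a doubling "forecast") and `engine_2` (a ranking) are hypothetical placeholders for real analytical engines.

```python
import sqlite3


def engine_1(rows):
    """Hypothetical engine 1: a placeholder forecast that doubles each total."""
    return [(store, total * 2) for store, total in rows]


def engine_2(rows):
    """Hypothetical engine 2: rank stores by gap, largest first."""
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return [
        (store, gap, rank)
        for rank, (store, gap) in enumerate(ordered, start=1)
    ]


def run_sequence(db_a, db_b):
    # Retrieve aggregated data from database A.
    totals = db_a.execute(
        "SELECT store, SUM(amount) FROM sales GROUP BY store"
    ).fetchall()
    # Process with external engine 1, then transmit the result to database B.
    db_b.execute("CREATE TABLE forecasts (store TEXT, forecast INTEGER)")
    db_b.executemany("INSERT INTO forecasts VALUES (?, ?)", engine_1(totals))
    # Process together with data that lives only in database B.
    gaps = db_b.execute(
        "SELECT f.store, f.forecast - t.target FROM forecasts f "
        "JOIN targets t ON t.store = f.store"
    ).fetchall()
    # Transmit back into database A, then process with external engine 2.
    db_a.execute("CREATE TABLE gaps (store TEXT, gap INTEGER)")
    db_a.executemany("INSERT INTO gaps VALUES (?, ?)", gaps)
    rows = db_a.execute("SELECT store, gap FROM gaps").fetchall()
    return engine_2(rows)  # result passed back to the requester


def make_demo_databases():
    """Hypothetical sample data for databases A and B."""
    db_a = sqlite3.connect(":memory:")
    db_a.execute("CREATE TABLE sales (store TEXT, amount INTEGER)")
    db_a.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("north", 10), ("north", 20), ("south", 5)],
    )
    db_b = sqlite3.connect(":memory:")
    db_b.execute("CREATE TABLE targets (store TEXT, target INTEGER)")
    db_b.executemany(
        "INSERT INTO targets VALUES (?, ?)", [("north", 50), ("south", 25)]
    )
    return db_a, db_b
```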
In one embodiment of the invention, calculations native to a given database platform may be trapped and executed in the database, while other types of functions are transmitted to external computational resources for combination into a final result, such as a report executed on the database. In another regard, the invention may permit data including intermediate results to be passed between the computing resources on a cooperative or collaborative basis, so that each computation may be located at its necessary or most efficient processing site. The exchange of data may be done in multiple passes.
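One way to picture the trapping of native functions is a simple routing table, sketched below. The set of native functions and the names `NATIVE_FUNCTIONS`, `plan_steps` and `group_into_passes` are assumptions made for illustration; an actual embodiment would determine native capability per database platform.

```python
# Assumed set of functions the database platform can execute natively.
NATIVE_FUNCTIONS = {"sum", "avg", "min", "max", "count"}


def plan_steps(requested):
    """Assign each requested function to the database or the external engine."""
    plan = []
    for fn in requested:
        site = "database" if fn.lower() in NATIVE_FUNCTIONS else "engine"
        plan.append((fn, site))
    return plan


def group_into_passes(plan):
    """Merge consecutive steps at the same site into a single pass."""
    passes = []
    for fn, site in plan:
        if passes and passes[-1][0] == site:
            passes[-1][1].append(fn)
        else:
            passes.append((site, [fn]))
    return passes


if __name__ == "__main__":
    steps = plan_steps(["sum", "avg", "regression", "count"])
    for site, functions in group_into_passes(steps):
        print(site, functions)
```

Grouping consecutive steps by site keeps the number of data exchanges between the database and the engine to one per change of site, which is the multipass behavior the paragraph describes.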