1. Field of the Invention
This invention relates to systems, methods, and programs for utilizing application development tools, and more specifically, for utilizing a debugger to debug external programs such as stored procedures and user-defined functions that are used in database processing systems.
2. Description of the Related Art
A distributed processing environment includes two or more processing nodes connected by a communications link, i.e., a network. At each processing node, there may be at least one processor, and any assortment of memory devices, storage devices, and input/output devices such as keyboards, displays, etc. The advantages of a distributed processing system include being able to utilize the resources (e.g., files, applications, tools, storage, etc.) resident at another node in the network. However, technical difficulties arise in trying to utilize and take advantage of these resources in a distributed environment. More specifically, technical difficulties arise when parts of an application program execute on different machines, especially when one desires to debug such an application program. Difficulties also arise when a debugger may reside on a different machine than the program to be debugged or may reside anywhere within a network of machines.
When developing a user's program that may be running in a distributed environment and is embedded within a larger application not written by the user, it is often difficult to debug the user's program. For example, for a CICS transaction, a DB/2 stored procedure, trigger, user-defined functions, and code utilizing object-oriented frameworks, the developer of the larger program, in many cases, will only provide an application program interface (API) and a specification of its behavior. Source code for the larger program or information on its implementation is often not provided to allow its developer greater flexibility in making changes such as enhancements and bug fixes. In addition, for distributed client/server applications such as a CICS transaction, DB2 stored procedure, or DSOM server, the application programs are executed under the control of a software system such as the CICS transactions manager, DB2 database management system, or DSOM runtime environment. As such, the software system and various parts of an application program may be executing on different machines. This makes it difficult to run the application program under a debugger because it is not easy to determine when the application program will be invoked by the software system and, in some cases what process and/or machine it will run in. In addition, it is difficult, if not impossible, for a debugger to penetrate all the layers of code in the large software systems in order to find the smaller programs (e.g., stored procedures, user-defined functions, CICS transaction) to debug.
CICS (Customer Information Control System) is a transaction monitoring system which provides a high level of efficiency for executing, including executing concurrently, thousands of multiple short running application programs, i.e., transaction programs. For example, many financial transactions, e.g., debiting accounts, crediting accounts, transfers, etc., may run under a transaction monitor. Transaction monitors that run on workstation platforms such as the UNIX operating system and OS/2 operating system have similarities in their architecture. Such transaction monitors include CICS/6000, Tuxedo, and Encina, to name a few.
The technology of running external programs on the server side of a Relational Database Management System (RDBMS) has been developed in the past few years. For example, DB2/Common-Server for AIX/6000 (DB2/CS) supports external programs. External programs are written by the application developer in a host 3GL language such as C and C++. They could be invoked by the RDBMS in one of two forms: (1) as a subprogram running inside of the server's "address space" (also referred to as "process"), or (2) as a stand-alone application running on the server machine. The former is usually referred to as User-Defined Functions (UDF), and the latter is referred to as Stored Procedures.
A stored procedure allows the application developer to break a database application program into a client part and a server part. A stored procedure is a precompiled program that is stored at the server site and invoked from the client. The server part can issue SQL language request while running on the same machine as the RDBMS server, such as a DB2/CS server. Results from the execution of the stored procedure can be passed back to the client part which is usually running on a different machine. In many applications, this can greatly improve the performance of the database application. Stored procedures reduce the network traffic between the client and the server. Other advantages include concealing a variety of system-specific and/or database-specific details from the user, thereby providing a greater degree of data independence; sharing a stored procedure by many clients; and providing enhanced security by authorizing a given user to invoke a given procedure but not to operate directly on the data accessed by that procedure.
In addition, the following described invention is also applicable to triggers. A trigger allows a user to associate certain SQL operations with a certain event. A trigger is a procedure that is to be invoked, by the database management system, when a specific condition occurs. For example, a trigger could be invoked to check the range value of employees' salaries whenever an SQL update is made to an employee/salary table to ensure that no salary goes above a certain amount. More specifically, an external trigger is a trigger than runs externally from the system software, i.e., a DBMS.
UDF (User-Defined Function) is, as its name implies, a function that takes in input and returns single value results. As a function, it must be embedded in an SQL expression. UDF is an innovative feature introduced in release 2 of the IBM DB2/6000 relational database system. In addition, the new SQL3 standard provides a definition for user-defined functions. A user-defined function extends the functionality of the underlying DBMS (Data Base Management System) by allowing users to define their own functions implemented in host languages such as C and C++. Once created in the DBMS, UDFs can be invoked from any context where a SQL expression is expected or invoked from within any SQL expression as if they were built-in functions.
In a practical implementation, UDF could be run under two kinds of environment. For better security, a "firewall" is created between the database engine run-time and the UDF run-time. This is normally achieved by running UDF code in a separate process which is different from the engine process. Such a UDF is called a fenced UDF. In contrast, for better performance, UDF code is run in the same process as that of the database engine. Such a UDF is called an un-fenced UDF.
There are two components associated with a UDF: the specification component and the realization component. The former is specified by the create function SQL statement and maintained by the DBMS. The latter is specified by code written in a host language, prepared using the host language compiler and linkage editor, and maintained by the operating system. Different UDFs can share the same implementation. In other words, one realization component can be associated with more than one specification component. One situation where this can occur is when two UDFs have the same functionality but different runtime environments: one runs in a fenced mode, and the other runs in an unfenced mode.
The first step of creating a UDF, i.e., the specification component, is to register its name and attributes by executing a create function SQL statement. Taking find_string as an example shown below, the specification component is expressed by the following create function statement which specifies that UDF find_string:
takes a LONG VARCHAR argument and a VARCHAR(128) argument as inputs, PA1 returns an INTEGER as output, PA1 is implemented by an external C function with the same name which is contained in the library udf_string in the path /afs/almaden.ibm.com/u/fuh/udf/lib. PA1 is prepared to run in an unfenced environment, and PA1 has various other attributes which are of no interest in this discussion. PA1 cc -c -o udf_string.o udf_string.c PA1 ld -o udf_string -bE:udf_string.exp -H512 udf_string.o PA1 SELECT*FROM test_table WHERE find_string(text,`User-Defined Function`) &lt;&gt;0 PA1 `. . . can be achieved by using User-Defined Function` PA1 Agent process initializes UDF process in which the target UDF will run. UDF process could be either agent process itself (for unfenced UDF) or a new process spawned by agent process (for fenced UDF). In both cases, UDF process runs as a daemon process. This is a one-time initialization meaning it takes place only for the first time UDF is invoked. PA1 Agent process prepares arguments in the format which is ready to be passed to the external function. PA1 If the associated external function is not resident in memory, UDF process loads the appropriate library into its address space. Such a library will be unloaded at the end of the current transaction. PA1 UDF process resolves the entry point for the desired external function. PA1 Agent process requests UDF process to execute the external function. PA1 Agent process extracts the function result and converts it to DBMS's internal format.
For example, the following statement registers a new function find_string with the DBMS:
CREATE FUNCTION find_string(LONG VARCHAR, VARCHAR(128)) RETURNS INTEGER LANGUAGE C EXTERNAL NAME '/afs/almaden.ibm.com/u/fuh/udf/lib/udf_string!find_string' NOT FENCED NOT VARIANT NO EXTERNAL ACTION PARAMETER STYLE DB2SQL NOT NULL CALL NO SQL;
The first two lines define the interface between the UDF find_string and the context in which it is invoked. The following two lines indicate that find_string is implemented by the C function find_string of the library udf_string located in the path /afs/almaden.ibm.com/u/fuh/udf/lib. The rest of the statement is of no interest in this discussion.
The realization component is prepared by compiling and linking the UDF library with appropriate options. The following command lines show a typical way of preparing UDF libraries. udf_string.c is the source file containing the external function find_string, and udf_string.exp is the export file exporting the symbol find_string.
As such, the second step is to build a C library, udf_string, which contains external function find_string:
 void find_string (char *text /*The text in which sub_string is searched*/ char *sub_string,/*The sub-string */ long *position, /*Position in text where sub_string is found*/ short *null_ind_i1, /*First input null-indicator */ short *null_ind_i2, /*Second input null-indicator */ short *null_ind_r, /*Result null-indicator */ char sqlstate, /*Error code issued by UDF */ char *udf_func_name, /* UDF function name */ char *udf_spec_name, /* UDF specific name */ char *msg_text) /* Msg. text returned by UDF */ { char *start, *cptr1, *cptr2; int start_position; /* Initialization */ start = text; start_position = 1 ; /* Set up new string for comparison */ try_again: *position = start_position++; cptr1 = start++; cptr2 = sub_string; /* Compare sub_string with the new string */ /* NOTE: the first two if-branches should be swapped */ compare: if (*cptr1 == '.backslash.0') { /* End of text reached; not found */ *position = 0; return; } else if (*cptr2 == '.backslash.0') { /* End of sub_string reached; found */ return; } else if (*cptr1 == *cptr2) ( /* So far so good; continue comparing */ cptr1++; cptr2++; goto compare; ( /* Comparison fails; try next string */ goto try_again; }
The purpose of find_string is to return the position in text of the first occurrence of sub_string. If sub_string does not occur in text, the return value is 0. At run-time the DBMS prepares and passes actual arguments to external C function find_string via its first two parameters, text and sub_string. The return value is passed back to the DBMS via the third parameter, position, and the sixth parameter null_ind_r which indicates when the result is null. The other parameters of find_string are unrelated to this discussion.
With both the specification component and realization component, the DBMS is equipped with all the information needed to execute UDF find_string as if it were a built-in function. Once the C library is installed in directory /afs/almaden.ibm.com/u/fuh/udf/lib, UDF find_string can be invoked from within any SQL expression. The usage of UDF is demonstrated in the SQL statement below. Let test_table be a table which contains a column text of type LONG VARCHAR. The following query can now be issued to extract all the rows of test_table where the text column contains the substring "User-Defined Function":
Without UDF, an SQL statement is intrinsic in the sense that its semantics are predefined by the underlying DBMS. In other words, given a reliable implementation of the DBMS, one can assume that the result returned by the DBMS always respects the intent specified in the SQL statement. If find_string were a pre-defined function, one would expect that the above statement computes the set of rows of test_table where the text column contains the substring "User-Defined Function". However, such an assumption does not necessarily hold in the presence of UDF. In fact, the C code given above contains a couple of defects: first, the null result indicator is not initialized, which may cause the function to appear to return NULL; second, it does not catch the situation where the substring "User-Defined Function" ends the text. As a consequence, rows with the following text content will be excluded from the result:
Despite the simplicity of the C function find_string, it may still take a while to figure out that the defect is caused by misplacing the "end-of-string" checking for text before that of sub_string. In general a UDF can use any feature of the host language and be arbitrarily complex; hence, UDF debugging becomes an important issue.
The flexibility offered by UDFs comes with the responsibility to ensure correctness. If the UDF logic contains an error, incorrect results or failure of the SQL expression may occur. In a client-server environment such as IBM DB2/6000, none of the conventional debugging methodologies works for a UDF.
The ability to easily debug a UDF running online, that is in the environment of the DBMS, is a desirable feature, especially in the support of the development of UDF. Online debugging support complements the stand-alone development and testing of UDF code by exposing behavior unique to the UDF runtime environment and making it convenient to reproduce the situation causing the failure, compared to developing a stand-alone driver and equivalent data set.
The following describes a UDF run-time environment, as well as the difficulty with debugging UDF code in the online DBMS. FIG. 1 characterizes a RDBMS run-time environment in which the technology of user-defined functions, stored procedures and triggers are applicable. On the server side 191, there is a set of control processes, represented by the DBMS Control Unit box 193, created at database instance creation time. The control unit listens to the client side 192 to receive the next "connection" request. For each connection request, the control unit spawns a new set of processes which is referred to as an agent unit 195, 197. Therefore, each client connecting to the DBMS has a set of processes (agent unit) associated with it. The newly created agent unit 195 is connected to the corresponding client 194 for receiving and serving the subsequent database requests. Among these processes, (agent unit), there is a distinguished one called agent process which repeatedly takes a request from client, serves the request, and returns the result, if there is any, back to the client. As such, the agent unit consists of an agent process 199 and, optionally, a set of fenced UDF processes 181 and/or a set of stored procedure processes 183.
The agent process 199 runs the major part of the database engine code and is therefore also referred to herein as the "database engine." The agent process takes the request from the client and distributes the request to various service components of the database engine to accomplish the requested action. Typical service components of a RDBMS include a front-end router, a query compiler, catalog service, data management, buffer management, security management, concurrency-control, interpreter and run-time service, etc. Besides the database engine code, the agent process also runs un-fenced UDF code 185 and the run-time code that supports the un-fenced UDF.
A fenced UDF process runs the run-time code which communicates with agent process, dynamically loads UDF library, and runs fenced UDF code. The number of fenced UDF processes varies from one implementation to another. For example, there could be one fenced UDF process shared by all the fenced UDFs invoked in the current application or there could be more than one such processes, each serves a set of fenced UDFs that are run under the same user ID. The technology presented in this invention is not dependent on the number of fenced UDF processes.
In many aspects, a stored procedure process has the same run-time characteristics as that of a fenced UDF process. Unlike a fenced UDF process, a stored procedure process runs the external program as if it were a stand alone application sitting on the server machine. In general, there could be more than one stored procedure processes associated with a client program.
FIG. 2 shows an example of a client program 194 named myclient that executes a stored procedure 183 named mysp. The stored procedure then executes a SQL SELECT command that contains a call to a fenced UDF myudf.
FIG. 3 shows the same example as FIG. 2 except that the fenced UDF named myudf has been changed to an un-fenced UDF 185. In this case the un-fenced UDF is run in the agent process.
Referring back to FIG. 1, to create a UDF, the client sends a create function statement as a request to agent process. In response to this request, agent process analyzes the statement and registers all the attributes of the UDF being defined by the statement in the DBMS. To use an already registered UDF, the client requests the agent process to execute a DML statement with a UDF invocation in it. In response, the agent process interprets the statement and returns the result to the client. As part of the interpretation, the agent process invokes the top-level run-time support routine for the evaluation of the UDF. The following activities are performed by UDF run-time support routines:
Three characteristics of the run-time environment are of special interest. First of all, UDF code runs at the server site as opposed to within the application's process space. Therefore, ordinary program debugging methodology won't work for UDF. Secondly, UDF process may not exist until the first UDF is invoked and it is difficult to predict when the UDF library is loaded into memory. As a consequence, the debugging methodology based on process attachment does not work for UDF either, as process id and debugging information of UDF library cannot be known in advance. It might seem that this problem can be solved, by having the system post the process id in a file where the debugger can access it. Unfortunately, for security purposes, UDF process normally runs under a designated UID unrelated to a client's UID. This makes the UDF process unattachable from local applications, not to mention applications running as remote clients. The reloading of UDF libraries is one example of a factor of the runtime environment that could potentially confuse the user. The user may wonder when his code is being refreshed, why static variables are being reinitialized, etc. As such it illustrates the utility of a debugging methodology which can lend insight into the UDF runtime environment. The last characteristic of the UDF runtime environment is that it runs as a daemon process with the standard I/O file handles disabled. Therefore, run-time status of UDFs can not be made observable by inserting printf statements. All of these characteristics make UDF debugging very difficult using existing debugging methodologies. This could affect the productivity of application development using UDF and hence discourage users from using UDF.
As such, there are three obstacles to debugging UDF code using a conventional symbolic debugger. These obstacles are especially apparent when a debugger is expected to be initiated from the application's site.
Timing
A UDF process is created on the demand and the external library is loaded/unloaded dynamically. Unloading occurs at the end of a transaction, such as when a COMMIT is performed. There is no convenient way to inform users when UDF code is available for debugging.
Authorization
For security purposes, the DBMS process usually runs under a special UID so that normal users cannot attach to it.
Remote debugging
In general, a remote user does not have a user account on the server machine. This makes it extremely difficult, if not impossible, to debug UDF process on the server machine.
As shown above, external programs are not statically linked with any executable module. Instead, they are dynamically loaded by the database engine when the associated UDF, stored procedure, or trigger is invoked. This makes it extremely difficult to debug external programs running in a client-server environment. This run-time environment has no practical debugging method. The current practice is to write a test driver program which simulates the RDBMS call to the external program.
The main problem with debugging external programs is that the external programs are executed under the control of the database engine which is itself a large software system for which no source code is provided. It is therefore impractical for a debugger to penetrate through the layers of software of the database engine to locate and debug the external programs. It is also very difficult for the debugger to determine when an external program will be invoked by the database engine and which process it will be run in. External programs are not statically linked with any executable module. Instead, they are dynamically loaded by the database engine at run-time when the associated UDF or stored procedure is about to be invoked which further complicates the situation for the debugger. In addition, in an environment where the DBMS is shared between a large number of users, it is necessary to ensure that the debugger does not violate the security of the DBMS or the underlying operating system.
Debugging multi-threaded applications in a distributed environment is known. One approach is to attach a debugger to the main software system and attempt to penetrate all of the software layers to get to the desired part of the program to be debugged. Another approach will stop execution in all of the threads and processes in order to debug an application running in one of the threads. Stopping the execution of processes in a single user environment may be acceptable, but such an approach is unacceptable in multi-user environments or where hundreds of processes may be running as in a CICS transaction monitoring system.
The Parallel and Distributed Dynamic Analyzer (PDDA) is a parallel and distributed debugger that can debug parallel and distributed applications including distributed middleware (e.g., CICS/6000, DCE and DSOM) and DB2/6000 application programs that use RPC and threads (see the PDDA manual).
It is also known to have "application program initiating" debugging wherein an application program, itself, requests the services of a serial debugger when the debugger and application are running on the same machine.
However, it is not known to have an application program, itself, initiate debugging in a distributed environment where the debugger may be running on a different machine, or on several machines, different from the machine the application program is running on. Furthermore, it is not known to have "program-initiating" debugging for RDBMs external programs or transaction programs in a distributed environment. In a distributed environment, a problem arises in not being able to locate the machine or process within the machine that a desired debugger is running on. This difficulty is similar for locating any application development tool desired by an application running in a distributed processing environment.