The present invention relates to electronic data processing, and more specifically concerns automated testing of database management systems using statements having a tight join of a plurality of tables.
Relational database management systems (DBMS), such as Microsoft SQL Server, interpret statements written in a database query language such as Structured Query Language (SQL) to create and manage database objects, to insert and update data, and to perform complex, multilevel queries against huge amounts of data. Testing these systems is recognized throughout the industry as a technical challenge of the first magnitude. SQL and similar database-system interpreters are highly complex. For example, they offer sophisticated optimization techniques and execution planning for queries input on the fly; opportunities for arcane design problems are ubiquitous. At the same time, the state space to be tested is gigantic. For a one-gigabyte database, the possible combinations of database configuration and SQL statement to be executed exceeds 102,000,000,000.
Libraries of test scripts for relational database systems typically contain thousands or tens of thousands of sample statements which are applied to a test database for comparison of their results with known correct data. Existing libraries are known to be inadequate; most commercial database systems produce a constant and substantial stream of reported bugs. However, the amount of work required to generate larger libraries quickly becomes prohibitive.
At the rate of a half hour per hand-written test statement, even a small library consumes more time than does the design of the system that it tests.
In the past, developers have employed some stochastic testing at the language level to accelerate database testing. For example, a test-case generator may choose a random mix of hand-generated fixed scripts. Choosing random parameter values in fixed scripts increases the effective number of test cases. These methods still require painstaking human composition and verification of long, multilevel queries. Automated generation of very simple queries considerably speeds up the generation of test cases, but eliminates the more complex test cases where subtle errors lurk.
In addition, conventional test systems are effectively limited to a fixed database, or to simple variations on fixed data. In order to construct statements that actually execute properly against the target database, the test system must be internally coded to produce only those statements that match the semantics of the database, the names of the database tables and their columns, the particular data types of each column, and so forth. However, testing on only one set of data obviously restricts the range of the tests that can be performed and thus the errors that will be uncovered. In addition, the use of fixed data for many tests does not permit slanting test runs toward certain kinds of applications, or focusing on the kinds of data or database structures that have been found to produce errors.
Also, the length and intricacy of test statements, although desirable for teasing out subtle errors, works against the isolation of bugs which cause those errors. Short, simple statements that produce errors are much more useful for tracking the errors down to particular parts of the DBMS under test.
In addition, automatically generating test SQL statements in a random manner has two practical shortcomings. First, when a test SQL statement refers to data in multiple SQL tables, the set of result rows often includes all combinations of data from rows in the tables in the From clause. This is called a Cartesian product of the tables. As a result, the number of output rows is the product of the sizes of the queried tables, which can impose artificial limits in the sizes of tables in the test database.
A second shortcoming results from the use of randomly generated character string constants. Some SQL database systems support special searching of large text objects. For example, the text objects might be newspaper articles and the search predicate would specify articles that contained two particular phrases xe2x80x9cnearxe2x80x9d each other. xe2x80x98Near could mean anywhere in the same paragraph. Unlike regular character string predicates involving equal, not equal, greater than, etc, the text predicates concentrate on finding words or strings that are in the text. Generating character strings constants with randomly selected characters almost always yields tokens that are not found in the text string. This reduces the effectiveness of random testing since some code paths will not be tested much.
The prior art is this field has not satisfied a longstanding need for fast generation and execution of complex test statements for sophisticated database systems that accurately model results sets used in real world applications.
The present invention speeds up the generation of database test cases by orders of magnitude. A typical generator running on a personal computer having a single 200 MHz microprocessor outputs 700 SQL statements per second, about a million times faster than a human. The queries are complex and can have multiple nested levels. They have valid semantics as well as valid syntax; that is, they will run correctly on a bug-free database system, using whatever sample database is selected for a test run. The statistical and other features of the test cases are configurable. A test operator may choose the syntactic elements selectable in queries or other statements, the frequency of their use, and parameters such as the maximum subquery depth.
Briefly, the invention achieves these and other objectives by reading configuration data containing a set of test parameters, reading the schema of an arbitrary database, then constructing a number of test statements that are syntactically correct for the DBMS being tested, that are semantically compatible with the target database, and that have content and characteristics pursuant to the configuration data. One or more DBMSs under test execute the statements and return result data. Execution errors are detected, as well as result-data differences. Error-producing statements can be converted into greatly simplified statements that provoke the same error, in order to facilitate fault isolation.
Generated SQL statements include predicates that are tightly joined in order to avoid results sets that comprise the Cartesian product of the data in the tables. In one aspect of the test system, a From list contains N tables. A list of N sets of table names is then created. Initially, each table set contains one table name from the From list. With each iteration, pairs of table sets are uniformly selected, and a table is uniformly selected from each table set. A column from each selected table is chosen and a predicate equating, or otherwise relating, the two columns is ANDed into the Where clause. The two selected table sets are merged into one, and the two selected table sets are then removed. The process iterates until a single table set remains.
A further aspect of the system is that text strings to be included in predicates can be selected from a dictionary. The dictionary is created by sampling the text columns in the target database and extracting a random collection of actual words to place in the dictionary. Words from the dictionary can be randomly interspersed with randomly generated words to form argument values for the full text predicates. The fraction of dictionary words used compared to randomly generated words can be a configuration parameter of the automated SQL testing tool. A separate dictionary is built for each text column in the database. The number of words, to be placed in the dictionary can be a configuration parameter (specified either as a constant or as a percentage of the total number of bytes in the text column).
Other features and advantages of the invention, as well as modifications within the scope of the invention, will occur to those having routine skill in the art from the following detailed description, taken in conjunction with the accompanying drawing.