1. Field of Invention
The present invention relates generally to the field of computer system management and in particular, to the automated management of creation test/development databases.
2. Terminology
The following terms are referenced in this document. They are helpful in understanding the context of the document and purpose of the product.
As-Needed
Used in this document to refer to rows in a parent table that relate to dependent rows in a child table based upon physical or logical referential integrity.
Database
A collection of information that is organized so that its contents can be easily accessed, managed, and maintained.
Database Query
A command to a database to select, insert, update or delete some of the data contained within it. Structure query language (SQL) is used to perform database queries. While the different database types have varied underlying structures and processes, all databases process SQL queries. A SQL query that is written following ANSI standards should be able to run on all database types.
DBA
Database administrator—maintains the database.
DDL
Database Definition Language—a script used to define the structural components of a database (e.g., tables, indexes . . . ) but not the data within them.
Destination (Replicate or Target) Databases
These are the databases that are (re)created and/or (re)populated, based on information in a source database, as part of the Present Invention's extract process.
Drop
To drop a database component means to delete it from the database (e.g., you can drop a table in a database if you no longer wish to use it). This is a non-recoverable destructive action, meaning the only way to recover a dropped table is from a prior database backup.
Extract Process
In this document, the term refers to the set of steps involved in replicating a database and/or migration of its data.
Foreign Keys
Foreign Keys are physical constraints in a database used to enforce the referential integrity between two tables. Foreign keys can also exist logically but the referential integrity defined by logical foreign keys must be implemented through the applications that use the database.
LOB
Large Object Binary—A large text or binary object that is placed into the database, such as an image file or a MS Word document.
Masking (Blinding)
See Transform, below.
MIS/IS
(Management) Information Systems—This organization, existing within most companies, typically oversees the information maintained by that company's adopted technologies (e.g., relational database systems . . . ).
Parent/Child Relationship
A parent child relationship between two tables indicates how data in these tables is related. The parent record must always exist before the child record does—for instance, the Gender table must contain the entry Male before an entry can be placed into the Employee table, which is related to the Gender table through a foreign key, with a Gender Type of Male.
Primary Key
The combination of one or more columns whose contents uniquely defines a row within a table (e.g., the Client Identification column in a Client table . . . ). While not required, any table containing rows of data that can be uniquely identified should have a primary key.
Proprietary Data/Information
This is information that sensitive. Databases usually contain proprietary business and client information (e.g., client names, pricing data, social security numbers . . . ). This information should be limited to a small number of individuals within the organization and should never be propagated outside of the company. Not only is this information important to the well being of the company but the company also has the responsibility of protecting the client information it maintains.
Referential Integrity
This is the relationship between various tables in a database. This relationship can be defined physically within the database as a set of rules, meaning that any changes made to the data would need to conform to the predefined set of rules in order for the data changes to take place (e.g., each customer must have an address and valid phone number . . . ). The relationship can also be maintained logically through applications that use the database. This is risky since there are ways for users to modify the data in the database directly, without using the application, and get around the rules defined in that application.
Relational Database
A database whose data items are organized in a set of formally described tables from which data can be accessed or reassembled in many different ways without having to reorganize the database tables. Like a spreadsheet, each table is comprised of entries called rows and each row is made up of one or more columns.
Relational Database Vendors/Types
Companies that produce, market, and sell relational databases. Each of these companies has its own relational database type (e.g., Sybase has Adaptive-Server, Microsoft has SQL Server, Oracle has Oracle, IBM has DB2 . . . ). They are also referred to as RDBMS.
Scalar Data
Basic types of data stored in a database such as numbers, character strings, dates, etc. In this document scalar data refers to basic data types but does not include LOB data.
Schema
A schema is a collection of objects within a database owned by a specific user. Typically an Oracle database will contain multiple schemas while Sybase and SQL Server will tend to create separate databases for each “schema”.
Source (Production) Database
The source database is what the Present Invention will use to create the destination database. Typically it is a production database in which a company stores its operational information and upon which its runs its applications that use and share that information. Depending upon how their information is segregated, a company may have one or more production databases and these databases may share information between each other.
Table
A table is the basic repository for storing data in the database. For instance, the Client table may hold client information in the database.
Transform
A transform is a function that manipulates the contents of a particular field in the database (also called a column) during the extract process in such a way as to make it impossible to determine the original value. For instance, the “Standard US Phone Number” transform converts the Phone_Number field in a specified table to a random 10-digit string whose first digit is >=2.
Current Manual Process
In a mature area of technology, such as the relational database field, it is surprising how many recurring database operations are still performed manually. One of these operations is the creation of test and development databases. These databases, typically created from a portion of the company's production database, are created to support various teams performing application development, system testing, production support and other related initiatives. This raises two concerns.
First, the overall replication process is usually inconsistent and manually intensive. It requires substantial input and effort from a variety of resources within the MIS/IS organizations. Data requirements for a test database are usually defined by a combination of the business, development and testing groups. Developers then construct the extract process to pull the information out of the production database and then coordinate the creation and/or loading of the test database with the Database Administration group. This costly effort is multiplied by the need for numerous extract processes required to support different initiatives—for instance, development groups need a smaller specific set of data compared to system performance testers who require a large diverse data set, etc. . . . This cost increases when you take into account the process involved in refreshing data into an existing test database and the possibility of “stepping on” someone in the middle of a test. Additionally, because the entire process is manual, the propensity for errors increases.
The second and perhaps more significant concern is the protection of information within the production (source) database. This is often overlooked during the test database creation process. Production databases usually contain proprietary business and client information. This information should be secured and limited to a small number of individuals within the organization and should never be propagated outside of the company. Not only is this information important to the well being of the company but the company also has the responsibility of protecting the client or internal proprietary information it maintains.
While some companies have undertaken the costly chore of building data generation processes to create production-like data sets, this is the exception rather than the rule. Those that have created such processes need to support them because over the life of the database its underlying structure is changing (e.g., new tables created, some table dropped, new columns added, different rules/constraints apply . . . ). Over time this process becomes either unmanageable, expensive or both. For these reasons, most companies simply load their test databases with a copy of their production data, neglecting the fact that the extracted information will now be accessible by a variety of individuals. Why is it done this way? Simple! The extract-method of building test data is usually performed because it is the fastest, most cost effective way to generate a usable form of the database. The production data is already in the appropriate form to support the database's referential integrity requirements. Of course, having multiple copies of a production database is not an efficient use of space, requiring additional storage expenditures. Occasionally, some effort is made to “scrub” the information, such as the removal of credit card information, while other critical information often remains intact, such as client names, addresses, etc. This “scrubbing” is not usually performed in a consistent manner and is typically a manual afterthought, varying from extract to extract.
Test and development databases, shared by a plethora of in-house and external development, integration and testing resources, are rarely afforded the restricted access policies implemented in production environments. With many databases now going overseas to support cheaper application development, there is no telling who now has access to this information and the liability now being opened by these actions. It is reasonable to assume that some of these underpaid resources could profit from either the inadvertent or intentional sharing of this information. In most cases, companies have no ability to prosecute offshore individuals or organizations that share or use their information for personal gain. Never mind the data being appropriated externally, statistically 67% of data stolen from an organization occurs by internal resources.
There needs to be an automated method for creating test databases that incorporate a comprehensive, standardized means of masking the proprietary information, supporting various concerns such as the conventions set forth in the European data privacy laws and the US Safe Harbor rules, while still allowing the applications using the database to operate normally.
The Solution
Present Invention solves these issues and many more. It automates the process of replicating databases. At the same time, it transforms proprietary information as part of the data extract process assuring the security of the data within the database. It even supports the process involved in refreshing test/development database environments.