1. Field of the Invention
This invention pertains to database access optimization, and more particularly to automatic and transparent denormalization support.
2. Description of the Related Art
Avoiding data anomalies is one of the most important database design considerations. One method of avoiding some data problems in relational databases is normalization. One drawback to normalization is that certain queries become much slower, because it is necessary to merge information from several tables. It is possible to denormalize the normalized tables, but that has the problem that programmers and users must write more cumbersome code.
Database integrity maintenance is a database design consideration of primordial importance. Relational databases, unless well designed, are prone to a number of different types of inconsistancies. Consider the relation schema
DEPT (Name, Number, Budget, Managing Employee, Manager Extension) PA1 DEPT (Name, Number, Budget, Manager, Manager Extension) PA1 Name.fwdarw.Manager PA1 Number.fwdarw.Manager PA1 Name.fwdarw.Number PA1 Number.fwdarw.Name PA1 Name.fwdarw.Budget PA1 Number.fwdarw.Budget PA1 Manager.fwdarw.Manager Extension PA1 DEPT (Name, Number, Budget, Manager) PA1 EMPL (Name, Extension) PA1 DEPT (Name, Number, Budget, Manager, EMPL.sub.-- Extension) PA1 EMPL (Name, Extension).
and suppose a rule that one person can manage more than one department but has only one extension. That schema is subject to several of the problems described in Ullman, J., Principles of Database Systems, 2d ed., p212. Redundancy is the first problem discussed there. It is illustrated in the above relation schema by the fact that if the same employee manages more than one department, that person's extension would be repeated for each such department.
The second problem is potential inconsistency, also known as update anomalies, which in the example occurs when one person manages multiple departments, if the manager's extension changes, but is only updated in the row associated with one of those departments, then more than one extension would be stored for the same person. And, more important, the copies that have not been updated would be incorrect.
The third problem is insertion anomalies. Because the relation stores the telephone extension only for persons who manage a department, it is impossible to use the table to store telephone numbers for other individuals. A way to circumvent that problem, by storing a null-string for Name and Number, is unattractive because those attributes may be keys to the relation, and it is inappropriate to have NULL as a value for a key. That can be intuitively understood, because the DEPT relation holds information concerning departments, NULL is not a department and, thus represents something which does not belong in the relation. Furthermore, if a person who has not previously managed a department becomes a department manager, then there is a risk that the existing entries for that person are not deleted. The result would be multiple phone numbers for the same person.
A fourth problem is deletion anomalies. That problem is illustrated in the example by the fact that if a department is deleted from the database, because it is merged with another department, then the extension of the person who manages that department is lost.
An argument could be made that the database only contains managers, not any other employees, and therefore, the update and delete anomalies are not relevant. However, if we suppose that persons retain their extensions when transferred, then in a reorganization, it would be necessary to update the manager extension field as well as the manager field in the DEPT relation. Also, if the database is to contain extensions for non-manager employees, then the schema would have to contain a separate relation for those persons, which would mean that a manager demotion would involve updating multiple relations.
Thus, it is important in designing a particular database to organize the data in such a fashion as to avoid these types of problems. Normalization is one method for arriving at a schema which inherently avoids data anomalies. For a detailed discussion on normalization see Date, C. J., An Introduction to Database Systems, 3d ed., pp. 237-265, or Ullman, J., Principles of Database Systems, 2d ed., pp. 211-262.
Normalization is a formalistic approach to achieving the maxim that a good database design stores exactly one fact in one place. It begins with analyzing the data to determine functional dependencies. A set of attributes are functionally dependent on another set of attributes in the same relation if for any value of the latter set there is only one value for the former set. Or, a set of attributes A in relation R is functionally dependent on a set of attributes B in R if for every value of B there is only one value for A. A functional dependency may be designated with an arrow, B.fwdarw.A, which is read "A is functionally dependent on B" or "B functionally determines A".
Other types of dependencies are multivalued dependencies and fully functional dependencies. Those are discussed in detail in Date.
Consider the relation schema
If there is only one extension per manager and, if there is only one manager per department and departments are uniquely defined by either Name or Number, then the following are some of the functional dependencies for the relation:
Simplistically, normalization entails removing functional dependencies into separate relations until all tables have been normalized. Hence, a possible decomposition would be to break out the last of those functional dependencies into a separate relation schema:
In those cases where the EMPL.Name is equal to the DEPT.Manager, the EMPL.Extension would be equal to the Manager Extension in the original DEPT schema. Furthermore, the decomposition avoids the anomalies described above. Without discussing the details of various normal forms (that discussion may be found in the previously mentioned references), we make the stipulation that the DEPT and EMPL relation schema are normalized. Because the department manager's extension has to be retrieved from two tables, through a join operation, that query is slower in the normalized database than it would have been in the unnormalized version of the DEPT relation. The price for avoiding data anomalies is slower retrieval.
One method of avoiding the extra time a join requires is to selectively denormalize the database. Whereas normalization entails removing redundancy, denormalization is adding redundancy. In the previous schema the Extension may be added to the DEPT relation, yielding the following schema:
The problem with denormalization is that the programmers and users must word their queries in a manner that avoids data anomalies. The denormalized database is equally prone to inconsistency and other data problems as is an unnormalized database. Therefore, every data manipulation must maintain data integrity. Furthermore, data retrieval must be worded so that information is retrieved from the denormalized relations whenever possible, otherwise denormalization is of no benefit. An additional expense is that if a relation has been denormalized, then all programs which access the denormalized relation, or the base relation from which the denormalization has been obtained, must be rewritten. Hence, performance tuning using denormalization is a very involved procedure.