Insurance companies deal with large amounts of data. This data includes: account information, such as insurance premiums, deductibles, coverage limits, and the like; profile information, such as social security numbers, age of each individual covered by a plan, marital status, employment information, residence address, length of commute to work, accident reports, driving records and the like; and property information, such as the types of properties covered by the insurance including homes, boats, cars, recreational vehicles and the like. By way of non-limiting example, an insurance company may have records about an individual representing the type of car that is owned including make, year, model, color, where the car is domiciled, whether the car is garaged, driveway parked, or parked on the street, the individual's age, residence, and commute distance, family history, and medical conditions. An individual's information may be expanded depending on the number of insurance or other products that the individual has with an insurance company. Factor this amount of data with the hundreds of thousands or even millions of individuals that an insurance company represents, and it is evident that the amount of data is extremely large.
In order to handle the data efficiently, insurance companies have turned to databases, and in particular to relational databases. A relational database is a grouping of data using common characteristics within the data set. The performance of a relational database is based on the amount of data that is included in the database, the commonality of the characteristics, and the query optimizer of the database. As the amount of data contained in relational databases has increased in the growing computer explosion, and the fact that the computer age has allowed insurance companies to incorporate and track more information, the performance of these relational databases has suffered.
A performance degradation of relational databases can have a dramatic impact on insurance companies. For example, insurance companies reduce the queries that are performed, thereby working with less information than an optimized database can provide. Additionally, the backup file sizes for these ever growing databases impart an ever increasing cost based on the size and the ability to create backup files. Further, random access memory and buffer need to be larger and/or more data pages need to be processed from disk.
This leaves the query optimizer as a driver of reversing the trend of slower relational datable performance as the optimizer minimizes the impact associated with the overall size of the database. The query optimizer operates by understanding facts about the data in the database. How well these facts are known, how many facts are understood, and the ability to collect the underlying facts determine the ability and functionality of the query optimizer.
In order to clearly set forth the issue related to the query optimizer and statistics collection, an analogy will focus on two rooms of boxes with a goal of moving all of the boxes together into a single room. By way of analogy, the rooms are data tables in a relational database and the boxes represent rows in the tables. Having two rooms presents two options—move all of the boxes from the first room to the second room or vice versa. Not knowing anything about the rooms, the boxes or the number of boxes in each room, makes the decision of which set of boxes to move blind as to which option is cheaper and more efficient. That is, easier to achieve. So without any information, a blind decision is made and all of the boxes in the first room are moved to the second room and the goal of getting all of the boxes in one room is achieved. However, it is unclear if the goal was achieved in the most efficient manner.
Now using the same initial scenario, some information about the number of boxes in each room is known; the first room has ten boxes and the second room has one. Using this information, the most efficient solution would be to move the one box from the second room to the first room.
Now with ten boxes in the first room and one box in the second room, it is determined that each of the ten boxes in the first room weighs one pound and the one box in the second room weighs five-hundred pounds. With this additional information, a more efficient solution is to move the ten one pound boxes from the first room to the second room.
Now in this scenario, additional information arises that there is a hand truck available in the second room. Using the hand truck, the five-hundred pound box is not too difficult to move and therefore the more efficient solution is to use the hand truck and move the one box from the second room to the first room.
These scenarios demonstrate, by way of analogy, that even with only two options, as more information is provided about the environment, the cost associated with selecting one room of boxes to be moved over the boxes in the other room changed and thus one of the options outweighed the other in efficiency. Gaining information about the number of boxes in a room and the weight of the boxes is statistics collection. Statistics collection provides information to optimize the task at hand. Providing information helps in selecting the most efficient optimization strategy, but misinformation and/or partial information may often lead to the selection of an inefficient solution.
Thus, there exists a need for a system and method that enables collection of statistics associated with a database, and identifies situations where only partial information about the database is obtained.