A database is a collection of stored data that is logically related and that is accessible by one or more users. A popular type of database is the relational database management system (RDBMS), which includes relational tables made up of rows and columns (also referred to as tuples and attributes). Each row represents an occurrence of an entity defined by a table, with an entity being a person, place, thing, or other object about which the table contains information.
To extract data from, or to update, a relational table in an RDBMS, queries according to a standard database-query language (e.g., Structured Query Language or SQL) are used. Examples of SQL statements include INSERT, SELECT, UPDATE, and DELETE.
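As an illustration of the four statement types named above, the following minimal sketch runs them against an in-memory SQLite database through Python's standard sqlite3 module. The employee table, its columns, and the function name are hypothetical and chosen only for this example; they are not part of the described system.

```python
import sqlite3


def run_basic_statements():
    """Run INSERT, UPDATE, DELETE, and SELECT against a hypothetical table."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    # Hypothetical relational table: each row is one employee entity.
    cur.execute(
        "CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)"
    )
    # INSERT adds rows (tuples) to the table.
    cur.execute("INSERT INTO employee VALUES (?, ?, ?)", (1, "Ada", "R&D"))
    cur.execute("INSERT INTO employee VALUES (?, ?, ?)", (2, "Bob", "Sales"))
    # UPDATE changes a column (attribute) value in matching rows.
    cur.execute("UPDATE employee SET dept = ? WHERE id = ?", ("Engineering", 1))
    # DELETE removes matching rows.
    cur.execute("DELETE FROM employee WHERE id = ?", (2,))
    # SELECT extracts the remaining rows.
    rows = cur.execute(
        "SELECT id, name, dept FROM employee ORDER BY id"
    ).fetchall()
    conn.close()
    return rows
```

Calling run_basic_statements() returns the single row that remains after the update and the delete.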
As applications become increasingly sophisticated, and data storage needs become greater, higher performance database systems are used. One such database system is the TERADATA® database management system from NCR Corporation. The TERADATA® database systems are parallel processing systems capable of handling relatively large amounts of data. In some arrangements, a database system includes multiple nodes that manage access to multiple portions of data to enhance concurrent processing of data access and updates. In TERADATA® database management systems, concurrent data processing is further enhanced by the use of virtual processors, referred to as access module processors (AMPs), to further divide database tasks. Each AMP is responsible for a logical disk space. In response to a query, one or more of the AMPs are invoked to perform database access, updates, and other manipulations.
One of the goals of a database management system is to optimize the performance of queries for access and manipulation of data stored in the database. Given a target environment, an optimal query plan is selected, the optimal query plan being the one with the lowest cost (e.g., response time) as determined by an optimizer in the database system. The response time is the amount of time it takes to complete the execution of a query on a given system.
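The selection rule described above (choose the candidate plan with the lowest estimated cost) can be sketched as follows. The Plan type, its field names, and the cost figures used in any call are hypothetical; how an actual optimizer enumerates plans and estimates their cost is not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """Hypothetical candidate query plan with an estimated cost."""
    name: str
    estimated_cost: float  # e.g., estimated response time in seconds


def choose_plan(plans):
    """Return the candidate plan with the lowest estimated cost."""
    if not plans:
        raise ValueError("at least one candidate plan is required")
    return min(plans, key=lambda plan: plan.estimated_cost)
```

For example, given candidates with estimated costs of 12.0, 3.5, and 7.25, choose_plan returns the candidate whose cost is 3.5.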
The optimizer calculates cost based on statistics of one or more columns (or attributes) of each table. Statistics enable the optimizer to compute various useful metrics. Typically, statistics are stored in the form of a histogram.
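To make the role of a histogram concrete, the sketch below builds a simple equal-width histogram over the values of one column and uses it to estimate how many rows satisfy a predicate of the form "column < x". The equal-width layout, the function names, and the assumption that values are spread uniformly within each bucket are choices made for this illustration only; the histogram format used by any particular database system may differ.

```python
def build_histogram(values, num_buckets):
    """Build an equal-width histogram: (lowest value, bucket width, counts)."""
    low, high = min(values), max(values)
    width = (high - low) / num_buckets
    if width == 0:
        width = 1.0  # all values identical; avoid division by zero
    counts = [0] * num_buckets
    for value in values:
        # Clamp so the maximum value falls in the last bucket.
        index = min(int((value - low) / width), num_buckets - 1)
        counts[index] += 1
    return low, width, counts


def estimate_rows_below(histogram, x):
    """Estimate the number of rows whose value is less than x."""
    low, width, counts = histogram
    total = 0.0
    for i, count in enumerate(counts):
        bucket_low = low + i * width
        bucket_high = bucket_low + width
        if x >= bucket_high:
            total += count  # whole bucket qualifies
        elif x > bucket_low:
            # Assume values are uniformly spread within the bucket.
            total += count * (x - bucket_low) / width
    return total
```

An optimizer could use an estimate of this kind as the expected number of qualifying rows when costing a plan, without scanning the table itself.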
In database systems that store large tables, the cost of collecting statistics for such large tables can be quite high. As a result, some database users may choose not to collect statistics for columns of tables over a certain size. The lack of statistics for some tables may adversely affect operation of certain components in the database system, such as the optimizer and other tools.
In general, a mechanism for faster collection of statistics in a database system is provided. For example, a method for use in a database system comprises receiving a request to collect statistics of at least an attribute of a table, and collecting statistics for the attribute based on reading a sample of rows of the table, the sample being less than all the rows of the table.
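The general idea of collecting statistics from a sample can be sketched as follows. This is a minimal illustration under stated assumptions, not the described method itself: the table is modeled as a list of dictionaries, rows are drawn by uniform random sampling, and sample counts are scaled linearly to the full table. The function name, the sample_fraction parameter, and the returned fields are all hypothetical.

```python
import random


def collect_sampled_statistics(rows, column, sample_fraction, seed=0):
    """Collect statistics for one column by reading only a sample of rows."""
    if not 0 < sample_fraction < 1:
        raise ValueError("sample_fraction must be between 0 and 1, exclusive")
    sample_size = max(1, int(len(rows) * sample_fraction))
    # Assumption: uniform random sampling without replacement.
    sample = random.Random(seed).sample(rows, sample_size)
    values = [row[column] for row in sample if row[column] is not None]
    # Assumption: sample counts scale linearly to the whole table.
    scale = len(rows) / sample_size
    return {
        "rows_read": sample_size,
        "estimated_null_rows": (sample_size - len(values)) * scale,
        "min": min(values) if values else None,
        "max": max(values) if values else None,
        "distinct_in_sample": len(set(values)),
    }
```

With a table of 1,000 rows and a sample_fraction of 0.1, only 100 rows are read, which is where the reduction in collection cost comes from. The resulting statistics are estimates, so the minimum, maximum, and distinct count may differ from those of a full scan.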
Other or alternative features will become apparent from the following description, the drawings, and the claims.