A database is a collection of stored data that is logically related and that is accessible by one or more users or applications. A popular type of database is the relational database management system (RDBMS), which includes relational tables, also referred to as relations, made up of rows and columns (also referred to as tuples and attributes). Each row represents an occurrence of an entity defined by a table, with an entity being a person, place, thing, or other object about which the table contains information.
One of the goals of a database management system is to optimize the performance of queries for access and manipulation of data stored in the database. Given a target environment, an optimal query plan is selected, with the optimal query plan being the one with the lowest cost (e.g., response time, CPU processing, I/O processing, network processing) as determined by an optimizer. The response time is the amount of time it takes to complete the execution of a query on a given system. In this context, a “workload” is a set of requests, which may include queries or utilities such as loads, that have some common characteristics, such as application, source of request, type of query, priority, response time goals, etc.
Classification is a powerful database analytics tool that gives businesses insight into their trends and behaviors. In general, discretization is a technique that partitions continuous numeric data into intervals and transforms the numeric data into discrete or nominal data. Because this transformation significantly reduces the cardinality of the data, it is often used in data preparation for classification methods such as Decision Tree and Naïve Bayes. It has been widely reported in the literature that training a classification model on discrete data can be orders of magnitude faster than training it on the original numeric data. Discretization may also improve classification accuracy and make the classification rules extracted from the model easier to understand.
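As a minimal illustration of the transformation described above, the following Python sketch maps continuous values to interval indices using a sorted list of boundaries. The function name `discretize`, the sample `ages` values, and the `cuts` boundaries are hypothetical and chosen only for illustration; they do not come from any particular DBMS.

```python
from bisect import bisect_right

def discretize(values, boundaries):
    """Map each numeric value to the index of the interval it falls in.

    `boundaries` must be sorted in ascending order; k boundaries define
    k + 1 intervals. A value equal to a boundary is placed in the
    interval to the right of that boundary.
    """
    return [bisect_right(boundaries, v) for v in values]

# Hypothetical continuous attribute and hypothetical interval boundaries.
ages = [18.5, 22.0, 37.4, 41.9, 64.2, 70.1]
cuts = [30.0, 60.0]

codes = discretize(ages, cuts)

# Six distinct numeric values collapse to three discrete codes.
print(codes)
print(len(set(ages)), "distinct values ->", len(set(codes)), "distinct codes")
```

In this sketch the six distinct input values are reduced to three interval codes, which is the cardinality reduction that makes discretized data cheaper to train on.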
Discretization methods fall into two groups: unsupervised and supervised. DBMS vendors generally implement only unsupervised algorithms, such as equal-width intervals, equal-frequency intervals, and/or their variations, e.g., bin coding. Unfortunately, bin coding cannot find optimal boundaries automatically because it does not use the label information to merge or split intervals. In practice, bin coding relies on a human-specified parameter, such as interval width, or requires manual interval adjustment to determine interval boundaries. This typically presents problems when processing large, complex data. The appropriate width varies from interval to interval, so a fixed interval width is inappropriate. In addition, intervals may have multiple mode frequencies. Consequently, it is often quite difficult for a human to identify an optimized parameter or interval adjustment.
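The two unsupervised schemes named above can be sketched in a few lines of Python. This is an illustrative sketch only, not any vendor's implementation: the function names, the `num_bins` parameter, and the sample `data` are assumptions. Note that both functions require a human-specified `num_bins` and never consult class labels.

```python
def equal_width_boundaries(values, num_bins):
    """Return the num_bins - 1 interior boundaries of equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    return [lo + i * width for i in range(1, num_bins)]

def equal_frequency_boundaries(values, num_bins):
    """Return interior boundaries so each interval holds about the same
    number of values (a value equal to a boundary starts the next interval)."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // num_bins] for i in range(1, num_bins)]

# Hypothetical skewed data: eight small values and one large outlier.
data = [1, 2, 3, 4, 5, 6, 7, 8, 100]

ew = equal_width_boundaries(data, 3)      # boundaries at 34.0 and 67.0
ef = equal_frequency_boundaries(data, 3)  # boundaries at 4 and 7

print("equal-width:", ew)
print("equal-frequency:", ef)
```

With this skewed sample, the equal-width boundaries place eight of the nine values in the first interval and leave the middle interval empty, while the equal-frequency boundaries put three values in each interval. Neither result reflects class labels, and both change with the chosen `num_bins`, which is the dependence on manual tuning described above.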