One of the primary applications of computers is operating databases. Computer databases are collections of data that are stored in a computer memory and can be accessed through the computer. Many databases, such as the phone number list in a cell phone, are fairly small. Some databases, such as the stored tax records of every taxpayer in the United States, are exceptionally large. Exceptionally large databases are expensive to store and maintain. Accessing data in an exceptionally large database can be slow, expensive, or both because the desired data must be found and accessed amid a very large volume of other data.
Most databases are centered around the concept of tables. Tables are made up of rows, and each row is made up of columns of data. For example, the taxpayer database could have a row for every individual taxpayer. The columns in each row hold specific data. The first column can be the taxpayer's identification number, the second column can be the taxpayer's surname, and the third can be the taxpayer's forename.
An index can be used to help find rows in a table. The index associates index keys with table rows. The taxpayer's identification number is a good index key for the taxpayer database because it is unique: no two taxpayers are supposed to have the same taxpayer identification number. A combination of the surname and the forename can also be used as an index key. Such an index key, however, can match multiple rows in the table. Multiple indexes can be used to map different index keys to table rows. The index, or indexes if more than one index is maintained, is usually updated when the table is changed.
A typical chain of events that occurs when accessing a database starts with obtaining an index key. An index is used to obtain a list of row identifiers matching the index key. If there are no matching row identifiers, then the list is empty. Each row identifier obtained from the index can be any value or set of values that can be used to access a row in the table. Examples are a row index, which is the row number, or a row pointer, which is the row's location in the computer memory. The row data is then accessed by transferring it from wherever it is stored to where it can be used.
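The chain of events described above can be sketched in a few lines of Python. This is only an illustrative sketch: the sample rows, the identifier format, and the names `build_index` and `lookup` are hypothetical and are not taken from any particular database system. Row numbers serve as the row identifiers.

```python
from collections import defaultdict

# Hypothetical taxpayer table: each row is (taxpayer_id, surname, forename).
table = [
    ("100-00-0001", "Smith", "Alice"),
    ("100-00-0002", "Jones", "Robert"),
    ("100-00-0003", "Smith", "Alice"),
]

def build_index(rows, key_of):
    """Associate each index key with the row numbers that hold that key."""
    index = defaultdict(list)
    for row_number, row in enumerate(rows):
        index[key_of(row)].append(row_number)
    return index

def lookup(rows, index, key):
    """Return the rows matching key; an unknown key yields an empty list."""
    row_numbers = index.get(key, [])  # list of row identifiers, possibly empty
    return [rows[n] for n in row_numbers]

# A unique index key (identification number) and a non-unique one (full name).
id_index = build_index(table, lambda row: row[0])
name_index = build_index(table, lambda row: (row[1], row[2]))

print(lookup(table, id_index, "100-00-0002"))
print(lookup(table, name_index, ("Smith", "Alice")))
print(lookup(table, id_index, "no-such-id"))
```

The identification number index returns at most one row per key, while the name index can return several rows, which mirrors the distinction between unique and non-unique index keys.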
The rows of very large tables are often stored in large arrays of storage devices such as computer disk drives, computer tape libraries, read-only memory (ROM) disk libraries, or a combination of different storage devices. Compact Discs (CDs) and Digital Video Discs (DVDs) are examples of ROM disks. The disks and tapes are also known as physical volumes. To access a row, the physical volume, or volumes, holding the row must be accessed.
Database owners trade access speed for expense. A row in a table can be accessed quickly if the physical volume can be accessed quickly. Some technologies, such as a computer's solid-state memory, can be accessed extremely quickly but are far more expensive than other types of memory. Other technologies, such as a computer's hard disk drive, are less expensive than solid-state memory but slower. Computer tape is very inexpensive and very slow.
Some exceptionally large tables are stored in computer tape libraries. An index key is used to find a row and a physical volume. The physical volume, a computer tape, is located and loaded into a computer tape drive. The computer tape then streams through the computer tape drive until the desired row is reached. The row data is then copied to a faster memory type such as a computer disk drive or solid-state memory. As can be imagined, this exceptionally large table is exceptionally slow to access.
Data compression is a technology for fitting data into a smaller amount of physical memory. The key to data compression is that there is a difference between information and data. A series of a billion zeros is a lot of data, but has little information. Depending on formatting, a single page can carry about 5,000 legible characters. At that density, a series of a billion zeros consumes 200,000 pages. The phrase “a series of a billion zeros” takes 6 words for a total of 22 characters, not counting spaces. As such, 200,000 pages of data can be compressed into 22 characters of information.
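The page arithmetic and the idea of replacing a long run with a short description can be illustrated with a minimal run-length encoding sketch. The function names `rle_encode` and `rle_decode` are hypothetical, and a run of one hundred thousand zeros stands in for the billion zeros so that the example runs quickly.

```python
# Page arithmetic from the text: a billion characters at 5,000 per page.
pages = 1_000_000_000 // 5_000  # 200000 pages

def rle_encode(data):
    """Collapse each run of identical characters into a (character, count) pair."""
    runs = []
    for ch in data:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((ch, 1))              # start a new run
    return runs

def rle_decode(runs):
    """Rebuild the original string from (character, count) pairs."""
    return "".join(ch * count for ch, count in runs)

zeros = "0" * 100_000          # stand-in for the series of a billion zeros
encoded = rle_encode(zeros)    # a single (character, count) pair
assert rle_decode(encoded) == zeros
print(pages, encoded)
```

The encoded form holds one pair regardless of how long the run is, which is the sense in which a large amount of data can carry little information.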
Those skilled in the art of data compression know many data compression algorithms including zip, LZW, RLE, differential coding, GIF, JPEG, and many others. Every algorithm has different properties and different algorithms work better on different types of data.
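One way to see how strongly results depend on the type of data is to apply a single general-purpose compressor to two very different inputs. The sketch below uses the DEFLATE implementation in Python's standard `zlib` module; the input size, the fixed random seed, and the helper name `compression_ratio` are arbitrary choices made for illustration.

```python
import random
import zlib

SIZE = 100_000  # bytes per sample; an arbitrary illustrative size

# Highly repetitive data: a long run of the same byte.
repetitive = b"0" * SIZE

# Pseudo-random data with a fixed seed so the example is repeatable.
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(SIZE))

def compression_ratio(data):
    """Compressed size divided by original size; smaller is better."""
    return len(zlib.compress(data)) / len(data)

print("repetitive:", compression_ratio(repetitive))
print("noise:     ", compression_ratio(noise))
```

The repetitive input shrinks to a tiny fraction of its original size, while the pseudo-random input does not shrink in any meaningful way, so the choice of algorithm has to be matched to the data being stored.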
Table rows can be compressed to consume less physical memory. Compressing all the rows in a table can result in fewer physical volumes storing the table. It can also make a faster, more expensive memory type feasible for storing the table. Another advantage of compressed data is that it can be transferred faster.
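The effect of compression on the number of physical volumes can be sketched as follows. The volume capacity, the row contents, and the helper name `volumes_needed` are hypothetical values chosen only to keep the example small, and `zlib` stands in for whatever compression algorithm is actually used. Whether rows are compressed individually or in groups is a design choice the text does not specify; this sketch compresses them together.

```python
import math
import zlib

VOLUME_CAPACITY = 4_096  # bytes per physical volume; hypothetical small figure

# Hypothetical table rows serialized as comma-separated text.
rows = [f"{i:09d},Smith,Alice,Springfield".encode() for i in range(1000)]

def volumes_needed(size_in_bytes, capacity=VOLUME_CAPACITY):
    """Number of physical volumes required to hold size_in_bytes of data."""
    return math.ceil(size_in_bytes / capacity)

raw = b"\n".join(rows)       # the uncompressed table
packed = zlib.compress(raw)  # all rows compressed together

# Decompression must recover the table exactly.
assert zlib.decompress(packed) == raw

print("uncompressed volumes:", volumes_needed(len(raw)))
print("compressed volumes:  ", volumes_needed(len(packed)))
```

Because the compressed table occupies fewer bytes, it spans fewer volumes, and there are fewer bytes to transfer when rows are read back.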
Any improvement in table data compression over current technology enables faster access without increasing costs or enables similar performance at a lower price.
Based on the foregoing, it can be appreciated that, in order to overcome the shortcomings of the current methods and systems, a need exists for an improved method and system for compressing the data in a database table.