The present invention relates generally to data compression and decompression techniques, and more particularly to methods and systems for data compression and decompression which afford high coding and storage efficiency and fast retrieval of data from very large data bases while preserving full information content.
Data processing applications which utilize very large data bases pose a number of problems. This is particularly true of real time data processing applications which require multiple accesses to very large data bases and fast data retrieval. Such applications generally have been hampered by the slow data access time of most mass data storage media. Although the storage capacity of mass data storage devices such as magnetic tape, hard magnetic disks and optical disks has advanced significantly, the slow data access time of such devices has remained a limiting factor on the speed of real time data processing applications which require such devices for storing large data bases.
An example of a real time data processing application which requires multiple accesses to a very large data base and fast data retrieval is the automatic sorting of mail. To process efficiently the huge volume of mail handled by the U.S. Postal Service, automated sorting systems have been developed, and are under development. Multiline optical character readers exist which are capable of recognizing and reading the address from a piece of mail. The read address may then be used to control an automatic sorting machine which sorts the mail according to its destination. The principal component of the address employed for sorting is the ZIP Code. There are approximately 42,000 postal zones in the United States, each of which has been assigned a unique five digit ZIP Code. After the ZIP Code is read, ink jet printers may be used for printing a corresponding bar code on the mailpiece. Bar code scanners then read the bar code, and the mailpiece is sorted accordingly.
To facilitate the handling of mail, it is desirable to sort the mail automatically down to a much lower level than the postal zone level. To accomplish this, the U.S. Postal Service has developed a system of four digit add-ons which subdivide a postal zone into much smaller areas. This produces a nine digit ZIP+4 Code of the form 123456789, where "12345" is the five digit ZIP Code, and "6789" is the four digit add-on. The first two digits of the add-on, "67", are referred to as the sector, and the last two digits, "89", are referred to as the segment.
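The structure of the nine digit ZIP+4 Code described above can be illustrated with a minimal sketch. The names `ZipPlus4` and `split_zip_plus4` are hypothetical and chosen only for illustration; they are not part of any postal system or of the invention itself, and the sketch assumes the code is supplied as a string of exactly nine digits with no separator.

```python
from typing import NamedTuple


class ZipPlus4(NamedTuple):
    """Hypothetical container for the three fields of a ZIP+4 Code."""
    zip_code: str  # five digit ZIP Code identifying the postal zone
    sector: str    # first two digits of the four digit add-on
    segment: str   # last two digits of the four digit add-on


def split_zip_plus4(code: str) -> ZipPlus4:
    """Split a nine digit ZIP+4 Code into ZIP Code, sector, and segment.

    Assumes the unseparated form (e.g. "123456789") used in the text.
    """
    if len(code) != 9 or not (code.isascii() and code.isdigit()):
        raise ValueError("expected exactly nine decimal digits")
    return ZipPlus4(zip_code=code[:5], sector=code[5:7], segment=code[7:])


parts = split_zip_plus4("123456789")
print(parts.zip_code, parts.sector, parts.segment)
```

Keeping the fields as strings rather than integers preserves leading zeros, which occur in real ZIP Codes.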
Every address in the United States has been assigned a four digit add-on, and the assignments have been published in a ZIP+4 National Directory comprising some 47 volumes, each about the size of a medium-sized telephone directory. An add-on may designate addresses within one or more blocks of a street, even or odd-numbered addresses on one side of a street within a block, a particular address within a block such as an office or apartment building, a particular floor of a building, or even a firm within the building. Add-ons have also been assigned to rural route, post office box, general delivery, and postmaster addresses. The objective of the system of add-ons is to enable mail to be sorted automatically to a low level. The use of ZIP+4 Codes by mailers is voluntary, and most mail does not have the add-on. Accordingly, automatic sorting requires a data processing operation to retrieve the appropriate add-on for bar coding onto the mailpiece along with the ZIP Code.
The ZIP+4 National Directory File comprises around 25 million records and requires a storage capacity of the order of 1,300 megabytes (MB) to store the source data necessary for retrieving add-ons for postal addresses. A mass data storage device is required to store a data base of this size. Unfortunately, the data access time of such devices is too slow to permit their use with current optical character readers, which are capable of processing twelve pieces of mail each second, especially since multiple accesses are needed. A much faster data storage medium is necessary to support the real time on-line retrieval of add-ons. Although semiconductor memory is fast enough for such real time processing, it is impractical to use semiconductor memory to store a data base of this size.
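The figures above imply a rough per-record size and a per-mailpiece time budget, which the following sketch works out. It assumes 1 MB is taken as 10^6 bytes and treats the approximate figures from the text as exact. The number of data base accesses per mailpiece is not stated in the text, so the value passed to the hypothetical helper `max_access_time_ms` is purely an illustrative assumption.

```python
# Approximate figures taken from the text, treated as exact for illustration.
RECORDS = 25_000_000              # "around 25 million records"
DIRECTORY_BYTES = 1_300 * 10**6   # "of the order of 1,300 MB"; assumes 1 MB = 10^6 bytes
PIECES_PER_SECOND = 12            # optical character reader throughput

# Average source data per record.
bytes_per_record = DIRECTORY_BYTES / RECORDS

# Total time available to handle one mailpiece at full reader speed.
time_budget_ms = 1000 / PIECES_PER_SECOND


def max_access_time_ms(accesses_per_piece: int) -> float:
    """Upper bound on the time per access if the whole budget went to lookups.

    accesses_per_piece is a hypothetical parameter; the text says only that
    multiple accesses are needed.
    """
    return time_budget_ms / accesses_per_piece


print(f"{bytes_per_record:.0f} bytes per record")
print(f"{time_budget_ms:.1f} ms per mailpiece")
print(f"{max_access_time_ms(4):.1f} ms per access, assuming 4 accesses")
```

Under these assumptions each record averages about 52 bytes, and each mailpiece must be handled in roughly 83 ms, a budget that is divided further among however many accesses a lookup requires.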
Even where fast access time is not a necessity, the size of the ZIP+4 National Directory data base presents other problems. It is advantageous, for example, for commercial mailers to use ZIP+4 add-ons, and the U.S. Postal Service has made the ZIP+4 National Directory available on optical disk. Although optical disks and readers are relatively inexpensive, the data access time for optical disks is far too slow even for most commercial mailers. Hard disks have data access times which are about 300 times faster than optical disks. However, a hard disk for storing a data base the size of the ZIP+4 National Directory is relatively large and expensive, and requires a somewhat sophisticated computer system, which may be more than a small business user can afford.
The problems of size and access time can both be addressed by reducing the size of the ZIP+4 National Directory data base. A sufficiently reduced data base can be stored in a data storage medium having an access time fast enough to support real time processing operations, such as the automatic sorting of mail, and can be stored in a less sophisticated and less expensive system so that it is more available to small users. It is to these ends that the present invention is directed.