The present invention relates generally to data compression techniques, and more specifically to a text compression technique that is suitable for compressing text databases having a large number of different words.
Computer users are coming to expect large text databases at their fingertips, and this trend will continue as portable and hand-held computerized devices become more common. Compression ratio is a major issue, since many of the reference works that are candidates for being made available are rather long. A separate issue is whether the text has to be decompressed in order to search for particular strings. Many techniques are efficient in terms of compression ratio, but are context sensitive (i.e., searching the text requires decompression, and decompressing one part requires decompressing other, unrelated parts). This extra decompression step may slow down the search process to an unacceptable degree, thereby necessitating the use of faster (and more expensive) hardware components.
One technique that has shown considerable promise, and that has found its way into a number of commercial products in the last few years, uses a data structure referred to as a word-number mapper ("WNM"). The WNM can be regarded conceptually as a table in which each distinct word or punctuation mark in the database, referred to as a token, has an associated unique number. The table thus provides a unique bidirectional word-number mapping: a token can be looked up to obtain its number, and a number can be looked up to obtain its token. The actual WNM is implemented as a highly compressed data structure.
Compression of the database occurs generally as follows. In a first pass, the text is parsed into a sequence of alternating word and punctuation mark tokens, which are accumulated to build up the table. The table includes only one entry for each token type, no matter how often that type occurs in the text database. In a second pass, the text is again parsed into tokens, and each token is encoded by its unique number in the table. In one regime, most of the tokens are encoded as a two-byte number, but some are encoded as a single byte. By way of specific example, the Bible was found to have approximately 13,000 different token types, while the Physicians' Desk Reference was found to have approximately 25,000 different token types.
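By way of illustration only, the two passes described above can be sketched in Python as follows. The tokenizer pattern, the function names, and the choice to fold whitespace into the punctuation tokens are assumptions made for this sketch and are not taken from any actual WNM product; a plain dictionary and list stand in for the highly compressed WNM structure.

```python
import re

# Assumed tokenizer: a token is either a run of word characters or a run of
# everything else (punctuation and whitespace), so the tokens alternate.
TOKEN_RE = re.compile(r"[A-Za-z0-9']+|[^A-Za-z0-9']+")


def tokenize(text):
    """Split text into alternating word and punctuation tokens."""
    return TOKEN_RE.findall(text)


def build_table(text):
    """First pass: assign one number per token type, in order of first use."""
    token_to_num = {}
    for tok in tokenize(text):
        if tok not in token_to_num:
            token_to_num[tok] = len(token_to_num)
    # Dicts preserve insertion order, so position in this list equals the
    # number assigned above; the pair gives the bidirectional mapping.
    num_to_token = list(token_to_num)
    return token_to_num, num_to_token


def encode(text, token_to_num):
    """Second pass: replace every token with its number from the table."""
    return [token_to_num[tok] for tok in tokenize(text)]


if __name__ == "__main__":
    sample = "In the beginning, God created the heaven and the earth."
    to_num, to_tok = build_table(sample)
    codes = encode(sample, to_num)
    print(len(to_tok), "token types;", len(codes), "tokens")
    print(codes)
```

Note that the table grows with the number of token types rather than with the length of the text, which is why a repeated token such as "the" costs only one table entry however often it occurs.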
Decompression is the inverse of compression. The numerical codes in the compressed string are converted to the corresponding uncompressed tokens using the same WNM that was used to compress the text in the first place. Account must be taken of the manner in which the string of numerical codes was stored during compression (for example, whether a given code occupies one byte or two).
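The dependence of decompression on the storage format can be illustrated with a minimal Python sketch. The byte layout used here is purely hypothetical, since the text above does not specify one: numbers below 128 are assumed to be stored in a single byte, and larger numbers in two bytes whose first byte has its high bit set. The function names and the small example table are likewise assumptions.

```python
ONE_BYTE_LIMIT = 0x80   # assumed: numbers 0..127 are stored in one byte
TWO_BYTE_SPAN = 0x8000  # assumed: 15 payload bits for two-byte codes


def pack(codes):
    """Store a sequence of token numbers using the assumed 1- or 2-byte layout."""
    out = bytearray()
    for n in codes:
        if 0 <= n < ONE_BYTE_LIMIT:
            out.append(n)
        else:
            m = n - ONE_BYTE_LIMIT
            if not 0 <= m < TWO_BYTE_SPAN:
                raise ValueError(f"token number {n} out of range")
            out.append(0x80 | (m >> 8))  # high bit marks a two-byte code
            out.append(m & 0xFF)
    return bytes(out)


def unpack(data):
    """Recover the token numbers; must mirror the layout used by pack()."""
    codes = []
    i = 0
    while i < len(data):
        first = data[i]
        if first < 0x80:
            codes.append(first)
            i += 1
        else:
            m = ((first & 0x7F) << 8) | data[i + 1]
            codes.append(m + ONE_BYTE_LIMIT)
            i += 2
    return codes


def decompress(data, num_to_token):
    """Map each stored number back to its token and rejoin the text."""
    return "".join(num_to_token[n] for n in unpack(data))


if __name__ == "__main__":
    table = ["the", " ", "earth", "."]  # hypothetical number-to-token table
    stored = pack([0, 1, 2, 3])
    print(decompress(stored, table))
```

Under this assumed layout the decoder can tell the length of each code from its first byte alone, so decoding can proceed sequentially from the start of the stored string without reference to other parts of the text.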