1. Field of the Present Invention
The present invention relates to compression techniques for dictionary databases.
2. Background of the Prior Art
While a great deal of work has been done in the area of text compression in general, relatively little work has been done on dictionary compression in particular. Many of the more general techniques used in text compression can be applied to dictionaries. Dictionaries, however, have certain properties which allow the use of compression methods that have no general use in text compression. Examples of properties peculiar to dictionaries are that inflections are similar to their root words and that entry words are stored alphabetically.
Two references in the literature regarding compression are noted:
[1] D. B. Conyes and J. M. Gilly, "Reversible Dictionary Compression by Reference to Redundant Components of Vocabulary", IBM Technical Disclosure Bulletin, pp. 4133-4135, January 1982; and
[2] James A. Storer, "Data Compression: Methods and Theory", Rockville, Md., Computer Science Press, 1988.
For a useful dictionary product, any compression technique must also address the problem of retrievability. It is not sufficient to simply compress the data as much as possible; the individual articles that make up a dictionary (an article being defined as the main entry, its associated inflections, parts of speech, definitions and other related information) must also be readily accessible.
Previous dictionary compression techniques take advantage of the redundancies found in dictionaries by forming rules. For example, a rule is applied where a plural is formed from the root word by adding an "s". The rule might also indicate that the part of speech for the root and its plural is a noun.
Several problems exist with this known method. Firstly, creating an exhaustive set of rules is tedious and is necessarily specific to the data. Given a new dictionary database, one may have to modify the rules. In addition, "unrules" must also be written to allow retrieval of the root. "Unrules", however, can give false entries. For example, an "unrule" might say that "apartments" inflects to the root word "apart", or that "digest" is the superlative form of "dig". Such cases must be handled as exceptions, and these exceptions will adversely affect the compression.
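The false entries described above can be illustrated with a minimal sketch. The suffix list, the word list and the function name below are hypothetical and are chosen only to reproduce the two examples given; they are not taken from any cited system.

```python
# Hypothetical "unrules": each pair is (suffix to strip, text to put back).
# These are illustrative assumptions, not the rules of any actual product.
UNRULES = [
    ("ments", ""),  # intended for cases such as "arrangements" -> "arrange"
    ("est", ""),    # intended for superlatives such as "tallest" -> "tall"
    ("s", ""),      # intended for plain plurals such as "books" -> "book"
]

# Hypothetical set of valid entry words.
ENTRY_WORDS = {"apart", "apartment", "dig", "digest"}


def candidate_roots(word, unrules=UNRULES, entry_words=ENTRY_WORDS):
    """Return every root proposed by an unrule that is also a valid entry."""
    roots = []
    for suffix, replacement in unrules:
        # Apply the unrule only if the word ends with the suffix and
        # something remains once the suffix has been stripped.
        if word.endswith(suffix) and len(word) > len(suffix):
            root = word[: -len(suffix)] + replacement
            if root in entry_words:
                roots.append(root)
    return roots


print(candidate_roots("apartments"))  # includes the false root "apart"
print(candidate_roots("digest"))      # proposes the false root "dig"
```

Checking each proposed root against the list of entry words does not remove the false entries, because "apart" and "dig" are themselves valid entries. This is why such cases must be recorded as explicit exceptions.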
Further, many words take on more than one part of speech. These parts of speech may take the form of separate articles called homographs. The complication that arises is that words may inflect to an invalid entry. For example, "recorded", the past tense of the verb "record", does not inflect to the noun "record". Accounting for all of these kinds of exceptions is a very difficult task.
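The homograph complication may be sketched as follows. The lexicon layout, the single past-tense unrule and the function name are assumptions made for illustration only.

```python
# Hypothetical lexicon: each entry word maps to the parts of speech of its
# homographs. The layout is an assumption made for this sketch.
LEXICON = {"record": {"noun", "verb"}}

# Hypothetical past-tense unrule: strip "ed"; the root must be a verb.
PAST_TENSE_SUFFIX = "ed"
PAST_TENSE_POS = "verb"


def homograph_matches(word, check_pos, lexicon=LEXICON):
    """Return (root, part of speech) pairs that the unrule maps the word to.

    With check_pos=False every homograph of the root is returned, which
    includes invalid entries. With check_pos=True only homographs whose
    part of speech agrees with the unrule are returned.
    """
    if not word.endswith(PAST_TENSE_SUFFIX):
        return []
    root = word[: -len(PAST_TENSE_SUFFIX)]
    matches = []
    # Sort so that the order of the result does not depend on set ordering.
    for pos in sorted(lexicon.get(root, ())):
        if not check_pos or pos == PAST_TENSE_POS:
            matches.append((root, pos))
    return matches


print(homograph_matches("recorded", check_pos=False))
print(homograph_matches("recorded", check_pos=True))
```

Without the part-of-speech check, "recorded" is mapped to both the noun and the verb article; with the check, only the verb article remains. Each rule therefore has to carry part-of-speech constraints in addition to its spelling change, which adds to the difficulty noted above.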
A principal application for dictionary database compression is in portable electronic reference products. Such products typically store the database information on read-only memory (ROM) chips. These chips represent a significant portion of the cost of electronic reference products. If the storage medium is a hard disk, the cost of data storage may likewise be significant.
A known arrangement for electronic compression is set forth in FIG. 2. That arrangement 10 includes a microprocessor 11, a read-only memory (ROM) 12, a random access memory (RAM) 13, a display 16, a display controller 17, a keyboard 14 and a keyboard controller 15. The ROM 12 has stored therein the dictionary database and microprocessor instructions. The RAM 13 functions as a scratch pad. Control, data and address buses 18, 19 and 20 interconnect the various portions. The data bus is used to transfer data. The address bus is used by the microprocessor to select a particular byte of data. The control bus is used by the microprocessor to select a particular device so that the devices can share the same address and data buses. This figure is intended to represent one form of apparatus which may be used with the present invention and is not intended to restrict application of the invention in any way.