The present invention is related to data structure compression. More specifically, the present invention relates to the compression of linguistic data structures for natural language translation systems.
Natural language translation systems process and manage thousands of words, phrases, and sentences. To process and manage such vast amounts of data, linguistic data structures are used. These linguistic data structures not only store words, phrases, and sentences, but may also store associated qualifiers in order to process and manage the data more efficiently. Consequently, the large number and large size of linguistic data structures require large amounts of memory. Thus, a goal of natural language translation systems is to reduce the amount of memory for storing the linguistic data structures during language translation.
Conventional data compression techniques, however, are not well suited for compressing linguistic data structures for natural language translation systems. For example, a common data compression technique is the Lempel-Ziv (LZ) method. The LZ method automatically replaces recurring substrings in plain text with references to the substrings according to a longest-match algorithm. Although the LZ method provides a comparatively high compression ratio, it is not well suited for natural language translation systems, which require fast and random access to any part of the compressed text in order to perform language translation. With the LZ method, the text must be entirely decompressed before any part of it can be accessed rapidly and randomly, which results in a performance penalty. The LZ method is therefore not suitable for compressing linguistic data structures.
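The random-access penalty described above can be illustrated with a minimal sketch. It uses Python's standard zlib module, whose DEFLATE format combines LZ77-style substring references with Huffman coding, as a stand-in for the LZ method; the sample text and the helper name read_slice are assumptions made for illustration only.

```python
import zlib

# Highly repetitive sample text (illustrative only), which LZ-style
# compression shrinks well.
text = ("the quick brown fox jumps over the lazy dog. " * 200).encode("utf-8")
compressed = zlib.compress(text)


def read_slice(blob: bytes, start: int, length: int) -> bytes:
    """Return `length` bytes of the original text beginning at `start`.

    The compressed stream encodes later substrings as references to
    earlier output, so the bytes at `start` cannot be located directly
    in `blob`. This sketch simply inflates the whole stream and then
    slices the result.
    """
    return zlib.decompress(blob)[start:start + length]


word = read_slice(compressed, 4, 5)
```

Every call to read_slice pays for inflating the stream regardless of how few bytes are requested, which is the cost that makes this approach unattractive when lookups are frequent and scattered.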
Another conventional data compression technique is the dictionary method. The dictionary method stores redundant tokens in a separate dictionary, and the data is then compressed by replacing each instance of a token with a reference to the dictionary. The tokens, however, are chosen through human interaction; a person must determine which substrings are to be referenced with tokens. For natural language translation systems, requiring human interaction is not feasible for compressing linguistic data structures because of the large amounts of data involved. Thus, what is required is a method that compresses a data structure by replacing recurring segments with indexes to those segments, while allowing fast and random access to the data structure.
A method and system for reducing the amount of memory used while allowing fast and random access to linguistic data structures are described. In one embodiment, at least one segment within a data structure is identified. The occurrences of each identified segment within the data structure are counted. If the number of occurrences is greater than one, the segment is saved in a recurring data structure, and the segment is replaced in the data structure with an index corresponding to the segment in the recurring data structure.
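The embodiment summarized above can be sketched roughly as follows. This is a minimal illustration and not the claimed implementation: it assumes the data structure has already been divided into a flat list of hashable segments (how segments are identified is not specified here), and the names compress and get, as well as the "ref" and "lit" tags, are hypothetical.

```python
from collections import Counter


def compress(segments):
    """Replace every segment that occurs more than once with an index.

    Returns (compressed, recurring), where `recurring` plays the role
    of the recurring data structure and `compressed` holds either
    ("ref", index) or ("lit", segment) entries.
    """
    counts = Counter(segments)   # number of occurrences of each segment
    recurring = []               # recurring data structure
    index_of = {}                # segment -> its index in `recurring`
    compressed = []
    for seg in segments:
        if counts[seg] > 1:
            if seg not in index_of:
                # First time this recurring segment is seen: save it once.
                index_of[seg] = len(recurring)
                recurring.append(seg)
            compressed.append(("ref", index_of[seg]))
        else:
            # Segments that occur only once are stored unchanged.
            compressed.append(("lit", seg))
    return compressed, recurring


def get(compressed, recurring, i):
    """Random access to entry i without decompressing anything else."""
    kind, value = compressed[i]
    return recurring[value] if kind == "ref" else value


data = ["noun", "sing", "noun", "verb", "sing", "past"]
compressed, recurring = compress(data)
```

In this sketch, reading any one entry costs at most one extra lookup in the recurring data structure, so access remains fast and random, while each recurring segment is stored only once.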