This invention relates to the field of data compression, data decompression, data adaption to a data environment, and to the field of creating, managing and optimizing a data structure and its contents. The invention may also have uses in the fields of data recognition and artificial intelligence.
Data compressors read an input stream of symbols and, after reading an input symbol or group of input symbols, append one or more output codes (“compression code words”) to an output stream (“compressed stream”). The output code or group of output codes represents the input symbol or group of input symbols.
An output code may or may not have the same bit pattern as the last-read input symbol. The quantity of input symbols in an input stream may or may not equal the quantity of output codes in a corresponding compressed stream. When the quantity of bits in a compressed stream is less than the quantity of bits in the corresponding input stream, compression is achieved. In a given instance, a compressor may or may not achieve compression.
A decompressor reads a compressed stream and, after reading one or more codes in the compressed stream, transmits a symbol or group of symbols to an output stream (“decompressed stream”). In lossless compression, the bit pattern of a decompressed stream equals the bit pattern of the original input symbol stream.
If the quantity of codes in a compressed stream equals the quantity of symbols in the corresponding input stream, compression is achieved when the average bit length of output codes is less than the average bit length of input symbols. Output codes, like input symbols, may be of invariant or varying bit length.
If the quantity of codes in a compressed stream does not equal the quantity of symbols in the corresponding input symbol stream, compression is achieved when the quantity of bits in the compressed stream is less than the quantity of bits in the input symbol stream. In such a case there may be more or fewer codes than symbols, and in general there are fewer.
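The two conditions above reduce to a single comparison of total bit counts. The following minimal sketch illustrates this under the assumption that each stream is described simply by a list of per-unit bit lengths; the function name `compression_achieved` is hypothetical and does not come from the specification.

```python
def compression_achieved(input_symbol_bits, output_code_bits):
    """Return True when the compressed stream holds fewer bits than the input.

    Each argument is a list of bit lengths, one entry per symbol or code.
    This representation is an assumption made for illustration only.
    """
    return sum(output_code_bits) < sum(input_symbol_bits)


# Unequal counts: eight 8-bit symbols (64 bits) become three 16-bit codes (48 bits).
fewer_codes = compression_achieved([8] * 8, [16] * 3)

# Equal counts: the average code length (8 bits) matches the average symbol
# length (8 bits), so the totals are equal and no compression is achieved.
same_average = compression_achieved([8] * 4, [4, 4, 12, 12])
```

When the counts are equal, comparing totals is equivalent to comparing averages, which is why one test covers both paragraphs.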
Some compression-decompression (“codec”) systems compress contiguous repetitions of a symbol. Other codec apparatus does not encode contiguous repetitions of a symbol but assigns to each symbol a code whose bit length is inversely related to the frequency of occurrence, or anticipated frequency of occurrence, of the symbol in the input symbol stream. A further type of codec system builds a dictionary of repeated groups of symbols previously found in the present input stream; where a further group of symbols in the input symbol stream matches a group of symbols in the dictionary, the dictionary index of that symbol group, or its location in the earlier part of the input stream, is output as the compression code word. The rules used to compress an input symbol stream and decompress a respective compressed stream are often referred to as a “compression model”.
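Two of the codec families described above can be sketched briefly. The run-length functions illustrate compression of contiguous repetitions, and the dictionary functions follow the well-known LZ78 scheme, which is chosen here only as a familiar example of building a dictionary from the present input stream. Neither is the method of the present invention, and all function names are hypothetical.

```python
def rle_encode(symbols):
    """Collapse each run of a repeated symbol into a (symbol, count) pair."""
    runs = []
    for s in symbols:
        if runs and runs[-1][0] == s:
            runs[-1] = (s, runs[-1][1] + 1)
        else:
            runs.append((s, 1))
    return runs


def rle_decode(runs):
    """Expand (symbol, count) pairs back into the original string."""
    return "".join(s * n for s, n in runs)


def lz78_encode(text):
    """Emit (dictionary index, next symbol) pairs, growing the dictionary as we go."""
    dictionary = {"": 0}  # index 0 is the empty group
    codes = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch  # keep extending the current match
        else:
            codes.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:
        # Flush a trailing match that is already in the dictionary.
        codes.append((dictionary[phrase], ""))
    return codes


def lz78_decode(codes):
    """Rebuild the same dictionary from the codes and recover the text."""
    phrases = [""]
    out = []
    for index, ch in codes:
        phrase = phrases[index] + ch
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)
```

In both cases the decoder needs nothing beyond the codes themselves, which is one sense in which a dictionary can be implicitly embedded in a compressed stream.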
Codec systems and apparatus may be further characterized as static or adaptive. Static systems use a compression model which is invariant during a compression session, whereas in adaptive systems the model is dynamically modified by the compressor as a function of the symbols encountered so far in the current input symbol stream. Adaptive systems may provide better compression than static systems but not necessarily at lower cost.
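The distinction between static and adaptive models can be illustrated with a toy rank coder, in which each symbol is replaced by its position in an ordered alphabet. This is a minimal sketch under assumed names (`encode_ranks`, `decode_ranks`) and is not drawn from any particular prior system: in static mode the ordering never changes, while in adaptive mode the ordering is re-sorted by the symbol counts seen so far, and the decoder mirrors the same updates.

```python
def encode_ranks(symbols, alphabet, adaptive):
    """Replace each symbol by its current rank in the model's ordering."""
    order = list(alphabet)
    counts = {s: 0 for s in alphabet}
    ranks = []
    for s in symbols:
        ranks.append(order.index(s))
        if adaptive:
            # Update the model: more frequent symbols move toward rank 0.
            counts[s] += 1
            order.sort(key=lambda x: -counts[x])  # stable sort keeps ties in order
    return ranks


def decode_ranks(ranks, alphabet, adaptive):
    """Invert encode_ranks by applying the same model updates in step."""
    order = list(alphabet)
    counts = {s: 0 for s in alphabet}
    out = []
    for r in ranks:
        s = order[r]
        out.append(s)
        if adaptive:
            counts[s] += 1
            order.sort(key=lambda x: -counts[x])
    return "".join(out)


static_ranks = encode_ranks("cccc", "abc", adaptive=False)
adaptive_ranks = encode_ranks("cccc", "abc", adaptive=True)
```

In the adaptive case the repeated symbol quickly reaches rank 0, and small ranks could then be given short codes. The extra bookkeeping on every symbol is the cost referred to above.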
For example, when codec systems were first used in computers, computer processing time was very expensive and dictionary-based compression systems typically stored dictionaries for static re-use with later input symbol streams. Today, computer power is much cheaper, and now, typically, adaptive codec systems build dictionaries separately from each input stream which are then discarded after decompression and sometimes after compression. In some cases a dictionary is implicitly embedded in a compressed stream, and in other cases one is transmitted as a header to the transmission of the compressed stream to which it relates.
The objects of codec systems are reduction of information storage space, reduction of information transmission time, and consequent reduction of information processing cost.
Codec systems now common in personal computing may achieve these objectives, increasing available disk space and decreasing data transmission time from disk surface to application program. Furthermore, while digital images typically occupy more storage space than their analog counterparts, compressed digital images may occupy less space and achieve shorter transmission times. This has important implications for the digital storage, and the transmission over telephone links, of motion pictures, which are sequences of still images.
In order to achieve acceptance in a market place, a codec system typically must meet certain standards compared to its competitors. It should have good compression and decompression speeds, which are a function of the times required for compression and decompression. It should have a high compression ratio, which is a measure of how much space or transmission time is saved as a consequence of compression. It should be capable of adapting to different data environments, which means taking into account changes in the general qualities of data previously received, and increasing speed and compression ratio accordingly. And a lossless codec system must be reversible, which means that the bit pattern of a decompressed stream must be identical to the bit pattern of the respective input symbol stream.
Prior codec systems exist which exhibit the characteristics mentioned above. However, prior dictionary-based codec systems typically build a dictionary in respect of a current input stream, which might be one file, one archive or one session, and discard the dictionary after the respective compressed stream is decompressed or even after compression. This has the disadvantage of failing to compress groups of symbols which occur infrequently in the current input stream but which are commonly repeated in input streams in general.
Furthermore, such methods have the disadvantage of failing to compress groups of symbols which typically occur infrequently in input streams in general but which typically do occur in input streams, and when a number of input streams are considered together as a block, do occur frequently within a block.
Moreover, because the adaptivity of prior codec systems typically applies in respect of a current input stream, such systems cannot optimize compression in a network environment where there are many input symbol streams, and where optimization requires identifying and adapting to repeated symbol groups amongst the network traffic as a whole, and retaining and adapting to such information over time.
In addition, prior lossless codec systems typically encode all information in an input symbol stream into a compressed stream or compressed stream plus compression header, and transmit all such information together. Such transmissions contain the entire information content of the original input symbol stream. If the transmission is intercepted and the codec algorithms known, guessed or discovered then the intercepted transmission may be decompressed and the original symbol stream recovered. This is not ideal in today's sensitive business world. It would be better that some information in an original stream were not transmitted in the corresponding compressed stream. This would partly or completely prevent unauthorized decompression where only the compressed stream is in the possession of an interceptor. Were such absent information to change in character and quantity in an unpredictable way over time, and were such changes to be unique both in content and in manner of change to a given network, this would be even more advantageous in a competitive commercial world.
It is held that to store, update and re-use a dictionary would render a codec system uncompetitive, as stored dictionary entries would not be typical of input streams in general and the average compression ratio would suffer, and if the stored dictionary entries were typical of input streams then the dictionary would be so large that the time required to match a given input symbol group or decompress a given code word would increase processing time unacceptably. Moreover it is held that because of the large size and correspondingly slow compression and decompression speeds of such a dictionary, real-time compression and decompression over a communication link would not be practical.
It is argued, furthermore, that such a dictionary because of its large size could not cost-effectively be transmitted with the compressed stream, and therefore exactly the same dictionary should necessarily pre-exist at each end of a transmission, and this is not ideal.
It is generally asserted by those skilled in the art that there are limits to the ability of codec systems to increase network communication bandwidth, and that this limit now has been reached.
No known prior compression system uses a dictionary which adapts to and retains dictionary content from a plurality of input streams, which may be used interactively in real time over a communication system, and which overcomes the present perceived limitation to the bandwidth of information transmission.
In the field of hand writing, image, voice and other forms of data recognition, which is part of the field of artificial intelligence, relatively large amounts of information are stored in a compressed form and a match or approximate match is sought between an instance of a data type, for example, an image, and the stored information. Prior data type recognition systems have in general not proved to be fast.
The present invention goes some way towards overcoming the failures of the codec systems described above and provides a relatively fast and reversible codec method, apparatus and data structure with a persistent, resident, broadly adaptive dictionary, with optional supplementary dictionary. The dictionaries may be built from a plurality of input streams and optionally previously compressed streams, and may be employed in batch mode or real time over a communication system to compress and decompress information. The invention may be employed in the field of artificial intelligence, including data recognition, where data is retained in compressed form. When so employed, the present invention provides relatively fast access to compressed data in many cases.
In one aspect the invention provides a method and system for adapting a connection structure forming part of a dictionary in a computer memory device, and a method for adapting the entire dictionary.
In a further aspect the invention provides a method of enabling compression and decompression of symbol streams transmitted between two or more devices, such as a server and client devices in a network. A system including the devices is also provided.
In another aspect the invention provides a method and system for creating a dictionary for use in compression or decompression, by adapting the dictionary by way of additive or change related processes.
In a still further aspect the invention provides a dictionary containing both linked lists and binary search lists. In a yet further aspect the invention provides a method of operating a shift register for greater processing speeds as dictionaries are accessed.
Further aspects of the invention will become evident from the accompanying detailed description and drawings.
By way of example of compression, when an input stream of symbols which contains an instance of such a symbol group is received for compression, the index of the group, which is typically the address of the connection in the dictionary which represents the group, is stored or transmitted as the compression code word. For example, if the symbol group “ing and” is received and it is represented in the dictionary at connection address 12345, and no larger input symbol group is found in the dictionary which includes the symbol group “ing and”, then the number 12345 is transmitted as the respective compression code word. In the preferred embodiment, a connection address is a shifted virtual memory offset from near the start of the dictionary.
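The “ing and” example can be sketched as a longest-match lookup. The sketch below assumes, purely for illustration, that the dictionary is a plain Python mapping from symbol groups to connection addresses; it does not model the connection structure or the shifted virtual memory offsets of the preferred embodiment. The function name `longest_match` and the entry at address 777 are hypothetical.

```python
def longest_match(dictionary, stream, start):
    """Return (group, address) for the longest dictionary group at `start`.

    Returns None when no group beginning at `start` is in the dictionary.
    """
    best = None
    for end in range(start + 1, len(stream) + 1):
        group = stream[start:end]
        if group in dictionary:
            best = (group, dictionary[group])  # a longer match replaces a shorter one
    return best


# Hypothetical dictionary contents; only the 12345 entry comes from the text above.
example_dictionary = {"ing": 777, "ing and": 12345}

match = longest_match(example_dictionary, "ing and then", 0)
code_word = match[1]
```

Because no larger group containing “ing and” is present in this dictionary, the address 12345 is selected as the compression code word rather than the address of the shorter group “ing”.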
The present invention may be used in a variety of ways whose primary utility may not be limited to or may not relate to those described herein. The purpose or use of the present invention is therefore expressly not limited to the purpose and use exemplified in the present embodiment. The purpose and use of the present invention may form a sub-process of a further purpose and use including the purpose and use of data recognition systems.