1. Field of the Invention
The present invention relates to the processing of digital signals to effect variable-length coding (VLC) wherein execution proceeds over multiple data units concurrently.
2. Description of Related Art
Much of the putative parallel-processing art in VLC and its subclasses, such as Huffman coding, is labeled as such despite lacking the ability to process multiple VLC data units simultaneously. Instead, such methods act on just one unit at a time, producing a result more quickly by examining several limited aspects of the problem at once. True parallel processing is rare in the art as a result of a fundamental incongruity between VLC and parallelism.
The digital signal processing technique, VLC, entails the translation of multiple-bit units of data between fixed-length and variable-length representations, often for the purpose of achieving an aggregate reduction in data in the variable-length domain. The fixed-length units, termed symbols, conveniently partition unencoded data. A symbol's unique variable-length encoded counterpart, the codeword, serves the role of an intermediary, facilitating storage and/or data transmission until it may be decoded back into a usable symbol. In canonical form, a VLC system comprises an encoder, a decoder and an intervening channel providing unidirectional communication of coded data from the encoder to the decoder. An encoder generates a contiguous stream of codewords of uneven extents in ordered correspondence with the input symbol stream, while a decoder reverses this process.
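The correspondence between symbols and codewords may be illustrated by the following sketch, which uses an invented three-symbol prefix code; the symbols, codewords and function names are hypothetical and serve illustration only:

```python
# Sketch of variable-length coding with a hypothetical prefix code.
# The symbols and codewords below are illustrative inventions, not
# drawn from any code discussed in the art cited herein.
CODE = {"a": "0", "b": "10", "c": "11"}          # symbol -> codeword
INVERSE = {cw: s for s, cw in CODE.items()}      # codeword -> symbol

def encode(symbols):
    """Concatenate each symbol's codeword into one contiguous stream."""
    return "".join(CODE[s] for s in symbols)

def decode(bits):
    """Recover symbols by matching prefixes of the contiguous stream.

    Because the code is prefix-free, the accumulated buffer matches a
    codeword exactly when that codeword is complete.
    """
    symbols, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in INVERSE:                       # a complete codeword
            symbols.append(INVERSE[buf])
            buf = ""
    return symbols

stream = encode(["a", "c", "b", "a"])            # -> "011100"
assert decode(stream) == ["a", "c", "b", "a"]
```

Note that the encoded stream is a single undifferentiated bit sequence; nothing in it marks where one codeword ends and the next begins.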
In VLC, length is either an embedded or an implicitly associated attribute of each codeword. Therefore, apart from knowledge internalized with each codeword, the coded stream has no discernible demarcation between codewords. Only by accounting for the span of all codewords succeeding a known reference can a subsequent codeword boundary be deduced. Hence, a serial dependency is established on the process of resolving codewords within a contiguous stream.
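The serial dependency just described may be illustrated as follows; the code table is an invented example, and the point is that each boundary is merely the running sum of all preceding codeword lengths, each of which is known only by matching, i.e., effectively decoding, the codeword before it:

```python
# Sketch: codeword boundaries in a contiguous VLC stream can only be
# resolved serially.  The code table is a hypothetical illustration
# (codewords "0", "10", "11").
LENGTH_OF = {"0": 1, "10": 2, "11": 2}           # length per codeword

def boundaries(bits):
    """Each boundary is the running sum of preceding codeword lengths."""
    pos, marks = 0, [0]
    while pos < len(bits):
        # Determining the length is tantamount to decoding: the codeword
        # starting at 'pos' must be matched before the next boundary of
        # the stream can be known.
        cw = bits[pos] if bits[pos] == "0" else bits[pos:pos + 2]
        pos += LENGTH_OF[cw]
        marks.append(pos)
    return marks

assert boundaries("011100") == [0, 1, 3, 5, 6]
```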
As a general principle, parallel processing requires a method of separating data into independent elements for distribution among multiple processing elements (PEs). To the extent that the data are not independent, processing remains subject to serial dependencies. In VLC, irregular stream delimitation precludes the selection of independent data prior to ascertaining the extent of each preceding codeword. Unfortunately, determining codeword lengths is tantamount to decoding if length information is implicit, and nearly so if it is embedded. As a result of this dilemma, fully concurrent processing has not heretofore been achieved in a VLC context.
The article, H. D. Lin and D. G. Messerschmitt, "Designing a High-Throughput VLC Decoder Part II", IEEE Transactions on Circuits and Systems for Video Technology, pp. 197-206, 1992, proposes forcing codeword alignment at fixed intervals. In effect, the codeword stream is divided into equal-sized multiple-codeword blocks in which the first codeword of each block is justified with the block boundary. Rather than processing individual codewords in parallel, the larger blocks are distributed among PEs. Where the final codeword would otherwise extend beyond the block boundary, it is simply detained, becoming the first codeword of the subsequent block. As a result of this practice, voids of varying length are introduced at the end of a majority of the blocks. These lacunae in the coded transmission clearly contravene the principal motivation for VLC, data compression. If blocks are made longer to amortize this compressive inefficiency, more buffering memory is needed; in general, a prohibitive quantity of memory is required to store the large blocks central to this scheme. Moreover, as a result of this buffering, a long latency is incurred between the arrival of input and the emission of output.
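The alignment scheme described above may be sketched as follows; the block size and codewords are invented for illustration, and codewords are assumed never to exceed the block length:

```python
# Sketch of the fixed-interval alignment discussed above: a codeword that
# would straddle a block boundary is deferred to the next block, and the
# resulting void is padded with filler bits.  Block size and codewords
# are illustrative, and each codeword is assumed to fit within a block.
def pack_blocks(codewords, block_bits):
    blocks, current = [], ""
    for cw in codewords:
        if len(current) + len(cw) > block_bits:       # would cross boundary
            current += "0" * (block_bits - len(current))  # pad the void
            blocks.append(current)
            current = ""
        current += cw
    current += "0" * (block_bits - len(current))
    blocks.append(current)
    return blocks

# Four codewords of lengths 3, 3, 2, 3 packed into 4-bit blocks: the
# blocks end in wasted padding bits of varying count.
blocks = pack_blocks(["101", "110", "01", "111"], 4)
assert blocks == ["1010", "1100", "0100", "1110"]
```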
In the same article, a variation on the foregoing method is discussed in which several multiple-codeword blocks are again distributed among as many PEs. However, no attempt is made to align codewords with block boundaries. Instead, each block is examined over many ramifications resulting from all possible codeword initiation sites within the block. Problem growth is typically substantial, since the lengthiest codeword dictates the number of potential codeword origination points. The correct decoding among all alternatives is identified upon receipt of the terminal location within the previous block. Hence, serial dependence between codewords, while not completely eliminated, has been deferred to the final steps. Although no compression-rate overhead is suffered by this second approach, as a consequence of the trial processing of so many ramifications, memory requirements are many multiples of those of the preceding memory-intensive approach. Furthermore, all but one of the ramified processing courses undertaken for each block are discarded, resulting in exorbitant computational waste. Within each concurrent block, either each course is simultaneously computed through a more deeply nested level of concurrency, thus multiplying hardware excessively, or each course is dealt with in turn, slowing computation proportionally. In the latter circumstance, an exceptionally high parallelism factor, exceeding the longest codeword length, is necessary to derive any performance benefit at all from the approach.
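The ramified variant may be sketched thus; the code table is again an invented example whose maximum codeword length of two bits yields two speculative decodings per block:

```python
# Sketch of the ramified variant: each block is speculatively decoded
# from every possible starting offset, and the correct branch is selected
# only once the terminal position of the previous block is known.  The
# code table is illustrative; its maximum codeword length of 2 gives 2
# candidate offsets per block.
INVERSE = {"0": "a", "10": "b", "11": "c"}

def decode_from(bits, offset):
    """Decode a block assuming it begins at 'offset' bits in."""
    symbols, buf = [], ""
    for bit in bits[offset:]:
        buf += bit
        if buf in INVERSE:
            symbols.append(INVERSE[buf])
            buf = ""
    return symbols, len(buf)      # residue spills into the next block

block = "11001"
candidates = {off: decode_from(block, off) for off in range(2)}
# Serial dependency survives: choosing among the candidates still awaits
# the terminal offset of the preceding block; the losing branch is waste.
assert candidates[0] == (["c", "a", "a"], 1)
assert candidates[1] == (["b", "a"], 1)
```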
In U.S. Pat. No. 5,841,380 a similarly ramified approach to neutralizing serial dependency is suggested, but on the scale of individual codewords rather than blocks. In that exemplar, two codewords are decoded in tandem, with the first processed in the normal fashion and the second, whose location within the stream is as yet unidentified, processed from all potential initiation sites in parallel. Upon the conclusion of decoding, the correct result for the second codeword is isolated from among the many provisional decodings in accordance with the discovered length of the first codeword. Although the approach might have been extended to more than two codewords, as was suggested in the previously described block-based method, the restriction to just two codewords in the case cited is in keeping with the undeniably impractical growth in complexity attending such ramified methodologies.
U.S. Pat. No. 5,202,967 attempts to circumvent ramifications by placing format restrictions on every other codeword such that, in a two-processor system, the second processor can identify and thus decode the second codeword without reference to the first. This special format violates the basic principles of VLC, however.
Also in the aforementioned article, as well as in the article, K. K. Parhi, "High-Speed VLSI Architectures for Huffman and Viterbi Decoders," IEEE Transactions on Circuits and Systems II, vol. 39, no. 6, 1992, the decoding problem is recast in a circuit-dependency framework known as the finite state machine (FSM). Exploiting known parallelizing transformations on FSMs, the computation can be expanded over duplicate nodes, producing a concurrent process. Since codewords are involved directly in the formulation of the FSM, the resultant hardware is not programmable. Thus, the FSM design process must be repeated and new hardware fabricated for each new code. As intimated in the former citation and admitted in the latter, the approach is chiefly of academic interest since it is only applicable for symbols drawn from artificially diminutive sets. Furthermore, as demonstrated in the latter article, the manipulations required to arrive at the final result, and accordingly the final result itself, are quite complicated.
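The table-driven character of such an FSM decoder may be illustrated as follows; the states and transitions below are invented for an example code, and the fact that the transition table is constructed directly from the codewords is precisely why the resulting hardware is fixed to a single code:

```python
# Sketch of a decoder recast as a finite state machine: each state
# represents a partially matched codeword prefix, and the transition
# table is built directly from the codewords themselves, which is why
# the resulting (non-programmable) hardware is bound to one code.
# The code table ("0", "10", "11") is an invented illustration.

# States are codeword prefixes; "" is the start state.  Each entry maps
# (state, input bit) -> (next state, emitted symbol or None).
NEXT = {("", "0"): ("", "a"),
        ("", "1"): ("1", None),
        ("1", "0"): ("", "b"),
        ("1", "1"): ("", "c")}

def fsm_decode(bits):
    state, out = "", []
    for bit in bits:
        state, emit = NEXT[(state, bit)]
        if emit:
            out.append(emit)
    return out

assert fsm_decode("011100") == ["a", "c", "b", "a"]
```

A new code requires a wholly new transition table, mirroring the cited articles' observation that the FSM design process must be repeated for each code.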
In the article, P. G. Howard and J. S. Vitter, "Parallel Lossless Image Compression Using Huffman and Arithmetic Coding," in Proceedings of the Data Compression Conference, pp. 229-308, Snowbird, Utah, March, 1992, block processing is again suggested. In this case, each codeword in a block is disassembled bit by bit, commencing with the first bit and proceeding to the last. All the first bits are juxtaposed, followed by all the second bits, and so on. These are distributed among multiple decoding PEs according to the state of decoding. Short codewords are decoded before longer codewords. PEs processing such codewords can signal their completion, avoiding the receipt of later bits. If the number of PEs is less than the number of codewords found within a block, several codewords may be allocated to each PE. Those PEs that decode an allotment of short codewords may obtain codewords from PEs with a backlog, thereby distributing the computing load. Toward the end of the block, it is expected that the number of unclaimed codewords will be insufficient to keep all PEs occupied. Thus, provision is made for idle PEs to be disabled. Unfortunately, workload distribution and PE disablement are not trivial to implement. Worse, workload distribution introduces serial dependency, which interrupts parallelism. As well, PE disablement merely attenuates, but does not eliminate, wasted computation cycles. A further significant difficulty with this technique is its reliance on parallel random-access memory (PRAM), a bulky and complex arrangement which severely impairs parallel operations in practice. Besides a reduction in performance, the use of PRAM enforces undesirable limits on parallel scalability, i.e., the factor by which PEs may be multiplied.
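The bit-by-bit disassembly described above may be sketched as follows, using invented codewords:

```python
# Sketch of the bit-transposed block layout described above: all first
# bits of a block's codewords are juxtaposed, then all second bits, and
# so on, so that short codewords are delivered in full (and their PEs
# may signal completion) after only the earliest bit-planes.  The
# codewords are invented for illustration.
from itertools import zip_longest

def transpose_bits(codewords):
    """Return the block as successive bit-planes across all codewords."""
    cols = zip_longest(*codewords, fillvalue=None)
    return ["".join(b for b in col if b is not None) for col in cols]

# Codewords of lengths 1, 3, 2: after the first bit-plane the one-bit
# codeword is complete, and its PE need receive no later bits.
assert transpose_bits(["0", "110", "10"]) == ["011", "10", "0"]
```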
U.S. Pat. No. 5,623,262 offers a scheme for decoding multiple codewords simultaneously without concurrent hardware. This involves recasting the mapping between symbols and codewords. Specifically, multiple-symbol amalgams are associated with multiple-codeword amalgams formed by concatenating, in order, the codewords corresponding to the individual symbols of the symbol amalgam. Consequently, a codeword amalgam instantly decodes to as many symbols as are encoded therein. The drawbacks of this method are clear: for practical benefit, the recast mapping must be made dramatically larger than the original, and the quantity of symbols recovered at each step varies.
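The recast mapping may be sketched for the case of two-symbol amalgams, using an invented base code; the growth of the table relative to the original is immediately apparent:

```python
# Sketch of the amalgam recasting: pairs of symbols are mapped to the
# concatenation of their individual codewords, so a single table lookup
# recovers two symbols at once.  The base code is an invented example.
from itertools import product

CODE = {"a": "0", "b": "10", "c": "11"}

# The recast table grows quadratically for pairs (and exponentially for
# longer amalgams) relative to the original mapping.
PAIR_CODE = {s1 + s2: CODE[s1] + CODE[s2] for s1, s2 in product(CODE, CODE)}

assert len(PAIR_CODE) == len(CODE) ** 2        # 9 entries versus 3
assert PAIR_CODE["ab"] == "010"
```

Note also that the decoded amalgams yield a variable number of bits per lookup, reflecting the varying per-step symbol yield remarked upon above.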
Beyond the preceding approaches, two prevalent variations on parallelism in the art are found. The first, as broached previously, involves bringing several operations to bear on a single codeword simultaneously to speed the production of the result. This style of processing is represented in U.S. Pat. Nos. 5,032,838, 5,757,295, and 5,798,719. Since these do not endeavor to process multiple codewords in unison, their ostensible parallelism is unrelated to the full inter-codeword parallelism of the present invention.
The second style of parallelism superficially appears more material, but is actually quite distinct. In those approaches, multiple independent symbol streams are transformed by as many encoders into distinct codeword streams. These may or may not be coalesced into a single stream during the interim between encoding and decoding. Regardless, by the time these streams are redistributed to their corresponding decoders they are once again separate, independent codeword streams. Of course, multiple independent symbol streams are recovered at the decoders. The key to parallelism of this kind lies entirely in the independence of the streams. In fact, there is no novelty in the parallelism of the foregoing at all. Consider that there may be, at any one time, legion unrelated VLC systems around the globe processing undeniably independent data. The simple collocation of two or more such disparately purposed systems on the same silicon chip does not make them parallel in any more informative sense of the word than is contained in the assertion that two workers performing different duties in the same office are working in parallel. This is not to say that methods incorporating this concept do not possess other novel attributes, though.
Most of the art relying on the independent-stream mode of parallelism is unique by virtue of the way separability and reassembly are obtained for data speciously held to be monolithic. For example, in U.S. Pat. No. 5,379,070, JPEG images are sliced vertically into separate, independent data segments. To facilitate synchronization and reassembly, various fiducial data are interposed on these segments. These segments may then be VLC encoded, transmitted and decoded independently. An elaborate process of reconstruction, unrelated to VLC coding, then ensues. U.S. Pat. No. 5,568,139 focuses on image data as well. The invention is particularly premised on the feasibility of dividing raster images along transverse boundaries, each segment thereby consisting of a fixed quantity of image lines. These segments, being distinct, are manipulated in the independent-stream mode of parallelism after being distributed among multiple decoders. Because the method of separating image blocks is simple, so too is the reconstruction process. Another invention employing independent-stream-derived concurrency is found in U.S. Pat. No. 5,654,703, where the content of the data is not constrained but the requirement of independence is maintained. The author calls upon the example of the separate color components, luminance, chrominance and saturation, as three independent data sets for which the invention would be particularly well suited. The unique feature of this invention is found in an involuted manner of blending distinct encoded data into a unified stream and, accordingly, extracting these again into separate streams. In fact, this process suffers from contention, which adversely impacts performance.
U.S. Pat. No. 5,808,752 demonstrates a very basic application of independent-stream parallelism for a two-sided image reading apparatus. In this case, data from the top side of a page constitutes one stream while data from the bottom forms the other. These are obviously independent data.