The present invention relates to a digital video transcoder architecture.
The following acronyms and terms are used:
AC - Alternating Current (DCT coefficient)
ALU - Arithmetic Logic Unit
B - Bidirectionally-predictive coded (MPEG)
Back. - Backward (MPEG)
BSP - Bitstream Processor
CBP - Coded Block Pattern (MPEG)
CBR - Constant Bit Rate
Chan. - Channel
CPU - Central Processing Unit
D$ - Data cache
DC - Direct Current (DCT coefficient)
DCT - Discrete Cosine Transform
DMA - Direct Memory Access
DPCM - Differential Pulse Code Modulation
DPRAM - Dual Port RAM
DS - Data Streamer
DTS - Decode Time Stamp
FIFO - First-In, First-Out
FIR - Finite Impulse Response
FPGA - Field-Programmable Gate Array
Fwd. - Forward (MPEG)
H/V - Horizontal/Vertical
HW - Hardware
I - Intra-coded (MPEG) or Integer
I$ - Instruction cache
I/F - Intermediate Frequency
I2C - IIC - Inter-integrated circuit (a serial bus standard)
I2S - IIS - Inter-IC sound (a 3-wire digital stereo PCM audio interconnect)
iBlk - Input block FIFO
IDCT - Inverse DCT
IEC - International Electrotechnical Commission
IFG - Integer, Floating point and Graphics (as in IFG-ALU); an Equator acronym
IG - Integer, Graphics unit
iMB - Input macroblock FIFO
Info. - Information
Int. - Interface
iRB - Input rate buffer
ISR - Interrupt Service Routine
ITU - International Telecommunication Union
JTAG - Joint Test Action Group (IEEE 1149.1 protocol)
KB - Kilobyte
LRU - Least Recently Used (a cache line replacement algorithm)
MAP - Media Accelerated Processor (Equator)
MB - Megabyte or Macroblock
MC - Motion Compensation
MTS - MPEG Transport Stream
MUX - Multiplexer
NC - Non-coherent Connect
NOP - No Operation
NTSC - National Television System Committee
oBlk - Output block FIFO
oMB - Output macroblock FIFO
oRB - Output rate buffer
P - Predictive-Coded (MPEG)
PCR - Program Clock Reference (MPEG)
PES - Packetized Elementary Stream (MPEG)
Pic. - Picture
PID - Packet Identifier
PTS - Presentation Time Stamp (MPEG)
QL - Quantization Level
RAM - Random Access Memory
RAMDAC - RAM Digital-to-Analog Converter
Ref. - Reference
Reg. - Register
RGB - Red Green Blue
RISC - Reduced Instruction Set Computer
RL - Run Length (or run-level pair)
ROM - Read-Only Memory
RTOS - Real-Time Operating System
Rx - Receiver
SAV - Start of Active Video
SC - Start Code
SDRAM - Synchronous Dynamic Random Access Memory
SGRAM - Synchronous Graphics Random Access Memory
SIMD - Single Instruction, Multiple Data
Svc. - Service
SW - Software
T1 - A US telephony digital line standard with a data rate of 1.544 Mbps
TCI - Transport Channel Input
TLB - Translation Lookaside Buffer (part of MMU)
TMC - Transcoder Multiplexer Core
TPE - Transcoder Processing Element
T-STD - Transport System Target Decoder
tVLD - Transcoder VLD
tVLE - Transcoder VLE
Tx - Transmitter
VBV - Video Buffering Verifier (MPEG)
VLD - Variable-Length Decoding
VLE - Variable-Length Encoding
VLIW - Very Long Instruction Word
The transmission of digital video data, e.g., via broadband communication systems such as cable television or satellite television networks, has become increasingly popular. Source video sequences can be pre-encoded at any rate, which may be a constant bit rate (CBR) or variable bit rate (VBR), to form pre-compressed or live bitstreams. For many applications, however, the pre-compressed bitstreams must conform to specific allowable, or otherwise desirable, formats and bit rates. Accordingly, a transcoder is used to change the bit rate, format or other characteristics of the video data prior to communicating it, e.g., to a set-top box and/or some intermediate point in a network.
A number of transcoders are often used in a statistical multiplexer that receives a number of compressed bitstreams, decompresses the bitstreams (at least partially), then recompresses them at a different rate by allocating a quantization parameter based on some statistical property of the bitstreams such as picture complexity.
The bitstreams are typically compressed according to a known video coding standard, such as MPEG.
A transmux refers to a combination of multiple single-service video transcoders and a statistical multiplexer that dynamically sets the video bit rate. Usually, this is done to perceptually equalize the quality of the video services within a statmux group, i.e., allocate more bits to services containing a difficult-to-compress video, and fewer bits to services containing easier-to-compress video.
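The rate allocation described above can be illustrated with a minimal sketch. It assumes a simple proportional rule in which each service in the statmux group receives a share of the group bit rate proportional to a complexity measure. The function name, the complexity values and the optional per-service minimum rate are hypothetical and are not taken from this specification.

```python
def allocate_bit_rates(complexities, total_rate_bps, min_rate_bps=0.0):
    """Split a statmux group's total rate in proportion to complexity.

    Hypothetical rule for illustration only: each service first gets
    min_rate_bps, then the remaining rate is shared in proportion to
    its complexity measure.
    """
    if not complexities:
        return []
    n = len(complexities)
    floor_total = min_rate_bps * n
    if floor_total > total_rate_bps:
        raise ValueError("minimum rates exceed the group's total rate")
    spare = total_rate_bps - floor_total
    total_complexity = sum(complexities)
    if total_complexity <= 0:
        # No usable complexity information: share the spare rate equally.
        return [min_rate_bps + spare / n for _ in complexities]
    return [min_rate_bps + spare * c / total_complexity for c in complexities]


# A difficult service (complexity 3.0) and an easy one (complexity 1.0)
# sharing an 8 Mbps group.
rates = allocate_bit_rates([3.0, 1.0], total_rate_bps=8_000_000)
```

In this example the harder-to-compress service receives three quarters of the group rate. A practical statmux would also have to respect decoder buffer constraints, which this sketch ignores.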
However, the development of a transmux architecture must address various needs. In particular, it would be desirable to provide a transmux architecture that is fully software implemented. This provides great flexibility by allowing the transmux functions to be changed in the field (e.g., at cable television headends and the like) by changing only the software rather than the hardware. The architecture is also suitable for use in computer networks such as the Internet.
This is advantageous since it allows upgrading of transmuxes to handle new functions, fix hardware or software problems ("bugs"), test new processes, adapt to changing customer requirements, and so forth. These tasks are all easier with a "soft" (software-based) transmux.
Moreover, savings in design time and expenses can result.
The transmux should provide good computational efficiency and allow a relatively small physical size (footprint).
The transmux should be implementable using a readily-available media processor, such as a VLIW media processor.
The transmux should perform full transcoding, including keeping track of frame-to-frame requantization errors.
The transmux should provide scheduling of multiple transcoding threads (including buffer management, processor management, and the like) with combinations of both transcoded video PIDs and pass-through data/audio services on a single processor without the use of an RTOS.
The transmux should use a load balancing algorithm, specific to the transcoding task, for keeping a VLIW processor and co-processor from waiting on each other, and to obtain high levels of computational throughput.
The transmux should provide decomposition of a transcoding algorithm into components which run on a VLIW processor and a co-processor, subject to co-processor memory constraints, instruction cache size and data cache size.
The transmux should provide case-wise specialization of a simplified transcoding algorithm to the different MPEG-2 picture types (I, P or B) and macroblock coding type (intra-coded or inter-coded), and account for whether the quantization step size increases or decreases during transcoding.
The transmux should provide an overall software architecture for a transcoder processing element (TPE).
The present invention provides a transmux design having the above and other advantages.
The present invention relates to a digital video transmux architecture.
A transcoding process in accordance with the invention can be decomposed into the following five steps. A VLIW core processing resource and a BSP processing resource are allocated for the different steps as indicated. The BSP handles the sequential bitstream packing and unpacking tasks. On the MAP-2000CA, the BSP is VLx co-processor 131, which is a VLD/VLE co-processor, although similar devices from other suppliers may be used. The BSP runs multiple transcoding VLE and VLD threads (processing loops). A corresponding architecture is shown in FIG. 1(a).
a) MPEG transport stream decoding (on VLIW core) (10)
b) MPEG video elementary stream variable length decoding (VLD) (on BSP) (20)
c) Core transcoding (on VLIW core) (30), consisting generally of
c.1) inverse quantization (of VLD output)
c.2) spatial error motion compensation (and addition to IDCT out)
c.3) forward DCT of motion compensated error
c.4) add error (now in DCT domain) to inverse quantization results
c.5) forward quantization (to form the VLE input)
c.6) inverse quantize VLE input and subtract from fwd. quant. input to form cumulative error in DCT domain
c.7) inverse DCT of error to form spatial error. This error is stored in reference frame buffer to be used in future pictures (step c.2) which reference this image.
d) MPEG video elementary stream variable length encoding (VLE) (on BSP) (40)
e) MPEG transport stream encoding (encapsulates the transcoded video elementary streams) (on VLIW) (50).
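Steps c.1 to c.7 above can be sketched for a single block as follows. This is a simplified illustration under stated assumptions, not the implementation of the invention: it uses a one-dimensional 8-point block instead of an 8x8 block, a plain uniform quantizer instead of the MPEG-2 quantizer matrices, and it assumes the caller has already fetched the motion-compensated error block from the reference frame buffer (step c.2). All function and parameter names are hypothetical.

```python
import math

N = 8  # 1-D block length for this sketch; MPEG-2 uses 8x8 blocks


def _scale(k):
    # Orthonormal DCT scaling factor for coefficient index k.
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)


def dct(x):
    """Orthonormal DCT-II of a length-N sequence."""
    return [
        _scale(k) * sum(
            x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
            for n in range(N)
        )
        for k in range(N)
    ]


def idct(coeffs):
    """Inverse of dct() (orthonormal DCT-III)."""
    return [
        sum(
            _scale(k) * coeffs[k]
            * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
            for k in range(N)
        )
        for n in range(N)
    ]


def transcode_block(levels_in, step_in, step_out, mc_error):
    """Requantize one block while tracking the requantization error.

    levels_in: quantized levels produced by the VLD
    step_in:   uniform quantizer step size of the input stream (assumed)
    step_out:  uniform quantizer step size of the output stream (assumed)
    mc_error:  motion-compensated spatial error from the reference
               picture (all zeros when no prediction is used)
    Returns (levels_out, spatial_error).
    """
    coeffs = [lv * step_in for lv in levels_in]            # c.1
    err_dct = dct(mc_error)                                # c.2, c.3
    target = [c + e for c, e in zip(coeffs, err_dct)]      # c.4
    levels_out = [round(t / step_out) for t in target]     # c.5
    new_err = [t - lv * step_out                           # c.6
               for t, lv in zip(target, levels_out)]
    spatial_error = idct(new_err)                          # c.7
    return levels_out, spatial_error
```

The returned spatial error is what would be written to the reference frame buffer, so that a later picture predicted from this one can compensate for the error introduced by the coarser quantizer.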
Thus, the parsing/demux 10 provides transport stream de-assembly, the VLD 20 provides ES de-assembly, the VLE 40 provides ES re-assembly, and the remux/packetization 50 provides transport stream re-assembly.
Generally, the transcoding of a video access unit (coded picture) is accomplished in N computational steps by M processing elements, where the results of certain steps are used as inputs to later steps (the specific steps in our case are listed above). Each of the N steps may be executed on different processing elements subject to the data dependencies imposed by the algorithm.
One of the key concepts of this invention is that each coded picture is represented by a data structure (the Picture structure) which is passed between the processing elements. Queues of pictures exist at the input to each function (step) which may be serviced by one or more processing elements.
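The idea of a per-picture data structure moving through queues at the input of each step can be sketched as below. The field names, the step names and the polling scheduler are assumptions made for illustration; the actual Picture structure is not specified in this passage.

```python
from collections import deque
from dataclasses import dataclass, field

# The five steps a) to e), in processing order (names are hypothetical).
STEPS = ["ts_decode", "vld", "core_transcode", "vle", "ts_encode"]


@dataclass
class Picture:
    """Hypothetical stand-in for the Picture structure."""
    pid: int                  # PID of the video service
    picture_type: str         # "I", "P" or "B"
    payload: bytes = b""
    history: list = field(default_factory=list)  # steps completed so far


class Pipeline:
    def __init__(self):
        # One queue of pictures at the input to each step.
        self.queues = {step: deque() for step in STEPS}
        self.done = []

    def submit(self, picture):
        self.queues[STEPS[0]].append(picture)

    def service(self, step):
        """Process one picture waiting at `step`; return False if idle."""
        q = self.queues[step]
        if not q:
            return False
        pic = q.popleft()
        pic.history.append(step)  # placeholder for the real work
        i = STEPS.index(step)
        if i + 1 < len(STEPS):
            self.queues[STEPS[i + 1]].append(pic)
        else:
            self.done.append(pic)
        return True

    def run_until_idle(self):
        """Poll every queue in turn until no step has work left."""
        busy = True
        while busy:
            busy = False
            for step in STEPS:
                busy = self.service(step) or busy
```

Because each step only reads from its own input queue and writes to the next one, any processing element that is free can call `service()` for a step it is able to execute, which is the property the text relies on.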
A particular transcoder apparatus in accordance with the invention includes at least a first transcoder processing element (TPE) for receiving input channels of a transport stream comprising compressed digital video data, and first and second processing resources associated with the first TPE for providing data de-assembly, core transcoding, and data re-assembly. Additionally, a queue is associated with the first processing resource for queuing data received from the second processing resource prior to processing at the first processing resource, and a queue is associated with the second processing resource for queuing data received from the first processing resource prior to processing at the second processing resource. The first and second processing resources are operable in parallel, at least in part, for providing the data de-assembly, core transcoding, and data re-assembly.
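The arrangement of two processing resources, each with an input queue fed by the other, can be sketched with two worker threads. This is only a structural illustration: Python threads stand in for the VLIW core and the BSP, the per-step work is replaced by a trace entry, and the names `run_transcoder`, `ROUTE` and `VLIW_STEPS` are hypothetical.

```python
import queue
import threading

# Processing order of the five steps and the steps assumed to run on the
# VLIW-like resource; the remaining steps run on the BSP-like resource.
ROUTE = ["ts_decode", "vld", "core_transcode", "vle", "ts_encode"]
VLIW_STEPS = {"ts_decode", "core_transcode", "ts_encode"}


def run_transcoder(picture_ids):
    to_vliw = queue.Queue()   # filled by the input and by the BSP worker
    to_bsp = queue.Queue()    # filled by the VLIW worker
    finished = queue.Queue()
    stop = object()           # sentinel that ends a worker loop

    def worker(inbox, name):
        while True:
            item = inbox.get()
            if item is stop:
                break
            pic_id, stage, trace = item
            trace = trace + [(ROUTE[stage], name)]  # placeholder for work
            nxt = stage + 1
            if nxt == len(ROUTE):
                finished.put((pic_id, trace))
            elif ROUTE[nxt] in VLIW_STEPS:
                to_vliw.put((pic_id, nxt, trace))
            else:
                to_bsp.put((pic_id, nxt, trace))

    threads = [
        threading.Thread(target=worker, args=(to_vliw, "vliw")),
        threading.Thread(target=worker, args=(to_bsp, "bsp")),
    ]
    for t in threads:
        t.start()
    for pic_id in picture_ids:
        to_vliw.put((pic_id, 0, []))
    # Every picture has left both queues once it appears in `finished`.
    results = dict(finished.get() for _ in picture_ids)
    to_vliw.put(stop)
    to_bsp.put(stop)
    for t in threads:
        t.join()
    return results
```

While one picture is in a variable-length decoding or encoding step on the second resource, the first resource is free to work on a different picture, which is how the two resources can operate in parallel at least in part.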
Single or multi-threaded implementations may be used for the VLIW and BSP processors.
A corresponding method is also presented.