This invention relates to a technique for processing a digital voice signal, in the fields of application of packet voice communication and digital voice storage. More particularly, the invention relates to a data embedding technique in which a portion of encoded voice code (digital code) that has been produced by a voice compression technique is replaced with optional data to thereby embed the optional data in the encoded voice code while maintaining conformance to the specifications of the data format and without sacrificing voice quality.
Such a data embedding technique, in conjunction with voice encoding techniques applied to digital mobile wireless systems, packet voice transmission systems typified by VoIP, and digital voice storage, is in growing demand and is becoming more important both as a digital watermark technique, through which the concealment of communication is enhanced by embedding copyright or ID information in a transmit bit sequence without affecting the bit sequence, and as a functionality extending technique.
The explosive growth of the Internet has been accompanied by increasing demand for Internet telephony, in which voice data is transmitted by IP packets. The transmission of voice data by packets has the advantage of making possible the unified transmission of different media, such as commands and image data. Until now, however, multimedia communication has mainly consisted of transmitting each medium independently over a different channel. Further, though services through which telephone rates for users are lowered by the insertion of advertisements and the like are also available, such services are provided only at the outset when the call is initiated. In addition, since the transmission format of packetized voice data is well known, a problem arises in terms of concealment of information. With this as a background, digital watermark techniques for embedding copyright information in compressed voice data (code) have been proposed.
In order to raise the efficiency of transmission, voice encoding techniques for the highly efficient compression of voice have been adopted. In particular, in the area of VoIP, voice encoding techniques such as those compliant with G.729 standardized by the ITU-T (International Telecommunication Union, Telecommunication Standardization Sector) are dominant. Voice encoding techniques such as AMR (Adaptive Multi-Rate) standardized by 3GPP (3rd Generation Partnership Project) have also been adopted in the field of mobile communications. What these techniques have in common is that they are based upon an algorithm referred to as CELP (Code Excited Linear Prediction). Encoding and decoding schemes compliant with G.729 are as set forth below.
Structure and Operation of Encoder
FIG. 41 is a diagram illustrating the structure of an encoder compliant with ITU-T Recommendation G.729. In FIG. 41, an input signal (voice signal) X of a predetermined number (=N) of samples per frame is input to an LPC (Linear Predictive Coding) analyzer 1 frame by frame. If the sampling frequency is 8 kHz and the duration of one frame is 10 ms, then one frame will be composed of 80 samples. The LPC analyzer 1, which regards the human vocal tract as an all-pole filter represented by the following equation, obtains filter coefficients $\alpha_i$ (i=1, . . . , M), where M represents the order of the filter:
$$H(z) = \frac{1}{1 + \sum_{i=1}^{M} \alpha_i z^{-i}} \tag{1}$$
Generally, in the case of voice in the telephone band, a value of 10 to 12 is used as M. The LPC analyzer 1 performs LPC analysis using 80 samples of the input signal, 40 pre-read samples and 120 past signal samples, for a total of 240 samples, and obtains the LPC coefficients.
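The all-pole filter of Equation (1) can be illustrated with a minimal Python sketch. The function name `all_pole_filter` and the first-order coefficient used in the demonstration are assumptions made for illustration only; they are not taken from G.729.

```python
def all_pole_filter(excitation, alpha):
    """Filter `excitation` through H(z) = 1 / (1 + sum_i alpha[i-1] * z^-i)."""
    out = []
    for n, e in enumerate(excitation):
        acc = e
        # Subtract the weighted past outputs, as implied by the denominator of H(z).
        for i, a in enumerate(alpha, start=1):
            if n - i >= 0:
                acc -= a * out[n - i]
        out.append(acc)
    return out

# Frame length from the text: 8 kHz sampling and a 10 ms frame.
SAMPLES_PER_FRAME = 8000 * 10 // 1000

# Hypothetical first-order example (a real codec would use order 10 to 12).
impulse = [1.0, 0.0, 0.0, 0.0]
response = all_pole_filter(impulse, [-0.5])
```

With the single coefficient -0.5, the impulse response decays by half at every sample, which shows how the coefficients shape the spectral envelope of the output.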
A parameter converter 2 converts the LPC coefficients to LSP (Line Spectrum Pair) parameters. An LSP parameter is a frequency-domain parameter that can be converted to and from LPC coefficients. Since LSP parameters have better quantization characteristics than LPC coefficients, quantization is performed in the LSP domain. An LSP quantizer 3 quantizes an LSP parameter obtained by the conversion and obtains an LSP code and an LSP dequantized value. An LSP interpolator 4 obtains an LSP interpolated value from the LSP dequantized value found in the present frame and the LSP dequantized value found in the previous frame. More specifically, one frame is divided into two subframes, namely first and second subframes, of 5 ms each, and the LPC analyzer 1 determines the LPC coefficients of the second subframe but not of the first subframe. Using the LSP dequantized value found in the present frame and the LSP dequantized value found in the previous frame, the LSP interpolator 4 predicts the LSP dequantized value of the first subframe by interpolation.
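The interpolation performed by the LSP interpolator 4 can be sketched as follows. The text does not state the interpolation weights, so the equal weighting used here is an assumption, as are the function name `interpolate_lsp` and the two-element example vectors (actual LSP vectors have as many elements as the filter order).

```python
def interpolate_lsp(prev_dequantized, curr_dequantized, weight=0.5):
    """Estimate the first-subframe LSP values from two dequantized LSP vectors.

    `weight` is the share given to the present frame; 0.5 (a plain average)
    is an assumption, since the text does not specify the weights.
    """
    return [(1.0 - weight) * p + weight * c
            for p, c in zip(prev_dequantized, curr_dequantized)]

# Hypothetical dequantized LSP values for the previous and present frames.
previous_frame = [0.25, 0.5]
present_frame = [0.75, 1.0]
first_subframe = interpolate_lsp(previous_frame, present_frame)
```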
A parameter deconverter 5 converts the LSP dequantized value and the LSP interpolated value to LPC coefficients and sets these coefficients in an LPC synthesis filter 6. In this case, the LPC coefficients converted from the LSP interpolated values in the first subframe of the frame and the LPC coefficients converted from the LSP dequantized values in the second subframe are used as the filter coefficients of the LPC synthesis filter 6. In the description that follows, the “l” in items having a subscript attached to the “l”, e.g., lspi, li(n), . . . , is the letter “l” of the alphabet, not the numeral “1”.
After LSP parameters lspi (i=1, . . . , M) are quantized by vector quantization in the LSP quantizer 3, the quantization indices (LSP codes) are sent to a decoder.
Next, excitation and gain search processing is executed. Excitation and gain are processed on a per-subframe basis. First, an excitation signal is divided into a periodic component and a non-periodic component. An adaptive codebook 7 storing a sequence of past excitation signals is used to quantize the periodic component, and an algebraic codebook or fixed codebook is used to quantize the non-periodic component. Described below will be voice encoding using the adaptive codebook 7 and a fixed codebook 8 as excitation codebooks.
The adaptive codebook 7 is adapted to output N samples of excitation signals (referred to as “periodicity signals”), which are delayed successively by one sample, in association with indices 1 to L, where N represents the number of samples in one subframe. The adaptive codebook 7 has a buffer for storing the periodic component of the latest (L+39) samples. A periodicity signal comprising 1st to 40th samples is specified by index 1, a periodicity signal comprising 2nd to 41st samples is specified by index 2, . . . , and a periodicity signal comprising Lth to (L+39)th samples is specified by index L. In the initial state, the content of the adaptive codebook 7 is such that all signals have amplitudes of zero. In operation, the oldest signals, one subframe in length, are discarded at every subframe so that the excitation signal obtained in the present frame is stored in the adaptive codebook 7.
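The indexing and update of the adaptive codebook described above can be sketched as follows. The function names, the value L=5 and the convention that the oldest samples sit at the front of the buffer are assumptions made for illustration.

```python
SUBFRAME = 40  # samples per subframe (N in the text)

def periodicity_signal(buffer, index):
    """Return the 40-sample periodicity signal selected by a 1-based index."""
    last_index = len(buffer) - SUBFRAME + 1  # equals L for an (L+39)-sample buffer
    if not 1 <= index <= last_index:
        raise ValueError("index must lie in 1..L")
    return buffer[index - 1:index - 1 + SUBFRAME]

def update_codebook(buffer, new_excitation):
    """Discard one subframe from the front (assumed oldest) and append the newest."""
    if len(new_excitation) != SUBFRAME:
        raise ValueError("expected exactly one subframe of excitation")
    return buffer[SUBFRAME:] + list(new_excitation)

L = 5                                # hypothetical number of indices
codebook = [0.0] * (L + 39)          # initial state: all amplitudes zero
codebook = update_codebook(codebook, [float(n) for n in range(40)])
```

After one update the buffer still holds L+39 samples, and index L selects exactly the excitation that was just stored.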
An adaptive-codebook search identifies the periodicity component of the excitation signal using the adaptive codebook 7 storing past excitation signals. That is, a subframe length (=40 samples) of past excitation signals in the adaptive codebook 7 is extracted while changing, one sample at a time, the point at which read-out from the adaptive codebook 7 starts, and the excitation signals are input to the LPC synthesis filter 6 to create a pitch synthesis signal $\beta A P_L$, where $P_L$ represents a past pitch periodicity signal (adaptive excitation vector), which corresponds to delay L, extracted from the adaptive codebook 7, $A$ the impulse response of the LPC synthesis filter 6, and $\beta$ the gain of the adaptive codebook.
An arithmetic unit 9 finds an error power $E_L$ between the input voice $X$ and $\beta A P_L$ in accordance with the following equation:
$$E_L = |X - \beta A P_L|^2 \tag{2}$$
If we let $A P_L$ represent a weighted synthesized output from the adaptive codebook, $R_{pp}$ the autocorrelation of $A P_L$ and $R_{xp}$ the cross-correlation between $A P_L$ and the input signal $X$, then an adaptive excitation vector $P_L$ at a pitch lag $L_{opt}$ for which the error power of Equation (2) is minimum will be expressed by the following equation:
$$P_L = \arg\max \left( R_{xp}^2 / R_{pp} \right) \tag{3}$$
That is, the optimum starting point for read-out from the codebook is that at which the value obtained by normalizing the square of the cross-correlation $R_{xp}$ between the pitch synthesis signal $A P_L$ and the input signal $X$ by the autocorrelation $R_{pp}$ of the pitch synthesis signal is largest. Accordingly, an error-power evaluation unit 10 finds the pitch lag $L_{opt}$ that satisfies Equation (3). Optimum pitch gain $\beta_{opt}$ is given by the following equation:
$$\beta_{opt} = R_{xp} / R_{pp} \tag{4}$$
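The search of Equations (3) and (4) can be sketched in Python as follows. This is a simplified illustration, not the G.729 procedure: the function names are hypothetical, the LPC synthesis filter is represented by a truncated impulse response `h`, and the demonstration uses the trivial response `[1.0]` with a made-up buffer.

```python
def convolve(h, x):
    """Zero-state response of a filter with (truncated) impulse response h."""
    return [sum(h[k] * x[n - k] for k in range(min(n, len(h) - 1) + 1))
            for n in range(len(x))]

def adaptive_search(target, buffer, h, subframe=40):
    """Return (index, gain) maximizing Rxp^2 / Rpp over all read-out points."""
    best = None
    for index in range(1, len(buffer) - subframe + 2):
        p = buffer[index - 1:index - 1 + subframe]   # candidate P_L
        ap = convolve(h, p)                          # pitch synthesis signal A*P_L
        rxp = sum(x * y for x, y in zip(target, ap)) # cross-correlation
        rpp = sum(y * y for y in ap)                 # autocorrelation
        if rpp == 0:
            continue
        score = rxp * rxp / rpp                      # criterion of Equation (3)
        if best is None or score > best[0]:
            best = (score, index, rxp / rpp)         # gain from Equation (4)
    if best is None:
        raise ValueError("codebook contains no usable candidate")
    return best[1], best[2]

# Hypothetical buffer of L+39 = 44 samples with two non-zero entries.
demo_buffer = [0.0] * 44
demo_buffer[10], demo_buffer[25] = 1.0, -2.0
# Target equal to half of the candidate selected by index 3.
demo_target = [0.5 * v for v in demo_buffer[2:42]]
demo_index, demo_gain = adaptive_search(demo_target, demo_buffer, [1.0])
```

Because the target is an exact scaled copy of one candidate, the search recovers that candidate's index and the scale factor as the optimum gain.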
Next, the non-periodic component contained in the excitation signal is quantized using the fixed codebook 8. The latter is constituted by a plurality of pulses of amplitude 1 or −1. By way of example, Table 1 illustrates pulse positions for a case where subframe length is 40 samples.
TABLE 1
G.729A-COMPLIANT FIXED CODEBOOK
PULSE SYSTEM    PULSE POSITION                              POLARITY
i0: 1           m0: 0, 5, 10, 15, 20, 25, 30, 35            s0: +/−
i1: 2           m1: 1, 6, 11, 16, 21, 26, 31, 36            s1: +/−
i2: 3           m2: 2, 7, 12, 17, 22, 27, 32, 37            s2: +/−
i3: 4           m3: 3, 8, 13, 18, 23, 28, 33, 38,           s3: +/−
                    4, 9, 14, 19, 24, 29, 34, 39
The algebraic codebook 8 divides the N (=40) sampling points constituting one subframe into a plurality of pulse-system groups 1 to 4 and, for all combinations obtained by extracting one sampling point m0 to m3 from each of the pulse-system groups, successively outputs, as non-periodic components, pulsed signals having a +1 or a −1 pulse at each sampling point. In this example, basically four pulses are deployed per subframe.
FIG. 42 is a diagram useful in describing sampling points assigned to each of the pulse-system groups 1 to 4.
(1) Eight sampling points 0, 5, 10, 15, 20, 25, 30, 35 are assigned to the pulse-system group 1;
(2) eight sampling points 1, 6, 11, 16, 21, 26, 31, 36 are assigned to the pulse-system group 2;
(3) eight sampling points 2, 7, 12, 17, 22, 27, 32, 37 are assigned to the pulse-system group 3; and
(4) 16 sampling points 3, 4, 8, 9, 13, 14, 18, 19, 23, 24, 28, 29, 33, 34, 38, 39 are assigned to the pulse-system group 4.
Three bits are required to express the sampling points in each of pulse-system groups 1 to 3 and one bit is required to express the sign of a pulse, for a total of four bits per group. Further, four bits are required to express the sampling points in pulse-system group 4 and one bit is required to express the sign of a pulse, for a total of five bits. Accordingly, 17 bits are necessary to specify a pulsed excitation signal output from the fixed codebook 8 having the pulse placement of Table 1, and $2^{17}$ ($=2^4 \times 2^4 \times 2^4 \times 2^5$) types of pulsed excitation signals exist.
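The bit count above can be checked with a short Python sketch. The dictionary `PULSE_POSITIONS` reproduces the pulse placement of Table 1; the names used are chosen for illustration only.

```python
# Candidate sampling points of each pulse system, as listed in Table 1.
PULSE_POSITIONS = {
    "i0": list(range(0, 40, 5)),
    "i1": list(range(1, 40, 5)),
    "i2": list(range(2, 40, 5)),
    "i3": sorted(list(range(3, 40, 5)) + list(range(4, 40, 5))),
}

def bits_for(system):
    """Bits needed for one pulse: position index bits plus one sign bit."""
    position_bits = (len(PULSE_POSITIONS[system]) - 1).bit_length()
    return position_bits + 1

bits_per_system = {name: bits_for(name) for name in PULSE_POSITIONS}
total_bits = sum(bits_per_system.values())
total_signals = 2 ** total_bits
```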
The pulse positions of each of the pulse systems are limited, as illustrated in Table 1. In the fixed codebook search, a combination of pulses for which the error power relative to the input voice is minimized in the reproduced-signal domain is decided from among the combinations of pulse positions of each of the pulse systems. More specifically, with $\beta_{opt}$ as the optimum pitch gain found by the adaptive-codebook search, the output $P_L$ of the adaptive codebook is multiplied by $\beta_{opt}$ and the product is input to an adder 11. At the same time, the pulsed excitation signals are input successively to the adder 11 from the fixed codebook 8 and a pulsed excitation signal is specified that will minimize the difference between the input signal $X$ and a reproduced signal obtained by inputting the adder output to the LPC synthesis filter 6. More specifically, first a target vector $X'$ for a fixed codebook search is generated in accordance with the following equation from the optimum adaptive codebook output $P_L$ and optimum pitch gain $\beta_{opt}$ obtained from the input signal $X$ by the adaptive-codebook search:
$$X' = X - \beta_{opt} A P_L \tag{5}$$
In this example, pulse position and amplitude (sign) are expressed by 17 bits and therefore $2^{17}$ combinations exist. Accordingly, letting $C_k$ represent a kth excitation vector, an excitation vector $C_k$ that will minimize an evaluation-function error power $D$ in the following equation is found by a search of the fixed codebook:
$$D = |X' - G_c A C_k|^2 \tag{6}$$
where $G_c$ represents the gain of the fixed codebook. In the fixed codebook search, the error-power evaluation unit 10 searches for the combination of pulse position and polarity that will afford the largest normalized cross-correlation value ($R_{cx}^2 / R_{cc}$) obtained by normalizing the square of a cross-correlation value $R_{cx}$ between a noise synthesis signal $A C_k$ and the target vector $X'$ by an autocorrelation value $R_{cc}$ of the noise synthesis signal.
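The fixed codebook search can be sketched as an exhaustive search over candidate pulse combinations. The sketch below is a simplified illustration under several assumptions: the function names are hypothetical, the synthesis filter is represented by a truncated impulse response `h`, only candidates with a positive cross-correlation are accepted (so that the resulting gain is positive), and the demonstration uses a reduced candidate set rather than all 2^17 combinations of Table 1.

```python
from itertools import product

def pulse_vector(positions, signs, n=40):
    """Build an n-sample vector with one signed unit pulse per pulse system."""
    vec = [0.0] * n
    for m, s in zip(positions, signs):
        vec[m] += s
    return vec

def convolve(h, x):
    """Zero-state response of a filter with (truncated) impulse response h."""
    return [sum(h[k] * x[n - k] for k in range(min(n, len(h) - 1) + 1))
            for n in range(len(x))]

def fixed_search(target, h, position_sets):
    """Return (positions, signs) maximizing Rcx^2 / Rcc, restricted to Rcx > 0."""
    best_score, best = -1.0, None
    for positions in product(*position_sets):
        for signs in product((1.0, -1.0), repeat=len(positions)):
            ac = convolve(h, pulse_vector(positions, signs))  # noise synthesis signal
            rcx = sum(x * y for x, y in zip(target, ac))
            rcc = sum(y * y for y in ac)
            if rcx > 0 and rcc > 0 and rcx * rcx / rcc > best_score:
                best_score, best = rcx * rcx / rcc, (positions, signs)
    return best

# Reduced candidate sets (two positions per pulse system) keep the demo fast.
demo_sets = [(0, 5), (1, 6), (2, 7), (3, 4)]
demo_target = pulse_vector((5, 1, 7, 4), (1.0, -1.0, 1.0, -1.0))
demo_best = fixed_search(demo_target, [1.0], demo_sets)
```

Passing the full position lists of Table 1 as `position_sets` would cover all combinations, at a correspondingly higher computational cost.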
Gain quantization will be described next. With the G.729 system, fixed codebook gain is not quantized directly. Rather, the adaptive codebook gain $G_a$ ($=\beta_{opt}$) and a correction coefficient $\gamma$ of the fixed codebook gain $G_c$ are vector quantized. The fixed codebook gain $G_c$ and the correction coefficient $\gamma$ are related as follows:
$$G_c = g' \times \gamma$$
where $g'$ represents the gain of the present frame predicted from the logarithmic gains of the four past subframes.
A gain quantizer 12 has a gain quantization table, not shown, for which there are prepared 128 ($=2^7$) combinations of adaptive codebook gain Ga and correction coefficients γ for fixed codebook gain. The method of the gain codebook search includes (1) extracting one set of table values from the gain quantization table with regard to an output vector from the adaptive codebook and an output vector from the fixed codebook and setting these values in gain varying units 13, 14, respectively; (2) multiplying these vectors by gains Ga, Gc using the gain varying units 13, 14, respectively, and inputting the products to the LPC synthesis filter 6; and (3) selecting, by way of the error-power evaluation unit 10, the combination for which the error power relative to the input signal X is smallest.
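Steps (1) to (3) of the gain codebook search can be sketched as follows. The three-entry table, the predicted gain and the short vectors are made-up values for illustration (the actual table has 128 entries and is not reproduced in the text), and the inputs `adaptive_syn` and `fixed_syn` are assumed to be the codebook outputs after passing through the LPC synthesis filter.

```python
def gain_search(target, adaptive_syn, fixed_syn, predicted_gain, table):
    """Return the index of the (Ga, gamma) pair giving the smallest error power."""
    best_index, best_error = None, float("inf")
    for index, (ga, gamma) in enumerate(table):
        gc = predicted_gain * gamma          # Gc = g' x gamma
        error = sum((x - ga * a - gc * c) ** 2
                    for x, a, c in zip(target, adaptive_syn, fixed_syn))
        if error < best_error:
            best_index, best_error = index, error
    return best_index

# Hypothetical gain quantization table of (Ga, gamma) pairs.
demo_table = [(0.5, 1.0), (1.0, 0.5), (1.0, 2.0)]
gain_code = gain_search(
    target=[1.0, 1.0, 1.0],
    adaptive_syn=[1.0, 0.0, 1.0],
    fixed_syn=[0.0, 1.0, 0.0],
    predicted_gain=2.0,
    table=demo_table,
)
```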
A channel multiplexer 15 creates channel data by multiplexing (1) an LSP code, which is the quantization index of the LSP, (2) a pitch-lag code Lopt, which is the quantization index of the adaptive codebook, (3) a noise code, which is a fixed codebook index, and (4) a gain code, which is a quantization index of gain. In actuality, it is necessary to perform channel encoding and packetization processing before transmission to the transmission line.
Decoder Structure and Operation
FIG. 43 is a block diagram illustrating a G.729A-compliant decoder. Channel data received from the channel side is input to a channel demultiplexer 21, which proceeds to separate and output an LSP code, pitch-lag code, noise code and gain code. The decoder decodes speech data based upon these codes. The operation of the decoder will now be described in brief, though parts of the description will be redundant because functions of the decoder are included in the encoder.
Upon receiving the LSP code as an input, an LSP dequantizer 22 applies dequantization and outputs an LSP dequantized value. An LSP interpolator 23 interpolates an LSP dequantized value of the first subframe of the present frame from the LSP dequantized value in the second subframe of the present frame and the LSP dequantized value in the second subframe of the previous frame. Next, a parameter deconverter 24 converts the LSP interpolated value and the LSP dequantized value to LPC synthesis filter coefficients. A G.729A-compliant synthesis filter 25 uses the LPC coefficient converted from the LSP interpolated value in the initial first subframe and uses the LPC coefficient converted from the LSP dequantized value in the ensuing second subframe.
An adaptive codebook 26 outputs a pitch signal of subframe length (=40 samples) from a read-out starting point specified by a pitch-lag code, and a fixed codebook 27 outputs a pulse position and pulse polarity from a read-out position that corresponds to an algebraic code. A gain dequantizer 28 calculates an adaptive codebook gain dequantized value and a fixed codebook gain dequantized value from the gain code applied thereto and sets these values in gain varying units 29, 30, respectively. An adder 31 creates an excitation signal by adding a signal, which is obtained by multiplying the output of the adaptive codebook by the adaptive codebook gain dequantized value, and a signal obtained by multiplying the output of the fixed codebook by the fixed codebook gain dequantized value. The excitation signal is input to the LPC synthesis filter 25. As a result, reproduced voice can be obtained from the LPC synthesis filter 25.
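The operation of the gain varying units 29, 30 and the adder 31 can be sketched as follows. The function name and the numerical values are assumptions made for illustration; the decoding of the codes themselves is omitted.

```python
def decode_excitation(adaptive_vec, pulses, ga, gc, n=40):
    """Form the excitation: ga * adaptive output + gc * fixed codebook output.

    `pulses` is a list of (position, sign) pairs taken from the algebraic code.
    """
    fixed_vec = [0.0] * n
    for position, sign in pulses:
        fixed_vec[position] += sign
    return [ga * a + gc * c for a, c in zip(adaptive_vec, fixed_vec)]

# Hypothetical decoded values for one subframe.
demo_excitation = decode_excitation(
    adaptive_vec=[1.0] * 40,
    pulses=[(0, 1.0), (6, -1.0), (12, 1.0), (39, -1.0)],
    ga=0.5,
    gc=2.0,
)
```

The resulting excitation would then be passed through the synthesis filter, and also stored in the adaptive codebook for use in later subframes.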
In the initial state, the content of the adaptive codebook 26 on the decoder side is such that all signals have amplitudes of zero. In operation, the oldest signals, one subframe in length, are discarded at every subframe so that the excitation signal obtained in the present frame is stored in the adaptive codebook 26. In other words, the adaptive codebook 7 of the encoder and the adaptive codebook 26 of the decoder are always maintained in the identical, latest state.
Digital Watermark Technique
The specification of Japanese Patent Application Laid-Open No. 11-272299 discloses a “Method of Embedding Watermark Bits when Encoding Voice” as a digital watermark technique to which CELP is applied. FIG. 44 is a diagram useful in describing such a digital watermark technique. In Table 1, refer to the fourth pulse system i3. Unlike the pulse positions m0 to m2 of the first to third pulse systems i0 to i2, the pulse position m3 of the fourth pulse system i3 has mutually adjacent candidates. In accordance with the G.729 standard, it does not matter which of two adjacent pulse positions is selected in the fourth pulse system i3. For example, pulse position m3=4 in the fourth pulse system i3 may be replaced with pulse position m3′=3, and there will be almost no influence upon the human sense of hearing even if encoded voice code is reproduced following such substitution. Accordingly, an 8-bit key Kp is introduced in order to label the m3 candidates. For example, as shown in FIG. 44, Kp=00001111 is set and candidates 3, 8, 13, 18, 23, 28, 33, 38 of m3 are mapped to respective ones of the bits of Kp, while the bit-inverted key *Kp=11110000 is formed and candidates 4, 9, 14, 19, 24, 29, 34, 39 of m3 are mapped to respective ones of the bits of *Kp. If mapping is performed in this manner, all of the candidates of m3 can be labeled “0” or “1” in accordance with the key Kp. If a watermark bit “0” is to be embedded in encoded voice code under these conditions, m3 is selected from candidates that have been labeled “0” in accordance with the key Kp. If a watermark bit “1” is to be embedded, on the other hand, m3 is selected from candidates that have been labeled “1” in accordance with the key Kp. This method makes it possible to embed binarized watermark information in encoded voice code. Accordingly, by furnishing both the transmitter and receiver with the key Kp, it is possible to embed and extract watermark information.
Since 1-bit watermark information can be embedded every 5-ms subframe, 200 bits can be embedded per second.
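The labeling, embedding and extraction described above can be sketched in Python as follows. The function names are hypothetical, and the sketch simplifies the embedding step: instead of restricting the codebook search to suitably labeled candidates, it takes an already selected m3 and, where necessary, substitutes the adjacent candidate, in the manner of the m3=4 to m3′=3 example.

```python
KP = "00001111"  # example key from the text

def label(m3, kp=KP):
    """Return the 0/1 label of an m3 candidate under key kp."""
    pair, offset = divmod(m3, 5)     # offset 3 -> bit of Kp, offset 4 -> bit of *Kp
    if offset not in (3, 4) or not 0 <= pair < len(kp):
        raise ValueError("not a candidate of the fourth pulse system")
    bit = int(kp[pair])
    return bit if offset == 3 else 1 - bit

def embed(m3, watermark_bit, kp=KP):
    """Return m3, or its adjacent candidate, so that the label equals the bit."""
    if label(m3, kp) == watermark_bit:
        return m3
    return m3 + 1 if m3 % 5 == 3 else m3 - 1

def extract(m3, kp=KP):
    """Recover the embedded bit from the received pulse position."""
    return label(m3, kp)

# One bit per 5 ms subframe gives the embedding rate stated in the text.
BITS_PER_SECOND = 1000 // 5
```

Because the two candidates of each adjacent pair always carry opposite labels, either bit value can be embedded by moving the pulse by at most one sample.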
If watermark information is embedded in all codes using the same key Kp, there is a good possibility of decryption by an unauthorized third party. This makes it necessary to enhance concealment. If the total value of m0 to m3 is represented by Cp, the total value will be any of the 58 values shown at (a) of FIG. 45. Accordingly, a second key Kcon of 58 bits is introduced and the 58 total values Cp are mapped to respective ones of the bits of this key, as illustrated at (b) in FIG. 45. The total value (72 in FIG. 45) of m0 to m3 in the noise code obtained when voice has been encoded is calculated and it is determined whether a bit value Cpb of the key Kcon corresponding to this total value is “0” or “1”. If Cpb=“1” holds, a watermark bit is embedded in the encoded voice code in accordance with FIG. 44. If Cpb=“0” holds, a watermark bit is not embedded. If this arrangement is adopted, a third party who does not know the key Kcon would find it difficult to decrypt the watermark information.
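The count of 58 total values and the gating by the key Kcon can be checked with the following sketch. The mapping of total values to key bits in ascending order and the alternating example key are assumptions, since FIG. 45 is not reproduced here; the function name is likewise hypothetical.

```python
from itertools import product

# Candidate positions of the four pulse systems, as listed in Table 1.
GROUPS = [
    range(0, 40, 5),
    range(1, 40, 5),
    range(2, 40, 5),
    sorted([*range(3, 40, 5), *range(4, 40, 5)]),
]

# Every distinct total value Cp = m0 + m1 + m2 + m3, in ascending order.
TOTALS = sorted({sum(combo) for combo in product(*GROUPS)})

def embedding_enabled(positions, kcon):
    """Return True if the Kcon bit mapped to the total of `positions` is '1'."""
    if len(kcon) != len(TOTALS):
        raise ValueError("Kcon must have one bit per total value")
    return kcon[TOTALS.index(sum(positions))] == "1"

# Hypothetical 58-bit key with alternating bits.
DEMO_KCON = "10" * 29
enabled_71 = embedding_enabled((15, 16, 17, 23), DEMO_KCON)  # total 71
enabled_72 = embedding_enabled((15, 16, 17, 24), DEMO_KCON)  # total 72
```

With this example key, a noise code whose positions total 71 carries a watermark bit while one totalling 72 does not, so a third party without Kcon cannot tell which subframes hold embedded data.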
In cases where other media are transmitted on channels that are independent of the voice channel, basically it is required that the terminals at both ends provide multichannel support. A problem which arises in such cases is that limitations are imposed at the terminals connected to a conventional communications network. This is true with regard to 2nd generation mobile telephones, for example, which presently are in most widespread use. Further, even if the terminals at both ends offer multichannel support and make it possible to transmit a plurality of media, routes have a random nature in the case of packet switching, making it difficult to achieve synchronization and linkage at repeaters along the way. A particular problem is that complicated control such as route setting and synchronization processing is required for linkage that employs data accompanying voice per se issued by a specific user.
With the conventional digital watermark technique, use of a key is essential. In addition, the target of embedded data is limited to a pulse position in the fourth pulse system of the fixed codebook. As a consequence, there is a good possibility that the existence of the key will become known to the user. If the user becomes aware of the key, the user can specify the embedded position. This leads to the possibility of leakage and falsification of data.
Further, with the conventional digital watermark technique, since the foregoing is “probability-based” control in which execution or non-execution of data embedding depends upon the total value of pulse position candidates, there is a possibility that the sound-quality degrading effect of embedding of data will be significant. There is need for a data embedding technique as a communication standard in which the embedding of data is concealed, i.e., in which there is no decline in the sound quality of the reproduced voice when decoding is performed at a terminal. However, since the prior-art technique results in degraded sound quality, it has not been able to satisfy this need.