Voice encoding and decoding methods with a voice information rate of 1.6 kbps, which are presented in Patent Literature 1 and Non-Patent Literature 1, will be described as prior art by using FIG. 1 to FIG. 9.
A configuration of a conventional system voice encoder is shown in FIG. 1. A framer 111 is a buffer which stores an input voice sample (a1) which is bandlimited at 100 to 3800 Hz, thereafter is sampled at 8 kHz and is quantized with an accuracy of at least 12 bits and it fetches the voice samples (160 samples) per 1 voice encoding frame (20 ms) and outputs them to a voice encoding processing unit as (b1). In the following, processing which is executed per 1 voice encoding frame will be described.
A gain calculator 112 calculates a logarithm of an RMS (Root Mean Square) value which is level information of (b1) and outputs (c1) which is a result thereof. A quantizer 1_113 linearly quantizes (c1) with 5 bits and outputs (d1) which is a result thereof to a bit packing device 125. A linear prediction analyzer 114 performs linear prediction analysis on (b1) using a Durbin-Levinson method and outputs a 10th-order linear prediction coefficient (e1) which is spectrum envelope information.
An LSF coefficient calculator 115 converts the 10th-order linear prediction coefficient (e1) into a 10th-order LSF (Line Spectrum Frequencies) coefficient (f1).
A quantizer 2_116 is configured to use multi-stage vector quantization of 3 stages (7, 6, 5 bits) and to switchingly use memoryless vector quantization and prediction (memory) vector quantization, and quantizes the 10th-order LSF coefficient (f1) with 19 (=1+7+6+5) bits by allocating 1 bit to switching thereof and outputs an LSF parameter index (g1) which is a result thereof to the bit packing device 125. An LPF (low-pass filter) 120 filters (b1) at a cutoff frequency of 1000 Hz and outputs (k1). A pitch detector 121 obtains a pitch period from (k1) and outputs it as (m1).
Although the pitch period is given as a delay amount that a normalized autocorrelation function is maximized, a maximum value (l1) of the normalized autocorrelation function at that time is also output. The magnitude of the maximum value of the normalized autocorrelation function is information which indicates the strength of periodicity of the input signal (b1) and is used in an aperiodic flag generator 122 which will be described later.
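The pitch search described above can be sketched as follows. This is a minimal illustration only: the exact definition of the normalized autocorrelation function, the analysis window length and the names `normalized_autocorrelation` and `detect_pitch` are assumptions that are not given in the text, and the input is assumed to be the low-pass filtered signal (k1).

```python
import math

def normalized_autocorrelation(x, lag):
    """Normalized autocorrelation of x at the given lag (assumed definition)."""
    n = len(x) - lag
    num = sum(x[i] * x[i + lag] for i in range(n))
    e0 = sum(x[i] * x[i] for i in range(n))
    e1 = sum(x[i + lag] * x[i + lag] for i in range(n))
    den = math.sqrt(e0 * e1)
    return num / den if den > 0.0 else 0.0

def detect_pitch(x, min_lag=20, max_lag=160):
    """Return (pitch period, maximum normalized autocorrelation).

    The pitch period is the delay that maximizes the normalized
    autocorrelation; the search range of 20 to 160 samples follows the
    pitch range stated for the quantizer 3_123.
    """
    best_lag, best_r = min_lag, -1.0
    for lag in range(min_lag, max_lag + 1):
        r = normalized_autocorrelation(x, lag)
        if r > best_r:
            best_lag, best_r = lag, r
    return best_lag, best_r
```

The second return value plays the role of (l1), which the later blocks use as a measure of the strength of periodicity.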
In addition, the maximum value (l1) of the normalized autocorrelation function is corrected by a correlation coefficient corrector 119 which will be described later and then is used for voiced/voiceless decision by a voiced/voiceless decider 126. There, when a maximum value (j1) of the normalized autocorrelation function after correction is not more than a threshold value (=0.6), it is decided to be voiceless and when it is not so, it is decided to be voiced and a voiced/voiceless flag (s1) which is a result thereof is output. Here, the voiced/voiceless flag corresponds to the low frequency band voiced/voiceless discrimination information in claims. A quantizer 3_123 inputs (m1) and performs logarithmic transformation thereon and thereafter linearly quantizes it at 99 levels and outputs a pitch index (o1) which is a result thereof to a periodic/aperiodic pitch and voiced/voiceless information code generator 127.
FIG. 2 is a diagram showing a relation between the pitch period and the index in a conventional system.
The relation between the pitch period (taking a range of 20 to 160 samples) which is an input into the quantizer 3_123 and the index value (taking a range of 0 to 98) which is an output therefrom is shown in FIG. 2.
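A possible form of the quantizer 3_123 is sketched below. The text states only that the pitch period (20 to 160 samples) is logarithmically transformed and then linearly quantized at 99 levels; the exact curve of FIG. 2, the rounding rule and the function names are assumptions made for illustration.

```python
import math

MIN_PITCH, MAX_PITCH, LEVELS = 20, 160, 99

def pitch_to_index(pitch):
    """Uniformly quantize log(pitch) over [log 20, log 160] to an index 0..98."""
    pitch = min(max(pitch, MIN_PITCH), MAX_PITCH)
    span = math.log(MAX_PITCH) - math.log(MIN_PITCH)
    pos = (math.log(pitch) - math.log(MIN_PITCH)) / span
    return int(round(pos * (LEVELS - 1)))

def index_to_pitch(index):
    """Inverse mapping: reconstruct the pitch period for an index 0..98."""
    span = math.log(MAX_PITCH) - math.log(MIN_PITCH)
    return math.exp(math.log(MIN_PITCH) + span * index / (LEVELS - 1))
```

Quantizing in the logarithmic domain gives finer steps for short pitch periods and coarser steps for long ones.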
The aperiodic flag generator 122 inputs the maximum value (l1) of the normalized autocorrelation function, sets an aperiodic flag ON when it is smaller than a threshold value (=0.5) and sets it OFF when it is not so and outputs the aperiodic flag (1 bit) (n1) to the aperiodic pitch index generator 124 and the periodic/aperiodic pitch and voiced/voiceless information code generator 127. When the aperiodic flag (n1) is ON, it means that a current frame is a sound source having aperiodicity. An LPC analysis filter 117 is an all-zero filter which uses the 10th-order linear prediction coefficient (e1) as a coefficient and removes the spectrum envelope information from the input signal (b1) and outputs a residual signal (h1) which is a result thereof. A peakiness calculator 118 inputs the residual signal (h1), calculates a peakiness value and outputs it as (i1). The peakiness value is a parameter which indicates the possibility of presence of a pulsed component (a spike) having a peak in the signal and is given by (Formula 1).
[Numerical Formula 1]
$$\text{Peakiness value } p = \frac{\sqrt{\dfrac{1}{N}\sum_{n=1}^{N} e_n^{2}}}{\dfrac{1}{N}\sum_{n=1}^{N} \left| e_n \right|} \qquad \text{(Formula 1)}$$
Here, N is the number of samples in 1 frame and e_n is the residual signal. Since the numerator of (Formula 1) is more strongly influenced by large values than the denominator is, p has a large value when there exists a large spike in the residual signal. Accordingly, the larger the peakiness value is, the higher the possibility that the frame is a voiced frame having jitters, which are often observed in a transient part, or a plosive frame (because in these frames, although part of the signal has the spike (a sharp peak), the other part has a property which is close to that of white noise).
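The peakiness calculation can be illustrated with the short sketch below. It reads (Formula 1) as the ratio of the root mean square of the residual to its mean absolute value; that reading and the function name `peakiness` are assumptions made for illustration.

```python
import math

def peakiness(residual):
    """Ratio of the RMS value to the mean absolute value of one residual frame."""
    n = len(residual)
    rms = math.sqrt(sum(e * e for e in residual) / n)
    mean_abs = sum(abs(e) for e in residual) / n
    # Guard against an all-zero frame (behavior for that case is an assumption).
    return rms / mean_abs if mean_abs > 0.0 else 0.0
```

For a residual of constant magnitude the value is 1.0, while a frame of N samples that contains a single spike gives the square root of N, which shows why a spike drives the value up.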
When the peakiness value (i1) is larger than “1.34”, the correlation coefficient corrector 119 sets the maximum value (l1) of the normalized autocorrelation function to “1.0” (indicating the voiced one) and outputs (j1). Calculation of the peakiness value and correlation function correction processing are processing adapted to detect the voiced frame having the jitters or the plosive frame and correct the maximum value of the normalized autocorrelation function to “1.0” (the value indicating the voiced one).
Although the voiced frame having the jitters or the plosive frame partially has the spike (the sharp peak), the other part is in the form of the signal of the property which is close to that of the white noise and therefore the possibility that the normalized autocorrelation function before correction becomes smaller than “0.5” is large (that is, the possibility that the aperiodic flag is set ON is large). On the other hand, the peakiness value becomes large. Accordingly, when the voiced frame having the jitters or the plosive frame is detected in accordance with the peakiness value and the normalized autocorrelation function is corrected to “1.0”, it is decided to be voiced in later voiced/voiceless decision by the voiced/voiceless decider 126 and an aperiodic pulse is used in the sound source when decoding and therefore the sound quality of the voiced frame having the jitters or the plosive frame is improved.
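The interplay of the three thresholds described above (0.5 for the aperiodic flag, 1.34 for the peakiness value and 0.6 for the voiced/voiceless decision) can be summarized in a small sketch. The function and constant names are hypothetical; only the threshold values and the order of operations are taken from the description.

```python
PEAKINESS_THRESHOLD = 1.34   # correlation coefficient corrector 119
VOICED_THRESHOLD = 0.6       # voiced/voiceless decider 126
APERIODIC_THRESHOLD = 0.5    # aperiodic flag generator 122

def classify_frame(max_corr, peakiness_value):
    """Return (voiced_flag, aperiodic_flag) for one frame.

    max_corr is the uncorrected maximum (l1) of the normalized
    autocorrelation function.
    """
    # The aperiodic flag uses the value before correction.
    aperiodic = max_corr < APERIODIC_THRESHOLD
    # A large peakiness value forces the corrected value (j1) to 1.0.
    corrected = 1.0 if peakiness_value > PEAKINESS_THRESHOLD else max_corr
    # Voiceless when the corrected value is not more than 0.6.
    voiced = corrected > VOICED_THRESHOLD
    return voiced, aperiodic
```

A frame with weak correlation but a large peakiness value therefore comes out as voiced with the aperiodic flag ON, which is the case that uses the aperiodic pulse sound source in the decoder.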
An aperiodic pitch index generator 124 non-uniformly quantizes the pitch period (m1) in the aperiodic frame at 28 levels and outputs an index (p1). The details of the processing thereof will be shown in the following. First, a result that the frequency of the pitch period has been examined for the frame (corresponding to the voiced frame having the jitters in the transient part or the plosive frame) that the voiced/voiceless flag (s1) is set to the voiced one and the aperiodic flag (n1) is set ON is shown in FIG. 3 and the cumulative frequency thereof is shown in FIG. 4.
FIG. 3 is a diagram showing the frequency of the pitch period of the conventional system. FIG. 4 is a diagram showing the cumulative frequency of the pitch period of the conventional system.
FIG. 3 and FIG. 4 are results of measurement of voice data from four men and four women (6 voice samples per person) which adds up to 112.12 [s] (5606 frames). As the frame which satisfies the above-described conditions (the voiced/voiceless flag (s1) is the voiced one and the aperiodic flag (n1) is ON), there existed 425 frames in 5606 frames. It is seen from FIG. 3 that a distribution of the pitch period in the frame (hereinafter, referred to as the aperiodic frame) which satisfies that condition is concentrated around 25 to 100. Accordingly, it can be highly efficiently transmitted by performing nonuniform quantization based on the frequency (the appearance frequency), that is, by quantizing more finely the pitch period which is larger in frequency and more roughly the pitch period which is smaller in it. In addition, the pitch period of the aperiodic frame is calculated from (Formula 2) in a decoder.
Pitch period of aperiodic frame=Transmitted pitch period×(1.0+0.25×Random number value)  (Formula 2)
The transmitted pitch period in (Formula 2) is the pitch period which is transmitted in accordance with an index which is an output from the aperiodic pitch index generator 124, and the jitter is added per pitch period by multiplying it by (1.0+0.25×the random number value). Accordingly, the larger the pitch period is, the more the amount of the jitters is increased and therefore rough quantization is allowed. A quantization table for the pitch period of the aperiodic frame which is based on the above is shown in Table 1. In Table 1, the input pitch period which is within a range from 20 to 24 is quantized at 1 level, the one which is within a range from 25 to 50 is quantized at 13 levels (2 steps in width), the one which is within a range from 51 to 95 is quantized at 9 levels (5 steps in width), the one which is within a range from 96 to 135 is quantized at 4 levels (10 steps in width) and the one which is within a range from 136 to 160 is quantized at 1 level and the indexes (Aperiodic 0 to 27) are output. 64 levels or more are necessary for quantization of a general pitch period. On the other hand, as for quantization of the pitch period of the aperiodic frame, it becomes possible to quantize it at 28 levels by taking the frequency and the decoding method into consideration.
TABLE 1
Pitch period of      Pitch period of aperiodic     Index
aperiodic frame      frame after quantization
20-24                24                            Aperiodic 0
25, 26               26                            Aperiodic 1
27, 28               28                            Aperiodic 2
29, 30               30                            Aperiodic 3
31, 32               32                            Aperiodic 4
33, 34               34                            Aperiodic 5
35, 36               36                            Aperiodic 6
37, 38               38                            Aperiodic 7
39, 40               40                            Aperiodic 8
41, 42               42                            Aperiodic 9
43, 44               44                            Aperiodic 10
45, 46               46                            Aperiodic 11
47, 48               48                            Aperiodic 12
49, 50               50                            Aperiodic 13
51-55                55                            Aperiodic 14
56-60                60                            Aperiodic 15
61-65                65                            Aperiodic 16
66-70                70                            Aperiodic 17
71-75                75                            Aperiodic 18
76-80                80                            Aperiodic 19
81-85                85                            Aperiodic 20
86-90                90                            Aperiodic 21
91-95                95                            Aperiodic 22
96-105               100                           Aperiodic 23
106-115              110                           Aperiodic 24
116-125              120                           Aperiodic 25
126-135              130                           Aperiodic 26
136-160              140                           Aperiodic 27
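The mapping of Table 1 can be expressed compactly in code. The sketch below is illustrative only: the function name is hypothetical, and the last row is given index 27 in line with the stated index range of Aperiodic 0 to 27.

```python
def quantize_aperiodic_pitch(pitch):
    """Return (index, quantized pitch period) for a pitch period of 20..160."""
    if not 20 <= pitch <= 160:
        raise ValueError("pitch period must be within 20 to 160 samples")
    if pitch <= 24:                       # 1 level
        return 0, 24
    if pitch <= 50:                       # 13 levels, 2 steps in width
        index = 1 + (pitch - 25) // 2
        return index, 26 + 2 * (index - 1)
    if pitch <= 95:                       # 9 levels, 5 steps in width
        index = 14 + (pitch - 51) // 5
        return index, 55 + 5 * (index - 14)
    if pitch <= 135:                      # 4 levels, 10 steps in width
        index = 23 + (pitch - 96) // 10
        return index, 100 + 10 * (index - 23)
    return 27, 140                        # 1 level
```

Over the whole input range the function produces 28 distinct indexes, matching the 1 + 13 + 9 + 4 + 1 levels listed above.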
The periodic/aperiodic pitch and voiced/voiceless information code generator 127 inputs the voiced/voiceless flag (s1), the aperiodic flag (n1), the pitch index (o1), the aperiodic pitch index (p1) and outputs a 7-bit (128-level) periodic/aperiodic pitch-voiced/voiceless information code (t1). Processing performed here will be described in the following.
In a case where the voiced/voiceless flag (s1) shows the voiceless one, the codeword in which all 7 bits are 0 is allocated from the 7-bit code (having 128 kinds of codewords). In a case where the flag shows the voiced one, the remaining codewords (127 kinds) are allocated to the pitch indexes (o1) or the aperiodic pitch indexes (p1) on the basis of the aperiodic flag (n1). When the aperiodic flag (n1) is ON, the codewords (28 kinds) in which exactly 1 bit or 2 bits of the 7 bits are 1 are allocated to the aperiodic pitch indexes (p1) (Aperiodic 0 to 27). The other codewords (99 kinds) are allocated to the periodic pitch indexes (Periodic 0 to 98). A generation table for the periodic/aperiodic pitch-voiced/voiceless information codes which is based on the above is shown in Table 2.
In general, in a case where an error occurs in the voiced/voiceless information due to a transmission error and the voiceless frame is erroneously decoded as the voiced frame, the periodic sound source is used and therefore the quality of the reproduced voice is remarkably deteriorated. By allocating the aperiodic pitch indexes (p1) (Aperiodic 0 to 27) to the codewords (28 kinds) in which exactly 1 bit or 2 bits of the 7 bits are 1, the sound source signal is made by an aperiodic pitch pulse even when a 1-bit or 2-bit error occurs in the voiceless codeword (0x0), and it is therefore possible to reduce the influence of the transmission error.
TABLE 2
Code   Index          Code   Index          Code   Index          Code   Index
0x0    Voiceless      0x20   Aperiodic 15   0x40   Aperiodic 21   0x60   Aperiodic 27
0x1    Aperiodic 0    0x21   Aperiodic 16   0x41   Aperiodic 22   0x61   Periodic 68
0x2    Aperiodic 1    0x22   Aperiodic 17   0x42   Aperiodic 23   0x62   Periodic 69
0x3    Aperiodic 2    0x23   Periodic 16    0x43   Periodic 42    0x63   Periodic 70
0x4    Aperiodic 3    0x24   Aperiodic 18   0x44   Aperiodic 24   0x64   Periodic 71
0x5    Aperiodic 4    0x25   Periodic 17    0x45   Periodic 43    0x65   Periodic 72
0x6    Aperiodic 5    0x26   Periodic 18    0x46   Periodic 44    0x66   Periodic 73
0x7    Periodic 0     0x27   Periodic 19    0x47   Periodic 45    0x67   Periodic 74
0x8    Aperiodic 6    0x28   Aperiodic 19   0x48   Aperiodic 25   0x68   Periodic 75
0x9    Aperiodic 7    0x29   Periodic 20    0x49   Periodic 46    0x69   Periodic 76
0xA    Aperiodic 8    0x2A   Periodic 21    0x4A   Periodic 47    0x6A   Periodic 77
0xB    Periodic 1     0x2B   Periodic 22    0x4B   Periodic 48    0x6B   Periodic 78
0xC    Aperiodic 9    0x2C   Periodic 23    0x4C   Periodic 49    0x6C   Periodic 79
0xD    Periodic 2     0x2D   Periodic 24    0x4D   Periodic 50    0x6D   Periodic 80
0xE    Periodic 3     0x2E   Periodic 25    0x4E   Periodic 51    0x6E   Periodic 81
0xF    Periodic 4     0x2F   Periodic 26    0x4F   Periodic 52    0x6F   Periodic 82
0x10   Aperiodic 10   0x30   Aperiodic 20   0x50   Aperiodic 26   0x70   Periodic 83
0x11   Aperiodic 11   0x31   Periodic 27    0x51   Periodic 53    0x71   Periodic 84
0x12   Aperiodic 12   0x32   Periodic 28    0x52   Periodic 54    0x72   Periodic 85
0x13   Periodic 5     0x33   Periodic 29    0x53   Periodic 55    0x73   Periodic 86
0x14   Aperiodic 13   0x34   Periodic 30    0x54   Periodic 56    0x74   Periodic 87
0x15   Periodic 6     0x35   Periodic 31    0x55   Periodic 57    0x75   Periodic 88
0x16   Periodic 7     0x36   Periodic 32    0x56   Periodic 58    0x76   Periodic 89
0x17   Periodic 8     0x37   Periodic 33    0x57   Periodic 59    0x77   Periodic 90
0x18   Aperiodic 14   0x38   Periodic 34    0x58   Periodic 60    0x78   Periodic 91
0x19   Periodic 9     0x39   Periodic 35    0x59   Periodic 61    0x79   Periodic 92
0x1A   Periodic 10    0x3A   Periodic 36    0x5A   Periodic 62    0x7A   Periodic 93
0x1B   Periodic 11    0x3B   Periodic 37    0x5B   Periodic 63    0x7B   Periodic 94
0x1C   Periodic 12    0x3C   Periodic 38    0x5C   Periodic 64    0x7C   Periodic 95
0x1D   Periodic 13    0x3D   Periodic 39    0x5D   Periodic 65    0x7D   Periodic 96
0x1E   Periodic 14    0x3E   Periodic 40    0x5E   Periodic 66    0x7E   Periodic 97
0x1F   Periodic 15    0x3F   Periodic 41    0x5F   Periodic 67    0x7F   Periodic 98
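The allocation in Table 2 appears to follow a simple rule: the codewords are visited in ascending order, the all-zero codeword is Voiceless, codewords with exactly 1 or 2 set bits receive the next aperiodic index, and all others receive the next periodic index. The sketch below generates the table under that assumption; the function name and the tuple representation are hypothetical.

```python
def build_code_table():
    """Map each 7-bit codeword to 'Voiceless', ('Aperiodic', k) or ('Periodic', k)."""
    table = {}
    aperiodic = periodic = 0
    for code in range(128):
        ones = bin(code).count("1")   # number of bits set to 1
        if ones == 0:
            table[code] = "Voiceless"
        elif ones <= 2:
            table[code] = ("Aperiodic", aperiodic)
            aperiodic += 1
        else:
            table[code] = ("Periodic", periodic)
            periodic += 1
    return table
```

Because every aperiodic codeword lies within a Hamming distance of 2 from the voiceless codeword, a voiceless frame hit by a 1-bit or 2-bit error decodes as an aperiodic frame rather than a periodic one.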
An HPF (high-pass filter) 128 filters (b1) at a cutoff frequency of 1000 Hz and outputs a high frequency component (the component of at least 1000 Hz) (u1). A correlation coefficient calculator 129 calculates the normalized autocorrelation function (v1) of (u1) at the delay amount which is given by the pitch period (m1) and outputs it. A voiced/voiceless decider 130 decides the frame to be voiceless when the normalized autocorrelation function (v1) is not more than the threshold value (=0.5) and decides it to be voiced when it is not so and outputs a high range voiced/voiceless flag (w1) which is a result thereof. Here, the high range voiced/voiceless flag corresponds to the high frequency band voiced/voiceless discrimination information in claims.
The bit packing device 125 inputs the quantized RMS value (the gain information) (d1), the LSF parameter index (g1), the periodic/aperiodic pitch-voiced/voiceless information code (t1) and the high range voiced/voiceless flag (w1) and outputs a voice information bit string (q1) of 32 bits per 1 frame (20 ms) (Table 3).
TABLE 3
Parameter                                                      Number of bits
LSF parameter                                                  19
Gain/frame                                                     5
Periodic/aperiodic pitch-voiced/voiceless information code     7
High range voiced/voiceless flag                               1
Total bits/20 ms frame                                         32
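The bit allocation of Table 3 can be illustrated with a small packing sketch. The order of the fields within the 32-bit string is not stated in the text, so the order used here (LSF, gain, pitch code, high range flag, most significant first) and the function names are assumptions.

```python
BITS_PER_FRAME = 19 + 5 + 7 + 1      # 32 bits per 20 ms frame
FRAMES_PER_SECOND = 50               # 1 s / 20 ms

def pack_frame(lsf_index, gain_index, pitch_code, high_band_voiced):
    """Pack the four parameters into one 32-bit integer (assumed field order)."""
    fields = ((lsf_index, 19), (gain_index, 5), (pitch_code, 7), (high_band_voiced, 1))
    word = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError("field value does not fit in its bit width")
        word = (word << width) | value
    return word

def unpack_frame(word):
    """Inverse of pack_frame: return (lsf, gain, pitch code, high range flag)."""
    high_band_voiced = word & 0x1
    pitch_code = (word >> 1) & 0x7F
    gain_index = (word >> 8) & 0x1F
    lsf_index = (word >> 13) & 0x7FFFF
    return lsf_index, gain_index, pitch_code, high_band_voiced
```

With 32 bits every 20 ms, the resulting rate is 32 × 50 = 1600 bits per second, which is the 1.6 kbps voice information rate of the conventional system.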
Next, a configuration of a conventional voice decoder will be described by using FIG. 5. FIG. 5 is a diagram showing one example of the conventional system voice decoder.
A bit separator (131) separates a 32-bit voice information bit string (a2) which is received per 1 frame into each parameter and outputs a periodic/aperiodic pitch-voiced/voiceless information code (b2), a high range voiced/voiceless flag (f2), gain information (m2) and an LSF parameter index (h2). A voiced/voiceless information-pitch period decoder 132 inputs the periodic/aperiodic pitch-voiced/voiceless information code (b2), seeks which one of Voiceless/Periodic/Aperiodic is indicated on the basis of Table 2, sets a pitch period (c2) to “50” and sets the voiced/voiceless flag (d2) to “0” when Voiceless is indicated and outputs them.
In a case of Periodic and Aperiodic, it performs decoding processing on the pitch period (c2) (in a case of Aperiodic, Table 1 is used) and outputs it and sets the voiced/voiceless flag (d2) to “1.0” and outputs it.
A jitter setter 133 inputs the periodic/aperiodic pitch-voiced/voiceless information code (b2), seeks which one of Voiceless/Periodic/Aperiodic is indicated on the basis of Table 2 and in a case where Voiceless or Aperiodic is indicated, sets a jitter value (e2) to “0.25” and outputs it. In a case where Periodic is indicated, it sets the jitter value (e2) to “0” and outputs it.
An LSF decoder 138 decodes a 10th-order LSF coefficient (i2) from the LSF parameter index (h2) and outputs it. An inclination correction coefficient calculator 137 calculates an inclination correction coefficient (j2) from the 10th-order LSF coefficient (i2). The inclination correction coefficient is a coefficient adapted to correct inclination of a spectrum and to reduce muffling of a sound in an adaptive spectrum enhancement filter 145 which will be described later.
A gain decoder 139 decodes gain information (m2) and outputs a gain (n2). A linear prediction coefficient calculator 1_136 converts the LSF coefficient (i2) into a linear prediction coefficient and outputs a linear prediction coefficient (k2).
A spectrum envelope amplitude calculator 135 calculates a spectrum envelope amplitude (l2) from the linear prediction coefficient (k2). Here, the voiced/voiceless flag (d2), the high range voiced/voiceless flag (f2) respectively correspond to the low frequency band voiced/voiceless discrimination information, the high frequency band voiced/voiceless discrimination information in claims.
In the following, a configuration of a pulse sound source/noise sound source mixing ratio calculator 134 will be described using FIG. 6.
FIG. 6 shows the configuration of the pulse sound source/noise sound source mixing ratio calculator and it inputs the voiced/voiceless flag (d2), the spectrum envelope amplitude (l2) and the high range voiced/voiceless flag (f2) in FIG. 5 and determines and outputs a mixing ratio (g2) in each band (sub-band).
In mixing ratio determination in FIG. 6 and decoding processing in FIG. 5, it is divided into 4 bands on a frequency axis and the mixing ratio of the pulse sound source to the noise sound source and a mixed signal thereof are obtained in each band. As the 4 bands, a sub-band 1 (0 to 1000 Hz), a sub-band 2 (1000 to 2000 Hz), a sub-band 3 (2000 to 3000 Hz) and a sub-band 4 (3000 to 4000 Hz) are set. The sub-band 1 corresponds to a low frequency band and the sub-bands 2, 3, 4 respectively correspond to respective bands of high frequencies.
A sub-band 1 voiced strength setter 160 in FIG. 6 inputs the voiced/voiceless flag (d2) and sets a voiced strength (a4) of the sub-band 1. Here, when the voiced/voiceless flag (d2) is “1.0”, the voiced strength (a4) is set to “1.0” and when the voiced/voiceless flag (d2) is “0”, the voiced strength (a4) is set to “0”. A sub-bands 2, 3, 4 average amplitude calculator 161 inputs the spectrum envelope amplitude (l2), calculates average values of the spectrum envelope amplitudes in the sub-bands 2, 3, 4 and outputs them as (b4), (c4) and (d4) respectively. A sub-band selector 162 inputs (b4), (c4) and (d4) and outputs a sub-band number (e4) that the average value of the spectrum envelope amplitudes is maximized.
A sub-bands 2, 3, 4 voiced strength table (for the voiced one) 163 stores 3 three-dimensional vectors (f41), (f42), (f43) and each three-dimensional vector is configured by the voiced strengths of the sub-bands 2, 3, 4 when it is the voiced frame.
A switch 1_165 selects 1 vector (h4) from within the 3 three-dimensional vectors in accordance with the sub-band number (e4) and outputs it. A sub-bands 2, 3, 4 voiced strength table (for the voiceless one) 164 stores 3 three-dimensional vectors (g41), (g42), (g43) in the same way and each three-dimensional vector is configured by the voiced strengths of the sub-bands 2, 3, 4 when it is the voiceless frame.
A switch 2_166 selects 1 vector (i4) from within the 3 three-dimensional vectors in accordance with the sub-band number (e4) and outputs it. A switch 3_167 inputs the high range voiced/voiceless flag (f2) and selects (h4) when it indicates the voiced one and selects (i4) when it indicates the voiceless one and outputs it as (j4).
A mixing ratio calculator 168 inputs the voiced strength (a4) of the sub-band 1 and the voiced strength (j4) of the sub-bands 2, 3, 4 and outputs the mixing ratio (g2) in each sub-band. The mixing ratio (g2) is configured by sb1_p, sb2_p, sb3_p, sb4_p which indicate ratios of the pulse sound source in the respective sub-bands and sb1_n, sb2_n, sb3_n, sb4_n which indicate ratios of the noise sound source therein (here, in sbx_y, x indicates the sub-band number, and y indicates the pulse sound source when it is p and the noise sound source when it is n). As sb1_p, sb2_p, sb3_p, sb4_p, the values of the voiced strength (a4) of the sub-band 1 and the voiced strengths (j4) of the sub-bands 2, 3, 4 are used as they are respectively. sbx_n (x=1, . . . 4) is set such that sbx_n=(1.0−sbx_p) (x=1, . . . 4).
Next, a determination method for the sub-bands 2, 3, 4 voiced strength table (for the voiced one) will be described. Values of the table in Table 4 are determined on the basis of a result of voiced strength measurement of the sub-bands 2, 3, 4 in the voiced frame in FIG. 7.
A measurement method in FIG. 7 will be described in the following.
Average values of the spectrum envelope amplitudes in the respective sub-bands 2, 3, 4 are calculated per frame (20 ms) for an input voice and they are classified into 3 frame groups of a group (expressed as fg_sb2) of the frames in which that of the sub-band 2 is maximized, a group (expressed as fg_sb3) of the frames in which that of the sub-band 3 is maximized and a group (expressed as fg_sb4) of the frames in which that of the sub-band 4 is maximized.
Next, the voiced frame which belongs to the frame group fg_sb2 is divided into sub-band signals corresponding to the sub-bands 2, 3, 4, normalized autocorrelation functions of the respective sub-band signals in the pitch period are obtained and an average value thereof is obtained per sub-band.
FIG. 7 is a graph showing the voiced strengths (when it is voiced) of the sub-bands 2, 3, 4 in the conventional system.
The horizontal axis in FIG. 7 shows the sub-band number. Since the normalized autocorrelation function is a parameter which indicates the strength of periodicity of an input signal, that is, the strength of voicing perception, it means the voiced strength. The vertical axis in FIG. 7 indicates the voiced strength (the normalized autocorrelation) of each sub-band signal. In the drawing, a curved line which is marked with ♦ (diamond) shows a result of measurement of the frame group fg_sb2. Likewise, a result of measurement of the frame group fg_sb3 is shown by a curved line which is marked with ▪ (square) and a result of measurement of the frame group fg_sb4 is shown by a curved line which is marked with ▴ (triangle). The input voice signals used in the measurement are configured by voices from a voice database CD-ROM and voices recorded from FM broadcasts. It is seen from FIG. 7 that there is a tendency as follows.
In the frames (the mark ♦ and the mark ▪) that the average value of the spectrum envelope amplitudes in the sub-band 2 or 3 is maximized, the voiced strength is monotonically reduced as the frequency of the sub-band becomes high.
In the frame (the mark ▴) that the average value of the spectrum envelope amplitudes in the sub-band 4 is maximized, the voiced strength is not monotonically reduced and the voiced strength of the sub-band 4 is comparatively strengthened as the frequency of the sub-band becomes high. In addition, the voiced strengths of the sub-bands 2, 3 are weakened (in comparison with cases (the mark ♦ and the mark ▪) where the average value of the spectrum envelope amplitudes in the sub-band 2 or 3 is maximized).
The voiced strength of the sub-band 2 of the frame (the mark ♦) that the average value of the spectrum envelope amplitudes of the sub-band 2 is maximized becomes larger than the voiced strengths of the sub-band 2 marked with ▪ and ▴. Likewise, the voiced strength of the sub-band 3 of the frame (the mark ▪) that the average value of the spectrum envelope amplitudes of the sub-band 3 is maximized becomes larger than the voiced strengths of the sub-band 3 marked with ♦ and ▴. Likewise, the voiced strength of the sub-band 4 of the frame (the mark ▴) that the average value of the spectrum envelope amplitudes of the sub-band 4 is maximized becomes larger than the voiced strengths of the sub-band 4 marked with ♦ and ▪.
Accordingly, a value of the voiced strength of the curved line which is marked with ♦ is stored as (f41) in FIG. 6, a value of the voiced strength of the curved line which is marked with ▪ is stored as (f42), a value of the voiced strength of the curved line which is marked with ▴ is stored as (f43) and they are selected on the basis of the sub-band number that (e4) indicates, and thereby an appropriate voiced strength can be set in accordance with the spectrum envelope amplitude. Details of the voiced strength table (for the voiced one) of the sub-bands 2, 3, 4 are shown in Table 4.
TABLE 4
                 Voiced strength
Vector number    Sub-band 2    Sub-band 3    Sub-band 4
(f41)            0.285         0.713         0.627
(f42)            0.81          0.75          0.67
(f43)            0.773         0.691         0.695
FIG. 8 is a graph showing the voiced strengths (when it is voiceless) of the sub-bands 2, 3, 4 in the conventional system.
The sub-bands 2, 3, 4 voiced strength table (for the voiceless one) 164 makes determination on the basis of a result of measurement of the voiced strengths of the sub-bands 2, 3, 4 in the voiceless frame in FIG. 8. The measurement method in FIG. 8 and the method of determining the details of the table are exactly the same as those in the case of the above-described voiced frame. It is seen from FIG. 8 that there is the following tendency.
The voiced strength of the sub-band 2 of the frame (the mark ♦) that the average value of the spectrum envelope amplitudes of the sub-band 2 is maximized becomes smaller than the voiced strengths of the sub-band 2 marked with ▪ and ▴. Likewise, the voiced strength of the sub-band 3 of the frame (the mark ▪) that the average value of the spectrum envelope amplitudes of the sub-band 3 is maximized becomes smaller than the voiced strengths of the sub-band 3 marked with ♦ and ▴. Likewise, the voiced strength of the sub-band 4 of the frame (the mark ▴) that the average value of the spectrum envelope amplitudes of the sub-band 4 is maximized becomes smaller than the voiced strengths of the sub-band 4 marked with ♦ and ▪. Details of the table which is based on FIG. 8 are shown in Table 5.
TABLE 5
                 Voiced strength
Vector number    Sub-band 2    Sub-band 3    Sub-band 4
(g41)            0.247         0.263         0.301
(g42)            0.34          0.253         0.317
(g43)            0.324         0.266         0.29
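The mixing ratio determination of FIG. 6 can be sketched as follows, with the table values copied as printed in Tables 4 and 5. The function name, the argument names and the dictionary layout (keyed by the number of the sub-band whose average spectrum envelope amplitude is the largest) are assumptions made for illustration.

```python
# Voiced strengths of sub-bands 2, 3, 4, keyed by the selected sub-band number (e4).
VOICED_TABLE = {2: (0.285, 0.713, 0.627),      # Table 4, as printed
                3: (0.81, 0.75, 0.67),
                4: (0.773, 0.691, 0.695)}
VOICELESS_TABLE = {2: (0.247, 0.263, 0.301),   # Table 5, as printed
                   3: (0.34, 0.253, 0.317),
                   4: (0.324, 0.266, 0.29)}

def mixing_ratio(low_voiced, high_voiced, avg_amplitudes):
    """Return (pulse ratios, noise ratios) for sub-bands 1 to 4.

    avg_amplitudes holds the average spectrum envelope amplitudes
    (b4), (c4), (d4) of sub-bands 2, 3, 4.
    """
    # Sub-band selector 162: sub-band with the largest average amplitude.
    band = 2 + max(range(3), key=lambda i: avg_amplitudes[i])
    # Switch 3_167: choose the table by the high range voiced/voiceless flag.
    table = VOICED_TABLE if high_voiced else VOICELESS_TABLE
    pulse = [1.0 if low_voiced else 0.0] + list(table[band])
    noise = [1.0 - p for p in pulse]            # sbx_n = 1.0 - sbx_p
    return pulse, noise
```

For example, a frame that is voiced in both the low and high ranges and whose sub-band 3 amplitude is the largest uses the vector (f42) for sub-bands 2, 3, 4 and a pulse ratio of 1.0 for sub-band 1.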
A parameter interpolator 140 linearly interpolates the respective parameters (c2), (e2), (g2), (j2), (i2) and (n2) in synchronization with the pitch period and outputs (o2), (p2), (r2), (s2), (t2) and (u2) respectively. Linear interpolation processing which is performed here is performed in accordance with (Formula 3).
Parameter after interpolation=Parameter of current frame×int+Parameter of previous frame×(1.0−int)  (Formula 3)
Here, the parameter of the current frame corresponds to each of (c2), (e2), (g2), (j2), (i2) and (n2) and the parameter after interpolation corresponds to each of (o2), (p2), (r2), (s2), (t2) and (u2). The parameter of the previous frame is given by holding (c2), (e2), (g2), (j2), (i2) and (n2) in the previous frame.
int is an interpolation coefficient and is obtained using (Formula 4).
int=to/160  (Formula 4)
Here, 160 is the number of samples per voice decoding frame length (20 ms) and to is a start sample point of 1 pitch period in a decoding frame and is updated by adding the pitch period every time a reproduced voice for 1 pitch period is decoded. When to exceeds “160”, it means termination of decoding processing of that frame and “160” is subtracted from to. A pitch period calculator 141 inputs the interpolated pitch period (o2) and the jitter value (p2) and calculates a pitch period (q2) using (Formula 5).
Pitch period (q2)=Pitch period (o2)×(1.0−Jitter value (p2)×Random number value)  (Formula 5)
Here, the random number value takes a value within a range from −1.0 to 1.0. Although the pitch period (q2) has a fractional part, it is rounded off and is converted into an integer. In the following, the pitch period (q2) which is converted into the integer will be expressed as an integer pitch period (q2). From (Formula 5), since the jitter value is set to “0.25” in the voiceless or aperiodic frame, the jitter is added, and since the jitter value is set to “0” in a perfectly periodic frame, the jitter is not added. However, since the jitter value is subjected to interpolation processing per pitch, there also exists a pitch section to which an intermediate jitter amount within a range from 0 to 0.25 is added.
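The pitch-synchronous interpolation of (Formula 3) and (Formula 4) and the jittered pitch period of (Formula 5) can be sketched as below. The function names, the use of a uniform random number generator and the round-half-up rule are assumptions; the text states only that the random number value lies between −1.0 and 1.0 and that the result is rounded to an integer.

```python
import random

FRAME_LENGTH = 160   # samples per 20 ms decoding frame

def interpolate(current, previous, start_sample):
    """Formula 3 with the interpolation coefficient int = to / 160 (Formula 4)."""
    coeff = start_sample / FRAME_LENGTH
    return current * coeff + previous * (1.0 - coeff)

def jittered_pitch(pitch, jitter, rng=random):
    """Formula 5: pitch x (1.0 - jitter x random value), rounded to an integer."""
    r = rng.uniform(-1.0, 1.0)          # uniform distribution is an assumption
    value = pitch * (1.0 - jitter * r)
    return int(value + 0.5)             # round half up (assumed rounding rule)
```

With a jitter value of 0 the pitch period is returned unchanged, and with the maximum jitter value of 0.25 it varies by up to 25 percent around the interpolated pitch period.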
Generating the aperiodic pitch (the pitch with the jitter being added) in this way is effective in reducing a tone-like noise by expressing an irregular (aperiodic) glottal pulse which occurs in the transient part and the plosive.
A 1-pitch waveform decoder 150 decodes and outputs a reproduced voice (b3) per integer pitch period (q2). Accordingly, all blocks included in this block input the integer pitch period (q2) and operate in synchronization therewith.
A pulse generator 142 outputs a single pulse signal (v2) in a term of the integer pitch period (q2). A noise generator 143 outputs a white noise (w2) which has a length of the integer pitch period (q2). A mixed sound source generator 144 mixes the single pulse signal (v2) with the white noise (w2) on the basis of a mixing ratio (r2) of each sub-band after interpolation and outputs a mixed sound source signal (x2).
A configuration of the mixed sound source generator 144 is shown in FIG. 9. FIG. 9 is a diagram showing the mixed sound source generator of the conventional system.
First, a course of generating a mixed signal (q5) of the sub-band 1 will be described. An LPF 1_170 bandlimits the single pulse signal (v2) at 0 to 1 kHz and outputs (a5). An LPF 2_171 bandlimits the white noise (w2) at 0 to 1 kHz and outputs (b5). A multiplier 1_178, a multiplier 2_179 multiply (a5), (b5) by sb1_p, sb1_n included in the mixing ratio information (r2) and output (i5), (j5) respectively.
An adder 1_186 adds (i5) and (j5) together and outputs the mixed signal (q5) of the sub-band 1. Also, a mixed signal (r5) of the sub-band 2 is formed by using a BPF 1_172, a BPF 2_173, a multiplier 3_180, a multiplier 4_181, and an adder 2_189 similarly. Also, a mixed signal (s5) of the sub-band 3 is formed by using a BPF 3_174, a BPF 4_175, a multiplier 5_182, a multiplier 6_183, and an adder 3_190 similarly. Also, a mixed signal (t5) of the sub-band 4 is formed by using an HPF 1_176, an HPF 2_177, a multiplier 7_184, a multiplier 8_185, and an adder 4_191 similarly. An adder 5_192 adds the mixed signals (q5), (r5), (s5) and (t5) of the respective sub-bands together and synthesizes a mixed sound source signal (x2).
A linear prediction coefficient calculator 2_147 converts the LSF coefficient (t2) after interpolation into a linear prediction coefficient and outputs a linear prediction coefficient (c3). An adaptive spectrum enhancement filter 145 is an adaptive pole-zero filter which uses, as its coefficient, the linear prediction coefficient (c3) on which bandwidth extension processing has been performed, and improves the naturalness of the reproduced voice by making resonance of formants sharp and thereby improving the degree of approximation to the formants of a natural voice. Further, it corrects the inclination of the spectrum by using an interpolated inclination correction coefficient (s2) and thereby reduces muffling of the sound. The mixed sound source signal (x2) is filtered by the adaptive spectrum enhancement filter 145 and (y2) which is a result thereof is output. An LPC synthesis filter 146 is an all-pole filter which uses the linear prediction coefficient (c3) as the coefficient and adds the spectrum envelope information to the sound source signal (y2) and outputs a signal (z2) which is a result thereof. A gain adjustor 148 performs gain adjustment on (z2) by using gain information (u2) and outputs (a3). A pulse diffusion filter 149 is a filter adapted to improve the degree of approximation of the pulse sound source waveform to the glottal pulse waveform of the natural voice and filters (a3) and outputs a reproduced signal (b3) which is improved in naturalness.