1. Field of the Invention
The present invention relates to an audio signal expansion/compression apparatus and an audio signal expansion/compression method for changing a playback speed of an audio signal such as a music signal.
2. Description of the Related Art
PICOLA (Pointer Interval Control OverLap and Add) is known as one of algorithms of expanding/compressing a digital audio signal in a time domain (see, for example, “Expansion and compression of audio signals using a pointer interval control overlap and add (PICOLA) algorithm and evaluation thereof”, Morita and Itakura, The Journal of Acoustical Society of Japan, October, 1986, p. 149-150). An advantage of this algorithm is that the algorithm needs a simple process and can provide good sound quality for a processed audio signal. The PICOLA algorithm is briefly described below with reference to some figures. In the following description, signals such as a music signal other than voice signals are referred to as acoustic signals, and voice signals and acoustic signals are generically referred to as audio signals.
FIGS. 22A to 22D illustrate an example of a process of expanding an original waveform using the PICOLA algorithm. First, intervals having a similar waveform in an original signal (FIG. 22A) are detected. In the example shown in FIG. 22A, intervals A and B similar to each other are detected. Note that intervals A and B are selected so that they include the same number of samples. Next, a fade-out waveform (FIG. 22B) is produced from the waveform in the interval B, and a fade-in waveform (FIG. 22C) is produced from the waveform in the interval A. Finally, an expanded waveform (FIG. 22D) is produced by connecting the fade-out waveform (FIG. 22B) and the fade-in waveform (FIG. 22C) such that the fade-out part and the fade-in part overlap with each other. The connection of the fade-out waveform and the fade-in waveform in this manner is called cross fading. Hereafter, the cross-faded interval between the interval A and the interval B is denoted by A×B. As a result of the process described above, the original waveform (FIG. 22A) including the intervals A and B is converted into the expanded waveform (FIG. 22D) including the intervals A, A×B, and B.
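The expansion step described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the patented implementation: a linear fade is assumed (the fade shape is not specified above), and the function names are hypothetical.

```python
# Illustrative sketch of the expansion of FIGS. 22A to 22D: intervals A
# and B, each W samples long, are cross-faded to synthesize the inserted
# interval A x B.  A linear fade is assumed.

def cross_fade(fade_out, fade_in):
    """Fade the first interval out while fading the second in."""
    w = len(fade_out)
    assert w == len(fade_in)
    return [fade_out[i] * (w - i) / w + fade_in[i] * i / w for i in range(w)]

def expand_once(signal, start, w):
    """Turn intervals A, B into A, A x B, B (one expansion step)."""
    a = signal[start:start + w]
    b = signal[start + w:start + 2 * w]
    # Insert the cross-fade (fade-out of B, fade-in of A) between A and B.
    return signal[:start + w] + cross_fade(b, a) + signal[start + w:]
```

With an 8-sample input and W = 2, one step yields a 10-sample output, the two inserted samples being the cross-fade of the two intervals.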
FIGS. 23A to 23C illustrate a manner of detecting the interval length W of the intervals A and B which are similar in waveform to each other. First, intervals A and B starting from a start point P0 and each including j samples are extracted from an original signal as shown in FIG. 23A and evaluated. The similarity in waveform between the intervals A and B is evaluated while increasing the number of samples j as shown in FIGS. 23A, 23B, and 23C, until the highest similarity is detected between the intervals A and B each including j samples. The similarity may be defined, for example, by the following function D(j).
D(j) = (1/j)Σ{x(i) − y(i)}²  (i = 0 to j − 1)  (1)
where x(i) is the value of an i-th sample in the interval A, and y(i) is the value of an i-th sample in the interval B. D(j) is calculated for j in the range WMIN ≦ j ≦ WMAX, and the j which results in a minimum value of D(j) is determined. The value of j determined in this manner gives the interval length W of the intervals A and B having the highest similarity. WMAX and WMIN are set in the range of, for example, 50 to 250. When the sampling frequency is 8 kHz, WMAX and WMIN are set, for example, such that WMAX = 160 and WMIN = 32. In the present example, D(j) has a lowest value in the state shown in FIG. 23B, and j in this state is employed as the value indicating the length of the highest-similarity interval.
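The search for the similar-interval length W by minimizing equation (1) can be sketched as follows. This is a minimal illustration with hypothetical names; the actual detector is described with reference to FIGS. 30 and 31 below.

```python
# Sketch of the similarity search of equation (1): D(j) is evaluated for
# WMIN <= j <= WMAX and the j giving the smallest D(j) is taken as the
# similar-interval length W.

def d(signal, p0, j):
    """Mean squared difference between two adjacent j-sample intervals."""
    return sum((signal[p0 + i] - signal[p0 + j + i]) ** 2
               for i in range(j)) / j

def find_similar_length(signal, p0, w_min, w_max):
    """Return the j in [w_min, w_max] that minimizes D(j)."""
    return min(range(w_min, w_max + 1), key=lambda j: d(signal, p0, j))
```

For a signal with an exact period of 5 samples, D(5) is zero and the search returns W = 5.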
Use of the function D(j) described above is important in the determination of the length W of an interval with a similar waveform (hereinafter, referred to simply as a similar-interval length W). This function is used only in finding intervals similar in waveform to each other, that is, this function is used only in a pre-process to determine a cross-fade interval. The function D(j) is applicable even to a waveform having no pitch such as white noise.
FIGS. 24A and 24B illustrate an example of a manner in which a waveform is expanded to an arbitrary length. First, j is determined for which the function D(j) has a minimum value with respect to a start point P0, and W is set to j (W = j) as described above with reference to FIGS. 23A to 23C. Next, an interval 2401 is copied as an interval 2403, and a cross-fade waveform between the intervals 2401 and 2402 is produced as an interval 2404. An interval obtained by removing the interval 2401 from the total interval from P0 to P0′ in the original waveform shown in FIG. 24A is copied at a position directly following the cross-fade interval 2404 as shown in FIG. 24B. As a result, the original waveform including L samples in the range from the start point P0 to the point P0′ is expanded to a waveform including (W + L) samples. Hereinafter, the ratio of the number of samples included in the expanded waveform to the number of samples included in the original waveform will be denoted by r. That is, r is given by the following equation.
r = (W + L)/L  (1.0 < r ≦ 2.0)  (2)
Equation (2) can be rewritten as follows.
L = W·1/(r − 1)  (3)
To expand the original waveform (FIG. 24A) by a factor of r, the point P0′ is selected according to equation (4) shown below.
P0′ = P0 + L  (4)
If R is defined as 1/r by equation (5), then L is given by equation (6) shown below.
R = 1/r  (0.5 ≦ R < 1.0)  (5)
L = W·R/(1 − R)  (6)
By introducing the parameter R as described above, it becomes possible to express the playback length such that "the waveform is played back at R times the original speed" (FIG. 24A). Hereinafter, the parameter R will be referred to as a speech speed conversion ratio. When the process for the range from the point P0 to the point P0′ in the original waveform (FIG. 24A) is completed, the process described above is repeated by selecting the point P0′ as a new start point P1. In the example shown in FIGS. 24A and 24B, the number of samples L is equal to about 2.5 W, so the signal is played back at a speed about 0.7 times the original speed. That is, in this case, the signal is played back at a speed slower than the original speed.
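Equation (6) can be checked numerically. The sketch below is illustrative only; the function name is hypothetical, and L is rounded to a whole number of samples.

```python
# Equation (6) in code form: given the similar-interval length W and a
# speech speed conversion ratio R (0.5 <= R < 1.0), the interval length
# L processed per iteration is W * R / (1 - R); the start point then
# advances by L.

def expansion_interval(w, R):
    """L = W * R / (1 - R), rounded to a whole number of samples."""
    return round(w * R / (1 - R))

# With R = 5/7 (about 0.71), L = 2.5 W, matching the example in the text.
```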
Next, a process of compressing an original waveform is described. FIGS. 25A to 25D illustrate an example of a manner in which an original waveform is compressed using the PICOLA algorithm. First, intervals having a similar waveform in an original signal (FIG. 25A) are detected. In the example shown in FIG. 25A, intervals A and B similar to each other are detected. Note that intervals A and B are selected so that they include the same number of samples. Next, a fade-out waveform (FIG. 25B) is produced from the waveform in the interval A, and a fade-in waveform (FIG. 25C) is produced from the waveform in the interval B. Finally, a compressed waveform (FIG. 25D) is produced by superimposing the fade-in waveform (FIG. 25C) on the fade-out waveform (FIG. 25B). As a result of the process described above, the original waveform (FIG. 25A) including the intervals A and B is converted into the compressed waveform (FIG. 25D) including the cross-fade interval A×B.
FIGS. 26A and 26B illustrate an example of a manner in which a waveform is compressed to an arbitrary length. First, j is determined for which the function D(j) has a minimum value with respect to a start point P0, and W is set to j (W = j) as described above with reference to FIGS. 23A to 23C. Next, a cross-fade waveform between the intervals 2601 and 2602 is produced as an interval 2603. An interval obtained by removing the intervals 2601 and 2602 from the total interval from P0 to P0′ in the original waveform shown in FIG. 26A is copied in a compressed waveform (FIG. 26B). As a result, the original waveform including (W + L) samples in the range from the start point P0 to the point P0′ (FIG. 26A) is compressed to a waveform including L samples (FIG. 26B). Thus, the ratio of the number of samples of the compressed waveform to the number of samples of the original waveform is given by r as described below.
r = L/(W + L)  (0.5 ≦ r < 1.0)  (7)
Equation (7) can be rewritten as follows.
L = W·r/(1 − r)  (8)
To compress the original waveform (FIG. 26A) by a factor of r, the point P0′ is selected according to equation (9) shown below.
P0′ = P0 + (W + L)  (9)
If R is defined as 1/r by equation (10), then L is given by equation (11) shown below.
R = 1/r  (1.0 < R ≦ 2.0)  (10)
L = W·1/(R − 1)  (11)
By defining the parameter R as described above, it becomes possible to express the playback length such that "the waveform is played back at R times the original speed" (FIG. 26A). When the process for the range from the point P0 to the point P0′ in the original waveform (FIG. 26A) is completed, the process described above is repeated by selecting the point P0′ as a new start point P1. In the example shown in FIGS. 26A and 26B, the number of samples L is equal to about 1.5 W, so the signal is played back at a speed about 1.7 times the original speed. That is, in this case, the signal is played back at a speed faster than the original speed.
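Equation (11) can likewise be checked numerically. The sketch is illustrative; the function name is hypothetical, and L is rounded to a whole number of samples.

```python
# Equation (11) in code form: for compression, R satisfies
# 1.0 < R <= 2.0 and L = W / (R - 1); the start point then advances
# by W + L.

def compression_interval(w, R):
    """L = W / (R - 1), rounded to a whole number of samples."""
    return round(w / (R - 1))

# With R = 5/3 (about 1.67), L = 1.5 W, matching the example in the text.
```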
Referring to a flow chart shown in FIG. 27, the waveform expanding process according to the PICOLA algorithm is described in further detail below. In step S1001, it is determined whether there is an audio signal to be processed in an input buffer. If there is no audio signal to be processed, the process is ended. If there is an audio signal to be processed, the process proceeds to step S1002. In step S1002, j is determined for which the function D(j) has a minimum value with respect to a start point P, and W is set to j (W=j). In step S1003, L is determined from the speech speed conversion ratio R specified by a user. In step S1004, an audio signal in an interval A including W samples in a range starting from a start point P is output to an output buffer. In step S1005, a cross-fade interval C is produced from the interval A including W samples starting from the start point P and a next interval B including W samples. In step S1006, data in the produced interval C is supplied to the output buffer. In step S1007, data including (L−W) samples in a range starting from a point P+W is output from the input buffer to the output buffer. In step S1008, the start point P is moved to P+L. Thereafter, the processing flow returns to step S1001 to repeat the process described above from step S1001.
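The expansion flow of FIG. 27 can be sketched end to end as follows. This is an illustrative, simplified rendition (linear fades, whole-buffer input, boundary handling reduced to a single length check), not the patented implementation; all names are hypothetical.

```python
# Sketch of the expansion flow of FIG. 27.  R is the speech speed
# conversion ratio (0.5 <= R < 1.0); per iteration, L input samples
# produce W + L output samples.

def expand(signal, R, w_min, w_max):
    out = []
    p = 0
    while p + 2 * w_max <= len(signal):                    # S1001: data left?
        w = min(range(w_min, w_max + 1),                   # S1002: W = argmin D(j)
                key=lambda j: sum((signal[p + i] - signal[p + j + i]) ** 2
                                  for i in range(j)) / j)
        l = round(w * R / (1 - R))                         # S1003: L from R
        a = signal[p:p + w]
        b = signal[p + w:p + 2 * w]
        out += a                                           # S1004: output A
        out += [a[i] * (w - i) / w + b[i] * i / w          # S1005-S1006: cross-fade C
                for i in range(w)]
        out += signal[p + w:p + l]                         # S1007: L - W samples
        p += l                                             # S1008: P -> P + L
    out += signal[p:]                                      # flush the unprocessed tail
    return out
```

With R = 0.5 the output is roughly twice the input length (exactly double except for the unprocessed tail).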
Next, referring to a flow chart shown in FIG. 28, the waveform compression process according to the PICOLA algorithm is described in further detail below. In step S1101, it is determined whether there is an audio signal to be processed in an input buffer. If there is no audio signal to be processed, the process is ended. If there is an audio signal to be processed, the process proceeds to step S1102. In step S1102, j is determined for which the function D(j) has a minimum value with respect to a start point P, and W is set to j (W=j). In step S1103, L is determined from the speech speed conversion ratio R specified by a user. In step S1104, a cross-fade interval C is produced from the interval A including W samples starting from the start point P and a next interval B including W samples. In step S1105, data in the produced interval C is supplied to the output buffer. In step S1106, data including (L−W) samples in a range starting from a point P+2W is output from the input buffer to the output buffer. In step S1107, the start point P is moved to P+(W+L). Thereafter, the processing flow returns to step S1101 to repeat the process described above from step S1101.
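The compression flow of FIG. 28 can be sketched in the same illustrative, simplified style (linear fades, whole-buffer input, hypothetical names), not as the patented implementation.

```python
# Sketch of the compression flow of FIG. 28.  R is the speech speed
# conversion ratio (1.0 < R <= 2.0); per iteration, W + L input samples
# produce L output samples.

def compress(signal, R, w_min, w_max):
    out = []
    p = 0
    while p + 2 * w_max <= len(signal):                    # S1101: data left?
        w = min(range(w_min, w_max + 1),                   # S1102: W = argmin D(j)
                key=lambda j: sum((signal[p + i] - signal[p + j + i]) ** 2
                                  for i in range(j)) / j)
        l = round(w / (R - 1))                             # S1103: L from R
        a = signal[p:p + w]
        b = signal[p + w:p + 2 * w]
        out += [a[i] * (w - i) / w + b[i] * i / w          # S1104-S1105: cross-fade C
                for i in range(w)]
        out += signal[p + 2 * w:p + w + l]                 # S1106: L - W samples from P+2W
        p += w + l                                         # S1107: P -> P + (W + L)
    out += signal[p:]                                      # flush the unprocessed tail
    return out
```

With R = 2.0 the output is roughly half the input length (exactly half except for the unprocessed tail).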
FIG. 29 illustrates an example of a configuration of a speech speed conversion apparatus 100 using the PICOLA algorithm. First, an audio signal to be processed is stored in an input buffer 101. A similar-waveform length detector 102 examines the audio signal stored in the input buffer 101 to detect j for which the function D(j) has a minimum value, and sets W to j (W=j). The similar-waveform length W determined by the similar-waveform length detector 102 is supplied to the input buffer 101 so that the similar-waveform length W is used in a buffering operation. The input buffer 101 supplies 2W samples of audio signal to a connection waveform generator 103. The connection waveform generator 103 compresses the received 2W samples of audio signal into W samples by performing cross-fading. In accordance with the speech speed conversion ratio R, the input buffer 101 and the connection waveform generator 103 supply audio signals to the output buffer 104. An audio signal is generated by the output buffer 104 from the received audio signals and output, as an output audio signal, from the speech speed conversion apparatus 100.
FIG. 30 is a flow chart illustrating the process performed by the similar-waveform length detector 102 configured as shown in FIG. 29. In step S1201, an index j is set to an initial value of WMIN. In step S1202, a subroutine shown in FIG. 31 is executed to calculate a function D(j), for example, given by equation (12) shown below.
D(j) = (1/j)Σ{f(i) − f(j + i)}²  (i = 0 to j − 1)  (12)
where f is the input audio signal. In the example shown in FIG. 23A, samples starting from the start point P0 are given as the audio signal f. Note that equation (12) is equivalent to equation (1). In the following discussion, the function D(j) expressed in the form of equation (12) will be used. In step S1203, the value of the function D(j) determined by executing the subroutine is substituted into a variable MIN, and the index j is substituted into W. In step S1204, the index j is incremented by 1. In step S1205, a determination is made as to whether the index j is equal to or smaller than WMAX. If the index j is equal to or smaller than WMAX, the process proceeds to step S1206. However, if the index j is greater than WMAX, the process is ended. The value of the variable W obtained at the end of the process indicates the index j for which the function D(j) has a minimum value, that is, this value gives the similar-waveform length, and the variable MIN in this state indicates the minimum value of the function D(j). In step S1206, the subroutine shown in FIG. 31 is executed to determine the value of the function D(j) for the new index j. In step S1207, it is determined whether the value of the function D(j) determined in step S1206 is equal to or smaller than MIN. If so, the process proceeds to step S1208; otherwise, the process returns to step S1204. In step S1208, the value of the function D(j) determined by executing the subroutine is substituted into the variable MIN, and the index j is substituted into W.
The subroutine shown in FIG. 31 is executed as follows. In step S1301, the index i and a variable s are reset to 0. In step S1302, it is determined whether the index i is smaller than the index j. If so, the process proceeds to step S1303; otherwise, the process proceeds to step S1305. In step S1303, the square of the difference between the magnitude of the audio signal for i and that for j+i is calculated, and the result is added to the variable s. In step S1304, the index i is incremented by 1, and the process returns to step S1302. In step S1305, the variable s is divided by j, the result is set as the value of the function D(j), and the subroutine is ended.
The manner of performing the speech speed conversion on a monaural signal using the PICOLA algorithm has been described above. For a stereo signal, the speech speed conversion according to the PICOLA algorithm is performed, for example, as follows.
FIG. 32 illustrates an example of a functional block configuration for the speech speed conversion using the PICOLA algorithm. In FIG. 32, an L-channel audio signal is denoted simply by L, and an R-channel audio signal is denoted simply by R. In the example shown in FIG. 32, the process is performed in the same manner as that shown in FIG. 29, independently for the L channel and the R channel. This method is simple, but it is not widely used in practical applications because the speech speed conversion performed independently for the R channel and the L channel can result in a slight difference in synchronization between the R channel and the L channel, which makes it difficult to achieve precise localization of the sound. If the location of the sound fluctuates, a user will have a very uncomfortable feeling.
In a case where two speakers are placed at right and left locations to reproduce a stereo signal, a listener feels as if a reproduced sound comes from an area in the middle between the right and left speakers. In some cases, the apparent location of a sound source sensed by a listener moves between the two speakers. In most cases, however, the audio signal is produced so that the apparent location of a sound source is fixed in the middle between the two speakers. Even a slight difference in temporal phase between the right and left channels caused by the speech speed conversion makes the location of the sound, which should be in the middle of the two speakers, fluctuate between the right and left speakers. Such a fluctuation in the sound location causes a listener to have a very uncomfortable feeling. Therefore, in the speech speed conversion for a stereo signal, it is very important not to create a difference in synchronization between the right and left channels.
FIG. 33 illustrates an example of a speech speed conversion apparatus configured to perform the speech speed conversion on a stereo signal without creating a difference in synchronization between right and left channels (see, for example, Japanese Unexamined Patent Application Publication No. 2001-255894). When an input audio signal to be processed is given, a left-channel signal is stored in an input buffer 301, and a right-channel signal is stored in an input buffer 305. A similar-waveform length detector 302 detects a similar-waveform length W for the audio signals stored in the input buffer 301 and the input buffer 305. More specifically, the average of the L-channel audio signal stored in the input buffer 301 and the R-channel audio signal stored in the input buffer 305 is determined by an adder 309, thereby converting the stereo signal into a monaural signal. The similar-waveform length W is determined for this monaural signal by detecting j for which the function D(j) has a minimum value, and W is set to j (W=j). The similar-waveform length W determined for the monaural signal is used as the similar-waveform length W in common for the R-channel audio signal and the L-channel audio signal. The similar-waveform length W determined by the similar-waveform length detector 302 is supplied to the input buffer 301 of the L channel and the input buffer 305 of the R channel so that the similar-waveform length W is used in a buffering operation.
The L-channel input buffer 301 supplies 2W samples of L-channel audio signal to a connection waveform generator 303. The R-channel input buffer 305 supplies 2W samples of R-channel audio signal to a connection waveform generator 307.
The connection waveform generator 303 converts the received 2W samples of L-channel audio signal into W samples of audio signal by performing the cross-fading process. The connection waveform generator 307 converts the received 2W samples of R-channel audio signal into W samples of audio signal by performing the cross-fading process.
The audio signal stored in the L-channel input buffer 301 and the audio signal produced by the connection waveform generator 303 are supplied to an output buffer 304 in accordance with a speech speed conversion ratio R. The audio signal stored in the R-channel input buffer 305 and the audio signal produced by the connection waveform generator 307 are supplied to an output buffer 308 in accordance with the speech speed conversion ratio R. The output buffer 304 combines the received audio signals thereby producing an L-channel audio signal, and the output buffer 308 combines the received audio signals thereby producing an R-channel audio signal. The resultant R and L-channel audio signals are output from the speech speed conversion apparatus 300.
FIG. 34 is a flow chart illustrating the process performed by the similar-waveform length detector 302 and the adder 309. The process shown in FIG. 34 is similar to that shown in FIG. 31 except that the function D(j) indicating the measure of similarity between two waveforms is calculated differently. In FIG. 34 and in the following description, fL denotes a sample value of an L-channel audio signal, and fR denotes a sample value of an R-channel audio signal.
The subroutine shown in FIG. 34 is executed as follows. In step S1401, the index i and a variable s are reset to 0. In step S1402, it is determined whether the index i is smaller than the index j. If so, the process proceeds to step S1403; otherwise, the process proceeds to step S1405. In step S1403, the stereo signal is converted into a monaural signal, the square of the difference of the monaural signal is determined, and the result is added to the variable s. More specifically, the average value a of an i-th sample value of the L-channel audio signal and an i-th sample value of the R-channel audio signal is determined. Similarly, the average value b of an (i+j)th sample value of the L-channel audio signal and an (i+j)th sample value of the R-channel audio signal is determined. These average values a and b respectively indicate i-th and (i+j)th monaural signals converted from the stereo signals. Thereafter, the square of the difference between the average value a and the average value b is calculated, and the result is added to the variable s. In step S1404, the index i is incremented by 1, and the process returns to step S1402. In step S1405, the variable s is divided by the index j, and the result is set as the value of the function D(j). The subroutine is then ended.
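The FIG. 34 subroutine can be sketched as follows. This is illustrative only, with hypothetical names: the L and R channels are averaged into a monaural value sample by sample, and D(j) is accumulated on those averages.

```python
# Sketch of the subroutine of FIG. 34: D(j) computed on the sample-wise
# average of the L and R channels.

def d_stereo(fl, fr, j):
    s = 0.0
    for i in range(j):                       # S1402-S1404: loop over i
        a = (fl[i] + fr[i]) / 2              # i-th monaural value
        b = (fl[i + j] + fr[i + j]) / 2      # (i+j)-th monaural value
        s += (a - b) ** 2                    # S1403: accumulate squared difference
    return s / j                             # S1405: D(j) = s / j
```

For a stereo signal whose monaural average has an exact period of 2 samples, D(2) is zero.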
FIG. 35 illustrates a configuration of a speech speed conversion apparatus disclosed in Japanese Unexamined Patent Application Publication No. 2002-297200. This configuration is similar to that shown in FIG. 33 in that the speech speed conversion is performed without creating a difference in synchronization between the R and L channels, but differs in that a different input signal is used in the detection of the similar-waveform length. More specifically, in the configuration shown in FIG. 35, unlike the configuration shown in FIG. 33 in which the monaural signal is produced by calculating the average of the R and L-channel audio signals, the energy of each frame is determined for each of the R and L channels, and the channel with greater energy is used as the monaural signal.
In the configuration shown in FIG. 35, when an audio signal to be processed is input, a left-channel signal is stored in an input buffer 401, and a right-channel signal is stored in an input buffer 405. A similar-waveform length detector 402 detects a similar-waveform length W for the audio signal stored in the input buffer 401 or the input buffer 405 corresponding to a channel selected by a channel selector 409. More specifically, the channel selector 409 determines the energy of each frame of the L-channel audio signal stored in the input buffer 401 and that of the R-channel audio signal stored in the input buffer 405, and selects the audio signal with greater energy, thereby converting the stereo signal into a monaural audio signal. For this monaural audio signal, the similar-waveform length detector 402 determines the similar-waveform length W by detecting j for which the function D(j) has a minimum value, and sets W to j (W=j). The similar-waveform length W determined for the channel having greater energy is used in common as the similar-waveform length W for the R-channel audio signal and the L-channel audio signal. The similar-waveform length W determined by the similar-waveform length detector 402 is supplied to the input buffer 401 of the L channel and the input buffer 405 of the R channel so that the similar-waveform length W is used in a buffering operation. The L-channel input buffer 401 supplies 2W samples of L-channel audio signal to a connection waveform generator 403. The R-channel input buffer 405 supplies 2W samples of R-channel audio signal to a connection waveform generator 407. The connection waveform generator 403 converts the received 2W samples of L-channel audio signal into W samples of audio signal by performing the cross-fading process.
The connection waveform generator 407 converts the received 2W samples of R-channel audio signal into W samples of audio signal by performing the cross-fading process.
The audio signal stored in the L-channel input buffer 401 and the audio signal produced by the connection waveform generator 403 are supplied to an output buffer 404 in accordance with a speech speed conversion ratio R. The audio signal stored in the R-channel input buffer 405 and the audio signal produced by the connection waveform generator 407 are supplied to an output buffer 408 in accordance with the speech speed conversion ratio R. The output buffer 404 combines the received audio signals thereby producing an L-channel audio signal, and the output buffer 408 combines the received audio signals thereby producing an R-channel audio signal. The resultant R and L-channel audio signals are output from the speech speed conversion apparatus 400.
The process performed by the similar-waveform length detector 402 configured as shown in FIG. 35 is performed in a similar manner to that shown in FIGS. 30 and 31 except that the R-channel audio signal or the L-channel audio signal with greater energy is selected by the channel selector 409 and supplied to the similar-waveform length detector 402.
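The channel selection of FIG. 35, as summarized above, can be sketched as follows. This is an illustrative assumption of how the frame-energy comparison might look; the function name and the sum-of-squares energy measure are hypothetical, not taken from the cited publication.

```python
# Sketch of the channel selector 409 of FIG. 35: the frame energy of
# each channel is computed, and the higher-energy channel is handed to
# the similar-waveform length detector as the monaural signal.

def select_channel(fl, fr):
    """Return whichever channel frame has the greater energy."""
    energy_l = sum(x * x for x in fl)        # frame energy of the L channel
    energy_r = sum(x * x for x in fr)        # frame energy of the R channel
    return fl if energy_l >= energy_r else fr
```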
As described above with reference to FIGS. 22 to 35, it is possible to expand or compress an audio signal at an arbitrary speech speed conversion ratio R (0.5≦R<1.0 or 1.0<R≦2.0) according to the speech speed conversion algorithm (PICOLA) even for stereo signals without causing a fluctuation in location of the sound source.