1. Field of the Invention
The present invention relates to a method and an apparatus for audio signal expansion and compression for altering the playback speed of music or the like.
2. Description of the Related Art
PICOLA (Pointer Interval Control OverLap and Add) is known as one of the algorithms for expanding and compressing digital audio signals in the time domain. This algorithm advantageously provides good sound quality for voice signals while requiring simple processing and low processing load. PICOLA will be described briefly below with reference to the accompanying drawings. Hereinafter, signals, contained in music or the like, other than voice signals are referred to as acoustic signals, and voice signals and acoustic signals are collectively referred to as audio signals.
FIGS. 13A to 13D show an example of expansion of an original waveform using PICOLA. Firstly, intervals A and B having similar waveforms are found from an original waveform (FIG. 13A). The intervals A and B have an identical number of samples. A fade-out waveform (FIG. 13B) is then generated from the interval B. Similarly, a fade-in waveform (FIG. 13C) is generated from the interval A. An expanded waveform (FIG. 13D) is obtained by adding the waveform shown in FIG. 13B and the waveform shown in FIG. 13C. Adding a fade-out waveform and a fade-in waveform in this way is referred to as cross-fading. Herein, suppose that an interval obtained by cross-fading the intervals A and B is represented as an interval A×B. By performing the above-described operations, the intervals A and B are changed into the interval A, the interval A×B, and the interval B. That is, the intervals A and B are expanded.
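The expansion of one pair of intervals described above can be sketched as follows. This is a minimal illustration only: the function names `cross_fade` and `expand_pair` are hypothetical, and the linear fade weights are an assumption, since the description does not specify the shape of the fade.

```python
def cross_fade(fade_out, fade_in):
    """Add a fading-out copy of `fade_out` to a fading-in copy of `fade_in`."""
    if len(fade_out) != len(fade_in):
        raise ValueError("intervals must have the same number of samples")
    n = len(fade_out)
    mixed = []
    for i in range(n):
        w = i / n  # fade-in weight rising from 0 towards 1 (linear shape is an assumption)
        mixed.append((1.0 - w) * fade_out[i] + w * fade_in[i])
    return mixed


def expand_pair(a, b):
    """Turn the intervals A, B into A, AxB, B (FIGS. 13A to 13D)."""
    # For expansion, the interval B fades out while the interval A fades in.
    return list(a) + cross_fade(b, a) + list(b)


a = [1.0, 2.0, 3.0, 4.0]
b = [1.0, 2.0, 3.0, 4.0]  # perfectly similar intervals
print(len(expand_pair(a, b)))  # 2W samples become 3W samples
```

With perfectly similar intervals, the cross-faded interval reproduces the shared waveform, so the output is simply one more repetition of it.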
FIGS. 14A to 14C are schematic diagrams showing a method for detecting an interval length W of the intervals A and B containing similar waveforms. Firstly, the intervals A and B having j samples are set as shown in FIG. 14A by using a processing start point P0 as an origin. A value of j where the waveforms in the intervals A and B resemble each other the most is determined while gradually increasing j as shown in FIGS. 14A, 14B, and 14C sequentially. For example, the following function D(j) can be used as a scale for measuring the similarity.
$$D(j) = \frac{1}{j} \sum_{i=0}^{j-1} \{x(i) - y(i)\}^2 \qquad (1)$$
The value j that gives the minimum value for the function D(j) is determined by calculating the function D(j) in a range of WMIN≦j≦WMAX. The value j determined at this time corresponds to an interval length W of the intervals A and B. Here, x(i) indicates each sampled value in the interval A, whereas y(i) indicates each sampled value in the interval B. In addition, WMAX and WMIN are, for example, the interval lengths corresponding to pitch frequencies of approximately 50 Hz and 250 Hz, respectively. If the sampling frequency is set to 8 kHz, WMAX and WMIN are equal to approximately 160 (8000/50) and 32 (8000/250) samples, respectively. In the example shown in FIGS. 14A to 14C, the value j determined in FIG. 14B is selected as the value j that gives the minimum value for the function D(j).
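The relation between the pitch range and the search range can be checked with a short calculation. The helper name `lag_bounds` and its default arguments are hypothetical; rounding to the nearest whole sample is an assumption.

```python
def lag_bounds(sample_rate_hz, f_low_hz=50.0, f_high_hz=250.0):
    """Return (WMIN, WMAX) in samples for a pitch range f_low_hz to f_high_hz."""
    w_min = round(sample_rate_hz / f_high_hz)  # shortest period, highest pitch
    w_max = round(sample_rate_hz / f_low_hz)   # longest period, lowest pitch
    return w_min, w_max


print(lag_bounds(8000))  # → (32, 160)
```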
It is important to utilize the foregoing function D(j) to determine the interval length W of similar waveforms. This function is designed to search for the intervals having waveforms that resemble each other the most and is used specifically as preprocessing for determining the cross-fade interval. In addition, this processing can be applied to waveforms not having pitch, such as white noise.
FIGS. 15A and 15B are schematic diagrams showing a method for expanding a waveform to a given length. Firstly, as shown in FIGS. 14A to 14C, a processing start point P0 is set as an origin, and a value j that gives the minimum value for the function D(j) is determined. The interval length W is set equal to j. As shown in FIGS. 15A and 15B, a waveform in an interval 1401 is then copied to an interval 1403, and a cross-fade waveform of waveforms in the intervals 1401 and 1402 is generated in an interval 1404. A waveform in an interval from the point P0 to a point P0′ of the original waveform (FIG. 15A) excluding the interval 1401 is copied behind the expanded waveform (FIG. 15B). With the above-described operations, the number of samples in the expanded waveform (FIG. 15B) is increased to W+L samples from L samples in the interval between the point P0 and the point P0′ of the original waveform (FIG. 15A). That is, the number of samples is multiplied by “r”.
$$r = \frac{W+L}{L} \quad (1.0 < r \le 2.0) \qquad (2)$$
Equation (3) is obtained by solving Equation (2) with respect to L. Accordingly, to multiply the number of samples in the original waveform (FIG. 15A) by r, only the point P0′ has to be determined as shown in Equation (4).
$$L = \frac{W}{r-1} \qquad (3)$$
$$P_0' = P_0 + L \qquad (4)$$
Furthermore, Equation (6) is obtained by letting 1/r be equal to R as shown in Equation (5).
$$R = \frac{1}{r} \quad (0.5 \le R < 1.0) \qquad (5)$$
$$L = \frac{W \cdot R}{1-R} \qquad (6)$$
By using a variable R in this manner, an expression of “playback of the original waveform (FIG. 15A) at R-fold speed” can be used. Hereinafter, this variable R is referred to as a speech speed converting rate. Additionally, in the example shown in FIGS. 15A and 15B, the number of samples L is equivalent to approximately 2.5 W, which corresponds to approximately 0.7-fold slow playback.
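The steps leading from Equations (2) and (5) to Equation (6), together with the approximate 0.7-fold figure quoted for FIGS. 15A and 15B, can be written out as a short worked check (the value L = 2.5 W is the approximate value stated above):

```latex
\begin{aligned}
r &= \frac{W+L}{L} \;\Rightarrow\; L\,(r-1) = W \;\Rightarrow\; L = \frac{W}{r-1},\\[4pt]
R &= \frac{1}{r} \;\Rightarrow\; L = \frac{W}{1/R - 1} = \frac{W \cdot R}{1-R},\\[4pt]
L &= 2.5\,W \;\Rightarrow\; r = \frac{W + 2.5\,W}{2.5\,W} = 1.4,
\qquad R = \frac{1}{1.4} \approx 0.71 .
\end{aligned}
```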
After the completion of processing on the interval between the point P0 and the point P0′ of the original waveform (FIG. 15A), the point P0′ is set as a point P1, i.e., an origin, and similar operations are repeated.
Compression of an original waveform will be described next. FIGS. 16A to 16D show an example of compression of an original waveform using PICOLA. Firstly, intervals A and B having similar waveforms are found from an original waveform (FIG. 16A). The intervals A and B have an identical number of samples. A fade-out waveform (FIG. 16B) is then generated from the interval A. Similarly, a fade-in waveform (FIG. 16C) is generated from the interval B. A compressed waveform (FIG. 16D) is obtained by adding the waveform shown in FIG. 16B and the waveform shown in FIG. 16C. By performing the above-described operations, the intervals A and B are changed into an interval A×B.
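The compression of one pair of intervals can be sketched in the same spirit. The function name `compress_pair` is hypothetical, and the linear fade weights are again an assumption rather than something the description prescribes.

```python
def compress_pair(a, b):
    """Replace the intervals A, B with the single interval AxB (FIGS. 16A to 16D)."""
    if len(a) != len(b):
        raise ValueError("intervals must have the same number of samples")
    n = len(a)
    # For compression, the interval A fades out while the interval B fades in,
    # so the result starts like A and ends like B (linear weights assumed).
    return [((n - i) * a[i] + i * b[i]) / n for i in range(n)]


a = [1.0, 2.0, 3.0, 4.0]
b = [1.0, 2.0, 3.0, 4.0]  # perfectly similar intervals
print(compress_pair(a, b))  # 2W samples become W samples
```

Note that the roles of the two intervals are swapped relative to expansion, where the interval B fades out and the interval A fades in.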
FIGS. 17A and 17B show a method for compressing a waveform to a given length. Firstly, as shown in FIGS. 14A to 14C, a processing start point P0 is set as an origin, and a value j that gives the minimum value for the function D(j) is determined. The interval length W is set to j. As shown in FIGS. 17A and 17B, a cross-fade waveform of waveforms in the intervals 1601 and 1602 is generated in an interval 1603. A waveform in an interval from the point P0 to a point P0′ of the original waveform (FIG. 17A) excluding the intervals 1601 and 1602 is copied behind the compressed waveform (FIG. 17B). With the above-described operations, the number of samples in the compressed waveform (FIG. 17B) is decreased to L samples from W+L samples in the interval from the point P0 to the point P0′ of the original waveform (FIG. 17A). That is, the number of samples is multiplied by “r”.
$$r = \frac{L}{W+L} \quad (0.5 \le r < 1.0) \qquad (7)$$
Equation (8) is obtained by solving Equation (7) with respect to L. Accordingly, to multiply the number of samples in the original waveform (FIG. 17A) by r, only the point P0′ has to be determined as shown in Equation (9).
$$L = \frac{W \cdot r}{1-r} \qquad (8)$$
$$P_0' = P_0 + (W+L) \qquad (9)$$
Furthermore, Equation (11) is obtained by letting 1/r be equal to R as shown in Equation (10).
$$R = \frac{1}{r} \quad (1.0 < R \le 2.0) \qquad (10)$$
$$L = \frac{W}{R-1} \qquad (11)$$
By using a variable R in this manner, an expression of “playback of the original waveform (FIG. 17A) at R-fold speed” can be used. After the completion of processing on the interval between the point P0 and the point P0′ of the original waveform (FIG. 17A), the point P0′ is set as a point P1, i.e., an origin, and similar operations are repeated.
In the example shown in FIGS. 17A and 17B, the number of samples L is equivalent to approximately 1.5 W, which corresponds to approximately 1.7-fold fast playback.
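Equations (6) and (11) can be combined into one small helper that returns the value L for either direction. The function name `segment_length` is hypothetical, and rounding L to a whole number of samples is an assumption not stated in the description.

```python
def segment_length(w, rate):
    """Return L in samples for interval length W and speech speed converting rate R."""
    if 0.5 <= rate < 1.0:
        # Expansion (slow playback), Equation (6): L = W * R / (1 - R)
        return round(w * rate / (1.0 - rate))
    if 1.0 < rate <= 2.0:
        # Compression (fast playback), Equation (11): L = W / (R - 1)
        return round(w / (rate - 1.0))
    raise ValueError("R must satisfy 0.5 <= R < 1.0 or 1.0 < R <= 2.0")


print(segment_length(80, 0.8))   # expansion
print(segment_length(100, 1.5))  # compression
```

At the two extremes, R = 0.5 and R = 2.0, the helper returns L = W, so every processed segment consists only of the similar intervals and no additional samples are copied.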
FIG. 18 is a flowchart showing a process flow of waveform expansion in PICOLA. At STEP S1001, whether an audio signal to be processed exists in an input buffer is determined. If the audio signal does not exist in the input buffer, the process is terminated. If the audio signal to be processed exists, the process proceeds to STEP S1002. At STEP S1002, a processing start point P is set as an origin, and a value j that gives a minimum value for the function D(j) is determined. An interval length W is set equal to the value j. At STEP S1003, a value L is determined from a speech speed converting rate R specified by a user. At STEP S1004, data corresponding to an interval A for W samples from the processing start point P is output to an output buffer. At STEP S1005, a cross-fade waveform of waveforms in the interval A containing W samples from the processing start point P and the interval B containing the next W samples is determined and set as an interval C. At STEP S1006, the data in the interval C is output to the output buffer. At STEP S1007, data for L−W samples is output (copied) to the output buffer from a point P+W in the input buffer. At STEP S1008, the processing start point P is moved to the point P+L. The process then returns to STEP S1001, and the above-described steps are repeated.
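The expansion flow of FIG. 18 can be sketched as a loop over a list of samples. Several points are assumptions made only for illustration: the function name `picola_expand`, the caller-supplied callable `find_w(signal, p)` standing in for the similar waveform search, the linear cross-fade, the rounding of L, and the handling of the final samples (copied unchanged when fewer than 2W samples remain).

```python
def picola_expand(signal, rate, find_w):
    """Stretch `signal` for 0.5 <= rate < 1.0, following STEPs S1001 to S1008."""
    if not 0.5 <= rate < 1.0:
        raise ValueError("expansion requires 0.5 <= R < 1.0")
    out = []
    p = 0
    while p < len(signal):                            # S1001: input remaining?
        w = find_w(signal, p)                         # S1002: interval length W
        if w <= 0:
            raise ValueError("find_w must return a positive length")
        length = round(w * rate / (1.0 - rate))       # S1003: L from Equation (6)
        if p + 2 * w > len(signal):
            out.extend(signal[p:])                    # assumption: copy the tail as-is
            break
        a = signal[p:p + w]
        b = signal[p + w:p + 2 * w]
        out.extend(a)                                 # S1004: output interval A
        # S1005: interval C, where B fades out and A fades in (linear, assumed)
        c = [((w - i) * b[i] + i * a[i]) / w for i in range(w)]
        out.extend(c)                                 # S1006: output interval C
        out.extend(signal[p + w:p + length])          # S1007: L - W samples from P+W
        p += length                                   # S1008: move P to P+L
    return out


pattern = [0.0, 1.0, 2.0, 3.0]
stretched = picola_expand(pattern * 6, 0.5, lambda s, p: 4)  # stand-in search: fixed W = 4
print(len(pattern * 6), len(stretched))
```

Each pass consumes L input samples and produces W+L output samples, which is the ratio given by Equation (2).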
FIG. 19 is a flowchart showing a process flow of waveform compression in PICOLA. At STEP S1101, whether an audio signal to be processed exists in an input buffer is determined. If the audio signal does not exist, the process is terminated. If the audio signal to be processed exists, the process proceeds to STEP S1102. At STEP S1102, a processing start point P is set as an origin, and a value j that gives a minimum value for the function D(j) is determined. An interval length W is set equal to the value j. At STEP S1103, a value L is determined from a speech speed converting rate R specified by a user. At STEP S1104, a cross-fade waveform of waveforms in the interval A containing W samples from the processing start point P and the interval B containing the next W samples is determined and set as an interval C. At STEP S1105, the data in the interval C is output to an output buffer. At STEP S1106, data for L−W samples is output (copied) to the output buffer from a point P+2W in the input buffer. At STEP S1107, the processing start point P is moved to the point P+(W+L). The process then returns to STEP S1101, and the above-described steps are repeated.
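The compression flow of FIG. 19 can be sketched in the same way. As before, the function name `picola_compress`, the caller-supplied callable `find_w(signal, p)`, the linear cross-fade, the rounding of L, and the treatment of the final samples are assumptions for illustration only.

```python
def picola_compress(signal, rate, find_w):
    """Shorten `signal` for 1.0 < rate <= 2.0, following STEPs S1101 to S1107."""
    if not 1.0 < rate <= 2.0:
        raise ValueError("compression requires 1.0 < R <= 2.0")
    out = []
    p = 0
    while p < len(signal):                            # S1101: input remaining?
        w = find_w(signal, p)                         # S1102: interval length W
        if w <= 0:
            raise ValueError("find_w must return a positive length")
        length = round(w / (rate - 1.0))              # S1103: L from Equation (11)
        if p + 2 * w > len(signal):
            out.extend(signal[p:])                    # assumption: copy the tail as-is
            break
        a = signal[p:p + w]
        b = signal[p + w:p + 2 * w]
        # S1104: interval C, where A fades out and B fades in (linear, assumed)
        c = [((w - i) * a[i] + i * b[i]) / w for i in range(w)]
        out.extend(c)                                 # S1105: output interval C
        out.extend(signal[p + 2 * w:p + w + length])  # S1106: L - W samples from P+2W
        p += w + length                               # S1107: move P to P+(W+L)
    return out


pattern = [0.0, 1.0, 2.0, 3.0]
shortened = picola_compress(pattern * 6, 2.0, lambda s, p: 4)  # stand-in search: fixed W = 4
print(len(pattern * 6), len(shortened))
```

Each pass consumes W+L input samples and produces L output samples, which is the ratio given by Equation (7).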
FIG. 20 shows an example of a configuration of a speech speed converting apparatus 100 using PICOLA. An input buffer 101 buffers an audio signal to be processed. A similar waveform length extracting unit 102 determines a value j that gives a minimum value for a function D(j) using the audio signal contained in the input buffer 101, and sets an interval length W equal to j. The input buffer 101 is supplied with the information about the interval length W determined by the similar waveform length extracting unit 102. The input buffer 101 utilizes the interval length W for buffer operations. The similar waveform length extracting unit 102 supplies the audio signals for 2 W samples to a connected waveform generating unit 103. The connected waveform generating unit 103 cross-fades the received audio signals for 2 W samples to generate a cross-fade waveform for W samples. Audio signals are sent to an output buffer 104 from the input buffer 101 and the connected waveform generating unit 103 in accordance with the speech speed converting rate R. An audio signal generated in the output buffer 104 is output from the speech speed converting apparatus as an output audio signal.
Now, a similar waveform length extracting process using a speech speed converting algorithm PICOLA will be described with reference to flowcharts shown in FIGS. 21 and 22. At STEP S1201, an index j is set to an initial value WMIN. At STEP S1202, a subroutine is executed. The subroutine calculates the function D(j) represented by Equation (12) as a scale for measuring the similarity.
$$D(j) = \frac{1}{j} \sum_{i=0}^{j-1} \{f(i) - f(j+i)\}^2 \qquad (12)$$
Here, f indicates the input audio signal. For example, in the example shown in FIGS. 14A to 14C, f(i) indicates the sample located i samples after the point P0. Additionally, Equations (1) and (12) represent the same content. Equation (12) is used hereinafter.
At STEP S1203, the value of the function D(j) determined by the subroutine is assigned to a variable min, and the index j is assigned to the interval length W. At STEP S1204, the index j is incremented by 1. At STEP S1205, whether the index j is greater than WMAX is determined. If the index j is not greater than WMAX, the process proceeds to STEP S1206. On the other hand, if the index j is greater than WMAX, the process is terminated.
The value of the variable W at the time of termination of the process corresponds to the index j that minimizes the function D(j), i.e., the length of a similar waveform. The value of the variable min at that time indicates the minimum value of the function D(j).
At STEP S1206, the subroutine determines the value of the function D(j) for the new index j. At STEP S1207, whether the value of the function D(j) determined at STEP S1206 is greater than the variable min is determined. If the value of the function D(j) is not greater than min, the process proceeds to STEP S1208. If the value of the function D(j) is greater than min, the process returns to STEP S1204. At STEP S1208, the value of the function D(j) is assigned to the variable min, and the value of the index j is assigned to the interval length W. The process then returns to STEP S1204.
FIG. 22 shows a process flow of the subroutine. At STEP S1209, an index i and a variable s are reset to 0. At STEP S1210, whether the index i is smaller than the index j is determined. If the index i is smaller than the index j, the process proceeds to STEP S1211. If the index i is not smaller than the index j, the process proceeds to STEP S1213. At STEP S1211, a square of a difference between the input audio signals is determined, and is added to the variable s.
$$s = s + \{f(i) - f(j+i)\}^2 \qquad (13)$$
At STEP S1212, the index i is incremented by 1, and the process returns to STEP S1210. At STEP S1213, a value of the function D(j) is set to a value obtained by dividing the variable s by the index j, and the subroutine is terminated.
$$D(j) = \frac{s}{j} \qquad (14)$$
FIG. 23 is a diagram for illustrating the similar waveform length extracting process described in FIGS. 21 and 22. In this example, WMIN and WMAX are set to 3 and 10, respectively. The value of the function D(j) is determined while sequentially increasing the index j by 1 from 3 to 10. The value of the function D(j) becomes smaller when waveforms are more similar. In this example, the value of the function D(j) becomes minimum when j=8, and the interval length W is therefore equal to 8.
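The process of FIGS. 21 and 22 can be sketched as two small functions. The names `similarity` and `find_similar_length`, the explicit start index `p`, and the test signal are hypothetical; the signal is merely constructed to have a period of 8 samples so that it mirrors the situation of FIG. 23, and is not data taken from the figure. The caller is assumed to supply at least 2·WMAX samples from the start point.

```python
def similarity(f, p, j):
    """D(j) of Equation (12), with f indexed from the start point p (FIG. 22)."""
    s = 0.0                                   # S1209: reset the accumulator
    for i in range(j):                        # S1210, S1212: loop while i < j
        d = f[p + i] - f[p + j + i]
        s += d * d                            # S1211: Equation (13)
    return s / j                              # S1213: Equation (14)


def find_similar_length(f, p, w_min, w_max):
    """Return (W, minimum D) over w_min <= j <= w_max (FIG. 21)."""
    w = w_min                                 # S1201
    best = similarity(f, p, w)                # S1202, S1203
    for j in range(w_min + 1, w_max + 1):     # S1204, S1205
        d = similarity(f, p, j)               # S1206
        if d <= best:                         # S1207: "not greater than min"
            best, w = d, j                    # S1208
    return w, best


period = [0, 3, 5, 2, -1, -4, -2, 1]          # hypothetical waveform, 8 samples per period
print(find_similar_length(period * 3, 0, 3, 10))  # → (8, 0.0)
```

Because STEP S1207 tests for "not greater than", a tie between two candidate lengths is resolved in favor of the larger j, and the sketch keeps that behavior by using `<=`.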
As described above, the speech speed converting algorithm PICOLA can expand and compress audio signals at a given speech speed converting rate R (where 0.5≦R<1.0 or 1.0<R≦2.0) by extracting the length of similar waveforms.
PICOLA is described in, for example, an article by Morita and Itakura entitled “Time-Scale Modification Algorithm for Speech By Use of Pointer Interval Control Overlap and Add (PICOLA) and its Evaluation”, Proceedings of the National Meeting of the Acoustical Society of Japan, October, 1986, pp. 149-150.