1. Field of the Invention
The present invention relates to an audio-signal time-axis expansion/compression method and device for changing the playback speed of music or the like.
2. Description of the Related Art
PICOLA (Pointer Interval Control Overlap and Add) has been known as a time-domain time-axis expansion/compression algorithm for digital speech signals (see “Expansion/compression on the audio time-axis using the duplication adding method by pointer amount-of-movement control (PICOLA) and its evaluation”, by Morita and Itakura, Acoustical Society of Japan collected papers, October 1986, pp 149-150). This algorithm has the advantage that, although its processing is simple and lightweight, good sound quality can be obtained for speech signals. PICOLA will be described briefly below with reference to the drawings. In the present specification, signals other than speech, which are included in music or the like, are referred to as acoustic signals, and speech signals and acoustic signals are collectively referred to as audio signals.
FIG. 22 illustrates an example wherein an original waveform is expanded with PICOLA. First, periods A and B, which have a similar waveform, are found from an original waveform (a). The number of samples at the period A and the number of samples at the period B are the same. Subsequently, a waveform (b) which fades out at the period B is created. Similarly, a waveform (c) which fades in from the period A is created, and the waveform (b) and the waveform (c) are added, thereby obtaining an expanded waveform (d). Adding a waveform which fades out and a waveform which fades in in this way is referred to as cross-fade. If the cross-fade period between the period A and the period B is represented as a period A×B, the above-described operations change the period A and the period B into an expanded sequence of a period A, a period A×B, and a period B.
FIG. 23 is a schematic view illustrating a method for detecting a period length W of the period A and the period B which have a similar waveform. First, with a processing start position P0 as a starting point, the period A and the period B, each j samples long, are determined such as shown in (a) in FIG. 23. While j is gradually increased such as (a) in FIG. 23→(b) in FIG. 23→(c) in FIG. 23, the j that makes the periods A and B the most similar is obtained. As for a scale for measuring similarity, the following function D(j) can be employed, for example.
D(j) = (1/j) Σ {x(i) − y(i)}^2  (i = 0 through j−1)  (1)
This D(j) is calculated in a range of WMIN≦j≦WMAX, and the j which makes D(j) the minimum is obtained. The j at this time is the period length W of the period A and period B. Here, x(i) represents each of the sample values of the period A, and y(i) represents each of the sample values of the period B. Also, WMAX and WMIN are values corresponding to frequencies of around 50 Hz through 250 Hz; if the sampling frequency is 8 kHz, WMAX is around 160 (8000/50), and WMIN is around 32 (8000/250). With the example in FIG. 23, j at (b) is selected as the j which makes the function D(j) the minimum.
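By way of illustration only, the search for the period length W using Expression (1) might be sketched in Python as follows. The function names, the default values of 32 and 160 (taken from the 8 kHz example above), and the handling of an input that is too short are assumptions made for this sketch and do not appear in the cited reference.

```python
def mean_squared_difference(samples, start, j):
    """D(j) of Expression (1): period A is samples[start:start+j],
    period B is the following j samples."""
    total = 0.0
    for i in range(j):
        diff = samples[start + i] - samples[start + j + i]
        total += diff * diff
    return total / j


def find_similar_length(samples, start, w_min=32, w_max=160):
    """Return the j in [w_min, w_max] that minimizes D(j).

    Assumption of this sketch: None is returned when fewer than
    2 * w_min samples remain after `start`.
    """
    best_j = None
    best_d = float("inf")
    for j in range(w_min, w_max + 1):
        if start + 2 * j > len(samples):
            break  # period B would run past the end of the input
        d = mean_squared_difference(samples, start, j)
        if d < best_d:  # strict comparison keeps the shortest of equal minima
            best_j, best_d = j, d
    return best_j


# Usage: a sawtooth that repeats every 40 samples (200 Hz at 8 kHz).
sawtooth = [n % 40 for n in range(400)]
w = find_similar_length(sawtooth, 0)
```

For a strictly periodic input, D(j) is also zero at multiples of the true period; the strict comparison in this sketch keeps the shortest such j.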
FIG. 24 is a schematic view illustrating a method for expanding a waveform into an arbitrary length. First, as shown in FIG. 23, the j which makes the function D(j) the minimum is obtained with the processing start position P0 as a starting point, and W is substituted with j. Subsequently, as shown in FIG. 24, a period 2401 is copied to a period 2403, and the cross-fade waveform of the period 2401 and a period 2402 is created at a period 2404. Subsequently, the remaining period obtained by subtracting the period 2401 from a position P0 through a position P0′ of an original waveform (a) is copied to an expanded waveform (b). According to the above-described operation, L samples from the position P0 through position P0′ of the original waveform (a) become W+L samples at the expanded waveform (b), and the number of samples becomes r times.
r = (W+L)/L  (1.0<r≦2.0)  (2)
Solving Expression (2) for L yields Expression (3), and in the event of attempting to multiply the number of samples of the original waveform (a) by r, it can be found that the position P0′ is determined such as shown in Expression (4).
L = W·1/(r−1)  (3)
P0′ = P0+L  (4)
Further, defining 1/r such as shown in Expression (5) yields Expression (6).
R = 1/r  (0.5≦R<1.0)  (5)
L = W·R/(1−R)  (6)
Using R in this way, the processing can be expressed as playing the original waveform (a) at R-times speed. Hereinafter, this R is referred to as a speech rate conversion rate. Note that with the example in FIG. 24, the number of samples L is around 2.5 W, which is equivalent to slow playback of around 0.7-times speed.
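As a numerical check of Expressions (2) through (6), the following Python sketch computes L from W and the speech rate conversion rate R for the expansion case. The function name and the range check are assumptions made for illustration.

```python
def expansion_block_length(w, rate):
    """Expression (6): L = W * R / (1 - R), valid for 0.5 <= R < 1.0."""
    if not 0.5 <= rate < 1.0:
        raise ValueError("expansion requires 0.5 <= R < 1.0")
    return w * rate / (1.0 - rate)


# With L = 2.5 W as in the FIG. 24 example, Expression (2) gives
# r = (W + L) / L = 3.5 / 2.5 = 1.4, so R = 1 / r is about 0.714.
w = 100
length = expansion_block_length(w, 2.5 / 3.5)  # about 250 samples
r = (w + length) / length                       # about 1.4
```

At the limit R = 0.5 the expression gives L = W, so nothing remains to be copied after the cross-fade period and the waveform is expanded to twice its length.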
Upon the processing of the position P0 through the position P0′ of the original waveform (a) being completed, the position P0′ is newly regarded as a position P1 serving as the starting point of the processing, and the same processing is repeated.
Subsequently, description will be made regarding time-axis compression of an original waveform. FIG. 25 illustrates an example wherein an original waveform is compressed with PICOLA. First, periods A and B which have a similar waveform are found from the original waveform (a). The number of samples at the period A and the number of samples at the period B are the same. Subsequently, a waveform (b) which fades out at the period A is created. Similarly, a waveform (c) which fades in from the period B is created, and the waveform (b) and the waveform (c) are added, whereby a compressed waveform (d) can be obtained. The period A and period B are changed into a period A×B by performing the above-described operation.
FIG. 26 illustrates a method for compressing a waveform into an arbitrary length. First, as shown in FIG. 23, with the processing start position P0 as a starting point, j is obtained so as to make the function D(j) the minimum, and W is substituted with j. Subsequently, as shown in FIG. 26, the cross-fade waveform of a period 2601 and a period 2602 is created at a period 2603. Subsequently, the remaining period obtained by subtracting the period 2601 and period 2602 from a position P0 through a position P0′ of an original waveform (a) is copied to a compressed waveform (b). According to the above-described operations, W+L samples from the position P0 through position P0′ of the original waveform (a) become L samples at the compressed waveform (b), and the number of samples becomes r times.
r = L/(W+L)  (0.5≦r<1.0)  (7)
Rewriting this Expression (7) regarding L yields Expression (8), and in the event of multiplying the number of samples of the original waveform (a) by r times, it can be found that the position P0′ is determined such as shown in Expression (9).
L = W·r/(1−r)  (8)
P0′ = P0+(W+L)  (9)
Further, if 1/r is defined such as shown in Expression (10), Expression (11) is obtained.
R = 1/r  (1.0<R≦2.0)  (10)
L = W·1/(R−1)  (11)
Using R in this way, the processing can be expressed as playing the original waveform (a) at R-times speed. Upon the processing of the position P0 through the position P0′ of the original waveform (a) being completed, the position P0′ is newly regarded as a position P1 serving as the starting point of the processing, and the same processing is repeated.
With the example in FIG. 26, the number of samples L is around 1.5 W, which is equivalent to fast playback of around 1.7-times speed.
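As a numerical check of Expressions (7) through (11), the following Python sketch computes L from W and the speech rate conversion rate R for the compression case. The function name and the range check are assumptions made for illustration.

```python
def compression_block_length(w, rate):
    """Expression (11): L = W / (R - 1), valid for 1.0 < R <= 2.0."""
    if not 1.0 < rate <= 2.0:
        raise ValueError("compression requires 1.0 < R <= 2.0")
    return w / (rate - 1.0)


# With L = 1.5 W as in the FIG. 26 example, Expression (7) gives
# r = L / (W + L) = 1.5 / 2.5 = 0.6, so R = 1 / r is about 1.67.
w = 100
length = compression_block_length(w, 2.5 / 1.5)  # about 150 samples
r = length / (w + length)                         # about 0.6
```

At the limit R = 2.0 the expression gives L = W, so the 2W input samples are replaced by the W cross-faded samples alone.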
FIG. 27 is a flowchart illustrating the flow of waveform time-axis expansion processing of PICOLA. In step S1001, determination is made regarding whether or not there is any audio signal to be processed in the input buffer, and in the event that there is no audio signal, the processing ends. In the event that there is an audio signal to be processed, the flow proceeds to step S1002, where j which makes the function D(j) the minimum is obtained with the processing start position P as a starting point, and W is substituted with j. In step S1003, L is obtained from the speech rate conversion rate R specified by a user, and in step S1004, the period A equivalent to the W samples from the processing start position P is output to the output buffer. In step S1005, the cross-fade of the period A equivalent to the W samples from the processing start position P and the period B equivalent to the next W samples is obtained, which is referred to as a period C, and in step S1006, this period C is output to the output buffer. In step S1007, the L−W samples from the position P+W of the input buffer are output (copied) to the output buffer. In step S1008, the processing start position P is moved to P+L, and the flow returns to step S1001, where the processing is repeatedly performed.
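The expansion flow of FIG. 27 might be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the implementation of the cited reference: the function names are hypothetical, L is rounded to a whole number of samples, the loop ends when too few samples remain for a similarity search, and the unprocessed tail is copied to the output unchanged.

```python
def find_similar_length(samples, start, w_min, w_max):
    """Return the j in [w_min, w_max] minimizing D(j), or None if input is too short."""
    best_j, best_d = None, float("inf")
    for j in range(w_min, w_max + 1):
        if start + 2 * j > len(samples):
            break
        d = sum((samples[start + i] - samples[start + j + i]) ** 2
                for i in range(j)) / j
        if d < best_d:
            best_j, best_d = j, d
    return best_j


def crossfade(x, y):
    """z(i) = h*x(i) + (1-h)*y(i) with h = i/W: y fades out, x fades in."""
    w = len(x)
    return [(i / w) * x[i] + (1 - i / w) * y[i] for i in range(w)]


def picola_expand(samples, rate, w_min=32, w_max=160):
    """Time-axis expansion for a rate R with 0.5 <= R < 1.0."""
    if not 0.5 <= rate < 1.0:
        raise ValueError("expansion requires 0.5 <= R < 1.0")
    out = []
    p = 0
    while True:
        w = find_similar_length(samples, p, w_min, w_max)  # S1001, S1002
        if w is None:
            break
        length = max(w, round(w * rate / (1 - rate)))      # S1003, Expression (6)
        a = samples[p:p + w]
        b = samples[p + w:p + 2 * w]
        out.extend(a)                                      # S1004
        out.extend(crossfade(a, b))                        # S1005, S1006 (x = A, y = B)
        out.extend(samples[p + w:p + length])              # S1007: L-W samples from P+W
        p += length                                        # S1008
    out.extend(samples[p:])  # assumption: flush the unprocessed tail as-is
    return out


# Usage: expand a sawtooth with a 40-sample period to about 1.25 times its length.
sawtooth = [n % 40 for n in range(4000)]
expanded = picola_expand(sawtooth, 0.8)
```

Each pass consumes L input samples and emits W+L output samples, which matches Expression (2).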
FIG. 28 is a flowchart illustrating the flow of waveform time-axis compression processing of PICOLA. In step S1101, determination is made regarding whether or not there is any audio signal to be processed in the input buffer, and in the event that there is no audio signal, the processing ends. In the event that there is an audio signal to be processed, the flow proceeds to step S1102, where j which makes the function D(j) the minimum is obtained with the processing start position P as a starting point, and W is substituted with j. In step S1103, L is obtained from the speech rate conversion rate R specified by a user, and in step S1104, the cross-fade of the period A equivalent to the W samples from the processing start position P, and the period B equivalent to the next W samples is obtained, which is referred to as a period C, and in step S1105, this period C is output to the output buffer. In step S1106, the L−W samples from the position P+2W of the input buffer are output (copied) to the output buffer. In step S1107, the processing start position P is moved to P+(W+L), and the flow returns to step S1101, where the processing is repeatedly performed.
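The compression flow of FIG. 28 might be sketched in Python in the same manner. Again this is only an illustration under stated assumptions: the function names are hypothetical, L is rounded to a whole number of samples, the loop ends when too few samples remain for a similarity search, and the unprocessed tail is copied to the output unchanged.

```python
def find_similar_length(samples, start, w_min, w_max):
    """Return the j in [w_min, w_max] minimizing D(j), or None if input is too short."""
    best_j, best_d = None, float("inf")
    for j in range(w_min, w_max + 1):
        if start + 2 * j > len(samples):
            break
        d = sum((samples[start + i] - samples[start + j + i]) ** 2
                for i in range(j)) / j
        if d < best_d:
            best_j, best_d = j, d
    return best_j


def crossfade(x, y):
    """z(i) = h*x(i) + (1-h)*y(i) with h = i/W: y fades out, x fades in."""
    w = len(x)
    return [(i / w) * x[i] + (1 - i / w) * y[i] for i in range(w)]


def picola_compress(samples, rate, w_min=32, w_max=160):
    """Time-axis compression for a rate R with 1.0 < R <= 2.0."""
    if not 1.0 < rate <= 2.0:
        raise ValueError("compression requires 1.0 < R <= 2.0")
    out = []
    p = 0
    while True:
        w = find_similar_length(samples, p, w_min, w_max)  # S1101, S1102
        if w is None:
            break
        length = max(w, round(w / (rate - 1)))             # S1103, Expression (11)
        a = samples[p:p + w]
        b = samples[p + w:p + 2 * w]
        out.extend(crossfade(b, a))                        # S1104, S1105 (x = B, y = A)
        out.extend(samples[p + 2 * w:p + w + length])      # S1106: L-W samples from P+2W
        p += w + length                                    # S1107
    out.extend(samples[p:])  # assumption: flush the unprocessed tail as-is
    return out


# Usage: compress a sawtooth with a 40-sample period to about half its length.
sawtooth = [n % 40 for n in range(4000)]
compressed = picola_compress(sawtooth, 2.0)
```

Each pass consumes W+L input samples and emits L output samples, which matches Expression (7).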
FIG. 29 is one example of the configuration of a speech rate conversion device 100 according to PICOLA. An audio signal to be processed is first subjected to buffering in an input buffer 101. A similar-waveform-length extracting unit 102 obtains j which makes the function D(j) the minimum, and substitutes W with j. The W obtained by the similar-waveform-length extracting unit 102 is passed to the input buffer 101, and is employed for buffer operations. The similar-waveform-length extracting unit 102 passes 2W samples serving as audio signals to a connection-waveform generating unit 103. The connection-waveform generating unit 103 cross-fades the 2W samples serving as audio signals into W samples. The audio signals are transmitted from the input buffer 101 and the connection-waveform generating unit 103 to the output buffer 104 in accordance with the speech rate conversion rate R. The audio signal generated at the output buffer 104 is output from the speech rate conversion device 100 as an output audio signal.
FIG. 30 is a flowchart illustrating the flow of the processing in the connection-waveform generating unit 103 in the configuration example in FIG. 29. In the case of time-axis expansion, let us say that each of the sample values of the period A is x(i) (i=0, 1, and so on through W−1), and each of the sample values of the period B is y(i) (i=0, 1, and so on through W−1), and in the case of time-axis compression, let us say that each of the sample values of the period B is x(i) (i=0, 1, and so on through W−1), and each of the sample values of the period A is y(i) (i=0, 1, and so on through W−1). Also, let us say that each of the sample values after cross-fade is z(i) (i=0, 1, and so on through W−1).
In step S1201, the index i is reset to zero. In step S1202, determination is made regarding whether or not the index i is smaller than W, and in the case of being smaller than W, the flow proceeds to step S1203, and in the case of not being smaller than W, the processing ends. In step S1203, weight h=i/W is obtained, and in step S1204, a cross-fade signal z(i) is calculated.
z(i) = h·x(i)+(1−h)·y(i)  (12)
In step S1205, following the index i being incremented by one, the flow returns to step S1202, where the processing is repeatedly performed. According to the above-described processing, the cross-fade values of the x(i) and y(i) are stored in the z(i).
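The loop of FIG. 30 might be written in Python as follows. The function name and the length check are assumptions made for illustration; the step numbers in the comments follow the description above.

```python
def crossfade(x, y):
    """Cross-fade two W-sample periods according to Expression (12)."""
    if len(x) != len(y):
        raise ValueError("both periods must contain W samples")
    w = len(x)
    z = [0.0] * w
    i = 0                                     # S1201
    while i < w:                              # S1202
        h = i / w                             # S1203
        z[i] = h * x[i] + (1 - h) * y[i]      # S1204
        i += 1                                # S1205
    return z


# Usage: x fades in from weight 0 while y fades out from weight 1.
z = crossfade([1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0])
# z is [0.0, 0.25, 0.5, 0.75]
```

For expansion, x is the period A and y is the period B, so the period B fades out and the period A fades in; for compression the roles are exchanged.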
As described above with reference to FIGS. 22 through 30, an audio signal can be expanded/compressed with an arbitrary speech rate conversion rate R (0.5≦R<1.0, 1.0<R≦2.0) using the speech rate conversion algorithm PICOLA.