The embodiments described herein relate to a voice data processing apparatus, a voice data processing method and an imaging apparatus. More particularly, the embodiments described herein relate to a voice data processing apparatus and a voice data processing method both of which convert voice data to voice playback data so as to correspond to a magnification of a playback speed or velocity at the reproduction of the voice data by an OLA (Overlap-Add) method. The embodiments described herein also relate to an imaging apparatus including the voice data processing apparatus.
An imaging apparatus such as a magnetic resonance imaging (MRI) apparatus executes scans on a photographing or imaging area of a subject thereby to execute imaging on the photographing area.
For example, the magnetic resonance imaging apparatus transmits RF pulses to the imaging area of the subject in an imaging space formed with a static magnetic field, thereby exciting proton spins in the imaging area by a nuclear magnetic resonance (NMR) phenomenon, and receives magnetic resonance (MR) signals generated by the excited spins. Thereafter, the magnetic resonance signals obtained by execution of the scan are used as raw data to generate a magnetic resonance image of the imaging area of the subject.
In such an imaging apparatus, body-motion artifacts may occur in an image generated for a subject that is a living body, such as a human body, because body motion such as breathing occurs in the subject during execution of each scan.
Therefore, when imaging is carried out, respiration guide information for guiding the breathing exercises is transmitted to the subject held in the imaging space by voice to prevent the occurrence of body motion due to the breathing, for example.
For example, voice data instructing the subject to stop breathing is automatically reproduced and outputted prior to the start of each scan so as to match the timing at which the scan is executed. Namely, the respiration guide information is transmitted to the subject by voice using a so-called Auto Voice function.
There is a case in which upon the reproduction/output of the voice data as described above, the magnification of a playback velocity for the voice data is changed.
In the imaging apparatus, for example, the magnification of the playback velocity for the voice data is changed so that the playback of the voice indicative of the respiration guide information is completed within the period from the injection of a contrast agent into the blood flowing in the subject until the injected contrast agent reaches the imaging area in which the imaging is executed on the subject.
Here, data processing for converting the voice data to voice playback data so as to correspond to the set magnification of playback velocity is executed and the converted voice playback data is reproduced and outputted.
When the playback velocity is changed, the musical pitch of the voice generally changes. Described concretely, when the magnification of the playback velocity is raised (the playback velocity is accelerated), the voice becomes higher pitched, whereas when the magnification of the playback velocity is reduced (the playback velocity is made slow), the voice becomes lower pitched. Because the musical pitch of the voice reproduced in this way changes, there is a case in which it is not easy for the subject to hear the reproduced voice accurately, which makes it difficult to execute imaging efficiently.
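The relationship between playback velocity and musical pitch can be illustrated with a minimal Python sketch. It assumes a 200 Hz test tone sampled at 8000 Hz and a naive twofold speed-up that simply keeps every second sample; the helper name zero_crossing_freq and all numerical values are illustrative assumptions and are not taken from the embodiments.

```python
import math

def zero_crossing_freq(samples, rate):
    """Estimate the frequency (Hz) of a tone by counting upward zero crossings."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    return crossings * rate / len(samples)

rate = 8000  # assumed sampling rate in samples per second
# One second of a 200 Hz tone; the small phase offset keeps samples off exact zero.
tone = [math.sin(2 * math.pi * 200 * t / rate - 0.1) for t in range(rate)]

# Naive speed-up with magnification V = 2: keep every second sample
# and play the result back at the unchanged sampling rate.
fast = tone[::2]

print(zero_crossing_freq(tone, rate))  # 200.0
print(zero_crossing_freq(fast, rate))  # 400.0
```

Under these assumptions the pitch scales with the playback-velocity magnification V, which is the change that the OLA method described below is intended to suppress.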
In order to improve such an imperfection or problem, an OLA method has been known as a method for suppressing a change in musical pitch (refer to, for example, Japanese Unexamined Patent Publication No. Hei 08(1996)-287612, Japanese Unexamined Patent Publication No. 2005-266571, and European Patent EP 0865026).
A WSOLA (Waveform Similarity Overlap-Add) method has been known as a method for further improving the OLA method (refer to, for example, W. Verhelst, M. Roelands, “An Overlap-Add Technique Based on Waveform Similarity (WSOLA) for High Quality Time-Scale Modification of Speech”, Acoustics, Speech, and Signal Processing, 1993. ICASSP-93.).
FIGS. 6A through 6E are respectively diagrams showing data processing for converting voice data to voice playback data so as to correspond to a magnification of a playback velocity at the reproduction of the voice data by an OLA method.
In the OLA method, voice data D is inputted as shown in FIG. 6A. Thereafter, a plurality of voice data blocks Bn (where n=1, 2, . . . , i)(where i: integer) are set to the voice data D as shown in FIG. 6B.
Here, the voice data blocks Bn are set in such a manner that the lengths (time intervals) Iin of the respective voice data blocks Bn on the time base become identical to one another.
Described concretely, each of the lengths Iin of the voice data blocks Bn is defined to be a value obtained by multiplying a predetermined value Iout by a playback-velocity magnification V. For example, the predetermined value Iout is assumed to be 90 ms and the length Iin of each voice data block Bn is assumed to be 180 ms when the playback velocity is set to a playback velocity equal to twice a reference velocity.
Next, as shown in FIG. 6C, a plurality of voice data segments Sn (where n=1, 2, . . . , i)(where i: integer) are set to the voice data D so as to correspond to the set voice data blocks Bn.
Here, the start point on the time base of each voice data segment Sn corresponds to the start point on the time base of the corresponding voice data block Bn. Further, the respective voice data segments Sn are defined such that the lengths LS thereof on the time base become identical.
Here, as shown in FIG. 6D, an area or region between the start point of the time base and the point of time at which a predetermined time has elapsed therefrom is set as a first overlap area Sna (where n=1, 2, . . . , i)(where i: integer) at each of the voice data segments Sn set as described above. At each of the voice data segments Sn, an area or region from the end point of the time base to the point of time at which a predetermined time is retraced therefrom is set as a second overlap area Snb (where n=1, 2, . . . , i)(where i: integer).
Described concretely, the value obtained by adding the length LO of each of the overlap areas Sna and Snb to a predetermined value Iout is set as the length LS of each voice data segment Sn. Assuming that for example, the predetermined value Iout is 90 ms and the length LO of each of the overlap areas Sna and Snb is 10 ms, the length LS of each voice data segment Sn is set as 100 ms.
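The block and segment geometry described above can be checked with a short Python sketch. It uses the values given in the text (Iout = 90 ms, LO = 10 ms, V = 2); the function name ola_layout and the 600 ms total length of the voice data D are hypothetical choices made only for illustration.

```python
def ola_layout(total_ms, v, i_out_ms=90, lo_ms=10):
    """Return (Iin, LS, segments) for voice data of length total_ms.

    Iin = Iout * V is the block length, LS = Iout + LO is the segment length,
    and each segment Sn starts at the start point of its block Bn.
    Segments are (start_ms, end_ms) pairs; a trailing block that is too
    short to hold a full segment is skipped (an assumption of this sketch).
    """
    i_in = i_out_ms * v
    ls = i_out_ms + lo_ms
    segments = []
    n = 0
    while n * i_in + ls <= total_ms:
        segments.append((n * i_in, n * i_in + ls))
        n += 1
    return i_in, ls, segments

i_in, ls, segments = ola_layout(total_ms=600, v=2)
print(i_in, ls, segments)
# 180 100 [(0, 100), (180, 280), (360, 460)]
```

With these values, only the first 100 ms of each 180 ms block is retained, and consecutive retained segments later overlap by LO = 10 ms, so each segment contributes Iout = 90 ms of new playback time.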
Next, as shown in FIG. 6E, the second overlap area Snb set to each voice data segment Sn and the first overlap area S(n+1)a set to its subsequent voice data segment Sn+1 are combined so as to overlap each other, thereby generating voice playback data DS.
Described concretely, the second overlap area S1b set to the first voice data segment S1, and the first overlap area S2a set to the second voice data segment S2 sided with the first voice data segment S1 along the time base are combined so as to overlap each other. The respective voice data segments Sn are processed sequentially in like manner. Namely, data processing is repeated in such a manner that after similar processing has been executed on the second voice data segment S2 and the third voice data segment S3, the third voice data segment S3 and the fourth voice data segment S4 are subjected to the similar processing, whereby voice playback data DS is generated.
Here, voice data in the second overlap area Snb provided at the trailing end of each voice data segment Sn and voice data in the first overlap area S(n+1)a provided at the leading end of its subsequent voice data segment Sn+1 are combined in such a manner that the power of the voice data in the mutually overlapping areas Snb and S(n+1)a is normalized. For example, each of the voice data segments Sn is multiplied by a trapezoidal window function, after which the segments are combined.
Therefore, according to the OLA method, a change in the musical pitch at the time that the playback velocity is changed can be suppressed.
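The OLA processing described above can be sketched in Python as follows. This is a minimal illustration rather than the implementation of any cited publication. It assumes a sampling rate of 1000 samples per second, so that one sample corresponds to 1 ms, and it assumes that the trapezoidal window uses linear, complementary ramps of length LO so that overlapping weights sum to one. The name ola_time_scale is hypothetical.

```python
def ola_time_scale(samples, v, rate=1000, i_out_ms=90, lo_ms=10):
    """Convert voice data to voice playback data for magnification v by OLA."""
    hop_out = rate * i_out_ms // 1000   # Iout in samples
    lo = rate * lo_ms // 1000           # LO in samples
    hop_in = int(round(hop_out * v))    # Iin = Iout * V in samples
    ls = hop_out + lo                   # LS = Iout + LO in samples

    # Trapezoidal window: ramp up over LO, flat, ramp down over LO.
    # The two ramps are complementary, so they sum to 1 where segments overlap.
    window = [1.0] * ls
    for k in range(lo):
        w = (k + 0.5) / lo
        window[k] = w
        window[ls - 1 - k] = w

    out = []
    n = 0
    while n * hop_in + ls <= len(samples):
        seg = samples[n * hop_in : n * hop_in + ls]   # segment Sn
        start = n * hop_out                           # position in playback data DS
        out.extend([0.0] * (start + ls - len(out)))   # grow DS as needed
        for k in range(ls):
            out[start + k] += seg[k] * window[k]      # overlap-add
        n += 1
    return out

# Constant-amplitude input: the interior of the output stays at amplitude 1,
# which shows that the overlapping window ramps normalize the combined data.
out = ola_time_scale([1.0] * 10000, v=2)
print(len(out))  # 5050
```

In this sketch, 10 s of input is reduced to about 5 s of playback data, while each retained segment is copied at its original sample spacing, which is why the musical pitch within each segment is unchanged.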
In the OLA method, however, there is a case in which the waveform of the voice data in the second overlap area Snb of each voice data segment Sn is different markedly from the voice data in the first overlap area S(n+1)a caused to overlap with its subsequent voice data segment Sn+1. Therefore, there is a case in which the voice combined in the mutually-related overlap areas Snb and S(n+1)a becomes unnatural.
In order to improve such an imperfection or problem, there has been proposed a WSOLA method in which the OLA method has been improved.
FIGS. 7A through 7D are respectively diagrams showing data processing for converting voice data to voice playback data so as to correspond to a magnification of a playback velocity at the reproduction of the voice data by the WSOLA method.
In the WSOLA method, in a manner similar to the OLA method, the voice data blocks Bn are set to the voice data D as shown in FIGS. 6A through 6C. Thereafter, voice data segments Sn are set so as to correspond to the respective set voice data blocks Bn.
However, in the WSOLA method unlike the OLA method, the position on the time base, of the voice data segment Sn+1 following each voice data segment Sn is adjusted after the execution of Steps shown in FIGS. 6A through 6C in such a manner that the waveform of voice data in an area including the second overlap area Snb at each voice data segment Sn and the waveform of voice data in an area including the first overlap area S(n+1)a at the voice data segment Sn+1 approximate each other. Namely, the voice data segment S(n+1) is moved in such a manner that similarity indicative of a resemblance between the waveform of the voice data in the area including the second overlap area Snb at the voice data segment Sn, and the waveform of the voice data in the area including the first overlap area S(n+1)a at its subsequent voice data segment Sn+1 becomes large.
Described concretely, as shown in FIG. 7A, with respect to each of the initially-set voice data segments Sn, an area extending from the start point on the time base to the point of time at which a predetermined time has elapsed therefrom is set as a first similarity calculation area Mna, and an area extending from the end point on the time base to the point of time at which a predetermined time is retraced therefrom is set as a second similarity calculation area Mnb.
At first and second voice data segments S1 and S2 sequentially arranged along the time base at the voice data segments Sn, the similarity between the waveform of voice data in a second similarity calculation area M1b set to the first voice data segment S1 and the waveform of voice data in a first similarity calculation area M2a set to the second voice data segment S2 is calculated. For example, cross-correlation function values for the mutual waveforms are calculated as similarities.
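The similarity calculation can be sketched in Python as follows. The text states only that cross-correlation function values are used; normalizing the value to the range from -1 to 1 is an assumption made here so that results are comparable between areas of different loudness, and the name similarity is hypothetical.

```python
import math

def similarity(a, b):
    """Normalized cross-correlation (zero lag) of two equal-length waveforms."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den else 0.0

# Illustrative waveforms standing in for the areas M1b and M2a.
m1b = [math.sin(2 * math.pi * k / 50) for k in range(10)]
same = list(m1b)                  # identical waveform: similarity close to 1
inverted = [-x for x in m1b]      # inverted waveform: similarity close to -1

print(similarity(m1b, same), similarity(m1b, inverted))
```

A value near 1 indicates that the two areas can be combined with little discontinuity, whereas a value near -1 indicates that the waveforms would largely cancel when overlapped.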
Next, as shown in FIG. 7B, the positions of the respective voice data segments Sn are adjusted.
Here, the above similarities are calculated at each of the positions to which the voice data segments Sn are moved along the time base. Each of the voice data segments Sn is then moved to the position at which the similarity calculated within its moving range becomes a maximum value.
When the second voice data segment S2 is moved within a predetermined range along the time base as shown in FIG. 7B for example, the position of the second voice data segment S2 is adjusted to a position shifted from an initial position by a predetermined interval d in such a manner that the similarity between the waveform of voice data in the second similarity calculation area M1b of the first voice data segment S1 and the waveform of voice data in the first similarity calculation area M2a of the second voice data segment S2 becomes a maximum value. This processing is sequentially executed on the respective voice data segments Sn to adjust the positions on the time base, of the voice data segments Sn.
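The position adjustment can be sketched in Python as a search over candidate shifts. The sketch assumes one sample per millisecond, a similarity calculation area of 10 samples, and a moving range of plus or minus 20 samples; these values, the periodic test waveform, and the name best_shift are illustrative assumptions. The similarity helper is repeated so that the block runs on its own.

```python
import math

def similarity(a, b):
    """Normalized cross-correlation (zero lag) of two equal-length waveforms."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den else 0.0

def best_shift(data, tail_start, head_start, length, max_shift):
    """Find the shift d of the next segment that maximizes similarity.

    data[tail_start : tail_start + length] plays the role of M1b, and
    data[head_start + d : head_start + d + length] plays the role of M2a
    after the segment S2 is moved by d from its initial position.
    """
    ref = data[tail_start : tail_start + length]
    best_d, best_s = 0, -2.0
    for d in range(-max_shift, max_shift + 1):
        lo = head_start + d
        if lo < 0 or lo + length > len(data):
            continue  # candidate position falls outside the voice data
        s = similarity(ref, data[lo : lo + length])
        if s > best_s:
            best_d, best_s = d, s
    return best_d, best_s

# Periodic test waveform with a period of 50 samples.
data = [math.sin(2 * math.pi * t / 50) for t in range(1000)]
# S1 occupies samples 0..99 (M1b = samples 90..99); S2 initially starts at 180.
d, s = best_shift(data, tail_start=90, head_start=180, length=10, max_shift=20)
print(d, s)  # d is 10, s is close to 1.0
```

For this waveform, moving the segment S2 by 10 samples places its leading area exactly two periods after the trailing area of S1, so the two waveforms coincide and the similarity reaches its maximum.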
Next, as shown in FIG. 7C, for example, the same area as the first similarity calculation area Mna set as described above is set as a first overlap area Sna. For example, the same area as the second similarity calculation area Mnb is set as a second overlap area Snb.
Thereafter, as shown in FIG. 7D, the first and second overlap areas Sna and Snb set as described above are sequentially combined so as to overlap each other along the time base, thereby generating voice playback data DS.
Thus, in the WSOLA method, the waveform of voice data in the second overlap area Snb of each voice data segment Sn and the waveform of voice data in the first overlap area S(n+1)a of its subsequent voice data segment Sn+1, which is caused to overlap therewith, are made similar to each other and combined together. Therefore, the voice playback data in which the voice data in the overlap areas Snb and S(n+1)a are combined becomes more continuous than in the OLA method, and the voice is reproduced in a natural musical pitch.
There is, however, a case in which the voice playback data is reproduced unnaturally even where the WSOLA method is applied. When, for example, the value of similarity between the waveform of the voice data in the second overlap area Snb of each voice data segment Sn and the waveform of the voice data in the first overlap area S(n+1)a of its subsequent voice data segment Sn+1 is small, that is, when the similarity is poor, the voice might not be reproduced in a natural musical pitch.
Thus, when the voice data is converted to its corresponding voice playback data so as to correspond to the magnification of the playback velocity at the reproduction of the voice data, and the converted voice playback data is reproduced and outputted, the voice playback data becomes discontinuous and the voice quality might be deteriorated as in the case of the reproduction of voice in the unnatural musical pitch and the like.