(1) Field of the Invention
The present invention relates to an audio restoration apparatus which restores distorted audio (including speech, music, alarms, and background audio such as the sound of a car) that has been distorted due to an audio recording failure, an intrusion of surrounding noises, an intrusion of transmission noises, and the like.
(2) Description of the Related Art
In recent years, our living spaces have become flooded with various types of audio, including artificial sounds such as background music (BGM) played in streets and alarms, and sounds generated by artificial objects such as cars. This poses problems in terms of safety, functionality and comfort. For example, at a train station in a big city, an announcement may be inaudible due to departure bells, train noise, the voices of surrounding people and the like. A voice on a mobile phone may be inaudible due to surrounding noise. A bicycle bell may be inaudible due to car noise. In such cases, safety, functionality and comfort are impaired.
In view of the above-mentioned changes in the social environment, there is a need to restore distorted audio to natural and listenable audio, and to provide a user with the restored audio. Here, distorted audio means audio that has been distorted due to an audio recording failure, an intrusion of environmental noises, an intrusion of transmission noises and the like. It is particularly important to restore the audio using audio which is similar to the real audio in terms of voice characteristic, voice tone, audio color, audio volume, reverberation characteristic, audio quality and the like.
There is a first conventional audio restoration method of restoring speech that includes a segment distorted by instantaneous noises, by replacing the distorted speech part with the waveform of a segment that is contiguous with it in time (for example, refer to Reference 1: “Ichi-channel nyuuryoku shingo chu toppatsusei zatsuon no hanbetsu to jokyo (Determination and removal of instantaneous noises in a one-channel input signal)”, Noguchi and three other authors, March 2004, Annual Meeting of the Acoustical Society of Japan). FIG. 1 shows the conventional audio restoration method disclosed in the above-mentioned Reference 1.
In FIG. 1, in the speech extraction Step 3201, speech parts are extracted by removing the segment containing instantaneous noises from the speech waveform distorted by the intrusion of the instantaneous noises. In the speech restoration Step 3202, the speech is restored by inserting the speech waveform of the segment immediately preceding the removed distorted segment into the position where the distorted segment was located (the disclosure on pp. 655 and 656 of Reference 1 is relevant to the present invention).
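The two steps described for FIG. 1 can be illustrated with the following minimal sketch. It is not taken from Reference 1; the function name `restore_by_preceding_segment`, the list-of-samples representation, and the assumption that the boundaries of the distorted segment are already known are all hypothetical choices made for illustration.

```python
def restore_by_preceding_segment(samples, start, end):
    """Replace samples[start:end] with the equally long segment just before it.

    Hypothetical sketch: `start` and `end` mark a distorted segment whose
    position is assumed to be known in advance.
    """
    length = end - start
    if length <= 0 or start < 0 or end > len(samples):
        raise ValueError("invalid distorted segment")
    if start < length:
        raise ValueError("not enough preceding samples to copy from")
    restored = list(samples)
    # Copy the immediately preceding waveform into the distorted position.
    restored[start:end] = samples[start - length:start]
    return restored


# Samples 4 and 5 are assumed to be corrupted by an instantaneous noise.
wave = [1, 2, 3, 4, 99, 98, 1, 2]
restored = restore_by_preceding_segment(wave, 4, 6)
```

In this sketch the corrupted samples are overwritten with the two samples that precede them, which gives a plausible result only when the waveform repeats locally.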
There is a second conventional audio restoration method relating to a vehicle traffic information providing apparatus which is mounted on a vehicle, receives a radio wave carrying vehicle traffic information sent from a broadcasting station, and provides a driver with the vehicle traffic information. The method is intended for restoring speech distorted due to an intrusion of transmission noises by having a linguistic analysis unit restore a phoneme sequence and then reading out the restored phoneme sequence through speech synthesis (for example, refer to Patent Reference 1: Japanese Laid-Open Patent Application No. 2000-222682). FIG. 2 shows the conventional audio restoration apparatus disclosed in Patent Reference 1.
In FIG. 2, a receiving apparatus 3302 receives a radio wave of vehicle traffic information sent from the broadcasting station 3301 and converts it into a speech signal. A speech recognition apparatus 3303 performs speech recognition of the speech signal and converts it into language data. A linguistic analysis apparatus 3304 performs linguistic analysis that compensates for missing parts based on language data with the same contents, which is repeatedly outputted from the speech recognition apparatus 3303 (the disclosures in claim 2 and FIG. 1 of Patent Reference 1 are relevant to the present invention). A speech synthesis apparatus 3305 reads out, through speech synthesis, the information judged as necessary from among the traffic status information represented by the phoneme sequence restored by the linguistic analysis apparatus 3304.
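The idea of compensating for missing parts using repeated outputs with the same contents can be illustrated with the following minimal sketch. It is not the analysis disclosed in Patent Reference 1; the function name `merge_repeated_transcripts`, the use of `None` to mark an unrecognized word, and the assumption that the repeated outputs are already aligned word by word are hypothetical.

```python
def merge_repeated_transcripts(transcripts):
    """Fill gaps in one recognition result using repeats of the same message.

    Hypothetical sketch: each transcript is a list of words of equal length,
    with None where recognition failed; alignment is assumed to be given.
    """
    if not transcripts:
        raise ValueError("at least one transcript is required")
    length = len(transcripts[0])
    if any(len(t) != length for t in transcripts):
        raise ValueError("transcripts must be aligned to the same length")
    merged = []
    for position in range(length):
        # Take the first repeat in which this position was recognized.
        candidates = [t[position] for t in transcripts if t[position] is not None]
        merged.append(candidates[0] if candidates else None)
    return merged


first_pass = ["traffic", None, "on", "route"]
second_pass = [None, "jam", "on", None]
merged = merge_repeated_transcripts([first_pass, second_pass])
```

A position that is missing in every repeat stays `None` in this sketch, which reflects the limit of compensation by repetition alone.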
There is a third conventional audio restoration method relating to a speech packet interpolation method of interpolating a missing part using a speech packet signal inputted before the missing part. The method is intended for interpolating the speech packet corresponding to the missing part by calculating, by means of non-standardized differential operation processing, a best-match waveform with regard to the speech packet signal inputted before the missing part, each time a sample value corresponding to a template is inputted (for example, refer to Patent Reference 2: Japanese Laid-Open Patent Application No. 2-4062 (claim 1)).
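Waveform-level interpolation of this kind can be illustrated with the following minimal sketch. It is not the method claimed in Patent Reference 2: the function name `conceal_gap` is hypothetical, the template is assumed to be the last few samples received before the loss, and the "non-standardized differential operation" is interpreted here, as an assumption, as an unnormalized sum of absolute differences.

```python
def conceal_gap(history, gap_len, template_len):
    """Predict `gap_len` missing samples from the samples received so far.

    Hypothetical sketch: the last `template_len` samples of `history` form
    the template; the best match is the earlier window with the smallest
    unnormalized sum of absolute differences.
    """
    if gap_len <= 0 or template_len <= 0:
        raise ValueError("gap_len and template_len must be positive")
    if len(history) < template_len + gap_len:
        raise ValueError("history is too short")
    template = history[-template_len:]
    best_start, best_cost = None, None
    # A candidate window must be followed by gap_len samples inside history.
    last_start = len(history) - template_len - gap_len
    for start in range(last_start + 1):
        window = history[start:start + template_len]
        cost = sum(abs(a - b) for a, b in zip(window, template))
        if best_cost is None or cost < best_cost:
            best_start, best_cost = start, cost
    follow = best_start + template_len
    # The samples that followed the best match stand in for the missing part.
    return history[follow:follow + gap_len]


received = [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]
filled = conceal_gap(received, gap_len=2, template_len=2)
```

Because the search operates purely on waveform similarity, the sketch can only reproduce samples that already appeared in the received signal.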
There is a fourth conventional audio restoration method relating to speech communication where packets are used. The method uses the following: a judgment unit which judges whether or not an inputted speech signal data sequence includes a missing segment and outputs a first signal indicating the judgment result; a speech recognition unit which performs speech recognition of the inputted speech signal data sequence using an acoustic model and a language model, and outputs the recognition result; a speech synthesis unit which performs speech synthesis based on the recognition result of the speech recognition unit, and outputs a speech signal; and a mixing unit which mixes the inputted speech signal data sequence and the output of the speech synthesis unit at a mixing rate which changes in response to the first signal, and outputs the mixing result (for example, refer to Patent Reference 3: Japanese Laid-Open Patent Application No. 2004-272128 (claim 1, and FIG. 1)). FIG. 3 shows the conventional audio restoration apparatus disclosed in the above-mentioned Patent Reference 3.
In FIG. 3, an input unit 3401 extracts speech signal data parts from the respective speech packets which are incoming and outgoing, and outputs them sequentially. A speech recognition unit 3404 performs speech recognition of the speech signal data outputted in time sequence from the input unit 3401, using an acoustic model for speech recognition 3402 and a language model 3403, and outputs the recognition results in time sequence. A monitor unit 3407 monitors the respective packets which are incoming and outgoing, and provides the speech recognition unit 3404 with supplemental information indicating whether or not a packet loss has occurred. A speech synthesis unit 3406 performs speech synthesis using an acoustic model for speech synthesis 3405 based on the phoneme sequence outputted from the speech recognition unit 3404, and outputs a digital speech signal. A buffer 3408 stores the outputs from the input unit 3401. A signal mixing unit 3409 is controlled by the monitor unit 3407, and selectively outputs one of (a) the outputs of the speech synthesis unit 3406 in a period corresponding to a packet loss and (b) the outputs of the buffer 3408 in periods other than the period corresponding to the packet loss.
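The selective output of the signal mixing unit 3409 can be illustrated with the following minimal sketch. It is not the implementation of Patent Reference 3; the function name `select_output`, the frame-by-frame list representation, and the boolean loss flags standing in for the control by the monitor unit 3407 are hypothetical.

```python
def select_output(buffered, synthesized, lost):
    """Choose, frame by frame, between buffered and synthesized speech.

    Hypothetical sketch: `lost[i]` is True when frame i falls in a period
    corresponding to a packet loss, as reported by a monitor.
    """
    if not (len(buffered) == len(synthesized) == len(lost)):
        raise ValueError("all inputs must cover the same frames")
    output = []
    for buffered_frame, synthesized_frame, is_lost in zip(buffered, synthesized, lost):
        # (a) synthesized speech during a loss, (b) buffered input otherwise.
        output.append(synthesized_frame if is_lost else buffered_frame)
    return output


buffered_frames = ["in0", None, "in2"]
synthesized_frames = ["syn0", "syn1", "syn2"]
loss_flags = [False, True, False]
mixed = select_output(buffered_frames, synthesized_frames, loss_flags)
```

Only the frame flagged as lost is taken from the synthesized signal here, so any mismatch in voice characteristic between the two sources would be audible at the switch points.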
However, the first conventional configuration has been conceived on the assumption that the audio to be restored has a waveform. Thus, the configuration can restore an audio only in the rare case where the audio has a repeated waveform and a part of the repeated waveform has been lost. The configuration therefore has the drawback that it cannot restore (a) the many general audios which exist in a real environment and which cannot be represented in a waveform, or (b) an audio to be restored which is entirely distorted.
In the second conventional configuration, when a distorted audio is restored, a phoneme sequence is restored through linguistic analysis using knowledge regarding the audio structure. Therefore, it becomes possible to restore an audio linguistically even in the case where the audio to be restored is a general audio with a non-repeated waveform or an audio which is entirely distorted. However, there is no concept of restoring an audio using an audio which is similar to the real audio based on audio characteristic information such as the speaker's characteristics and voice characteristic. Therefore, the configuration has the drawback that it cannot restore an audio which sounds natural in a real environment. For example, in the case of restoring the voice of a disc jockey (DJ), the audio is restored using another person's voice stored in a speech synthesis apparatus.
In the third conventional configuration, a missing audio part is generated through pattern matching at the waveform level. Therefore, the configuration has the drawback that it cannot restore a missing audio part in the case where the whole segment in which the waveform changes has been lost. For example, it cannot restore an utterance of “Konnichiwa (Hello)” in the case where plural phonemes have been lost, as represented by “Koxxchiwa” (each x indicates a missing phoneme).
In the fourth conventional configuration, knowledge regarding audio structure, in the form of a “language model”, is used. Therefore, even in the case of an audio with missing phonemes, it is possible to estimate the phoneme sequence of the audio to be restored based on the context, and to restore the audio linguistically. However, there is no concept of extracting audio characteristics, which include voice characteristic, voice tone, audio volume, and reverberation characteristic, from an inputted speech, and restoring the speech based on the extracted audio characteristics. Therefore, the configuration has the drawback that it cannot restore a speech with high fidelity to the real audio characteristics in the case where the voice characteristic, voice tone and the like of a person change from moment to moment depending on the person's feelings and tiredness.
With these conventional configurations, it has been impossible to restore a distorted audio using real audio characteristics in the case where the distorted audio is a general audio which has a non-repeated waveform and exists in the real world.