(1) Field of the Invention
The present invention relates to a speech synthesis apparatus, and in particular to a speech synthesis apparatus capable of embedding information in synthesized speech.
(2) Description of the Related Art
With the recent development of digital signal processing technology, methods of embedding watermark information using phase modulation, echo signals, or auditory masking have been developed in order to prevent illegal copying of acoustic data, particularly music data, and to protect copyrights. The purpose is to guarantee that, with information embedded in the acoustic data generated as content, only an authorized rights holder can use the content, by having a reproducing appliance read out the information.
Speech data, on the other hand, includes not only speech uttered by humans but also speech generated by so-called speech synthesis. Speech synthesis technology, which converts a character-string text into speech, has advanced remarkably. Synthesized speech that closely reproduces the characteristics of the speaker recorded in the underlying speech database can be generated either by a system that synthesizes speech using speech waveforms stored in a speech database without processing them, or by a system that uses a statistical learning algorithm to construct, from a speech database, a method of controlling the parameters of each frame, such as a speech synthesis method using a Hidden Markov Model (HMM). In other words, synthesized speech makes it possible to impersonate the speaker.
In order to prevent such impersonation, when information is embedded into synthesized speech for each piece of speech data, it is important not only to protect copyrights, as with music data, but also to embed into the synthesized speech information for identifying it as synthesized speech, for identifying the system used for the speech synthesis, and the like.
As a conventional method of embedding information into synthesized speech, there is a method of outputting synthesized speech to which identification information, indicating that the speech is synthesized speech, has been added by changing the signal power in a specific frequency band of the synthesized speech. This band lies outside the main frequency band of the speech signal, where a deterioration of sound quality is difficult for a listener to perceive (e.g. refer to First Patent Reference: Japanese Patent Publication No. 2002-297199 (pp. 3 to 4, FIG. 2)). FIG. 1 is a diagram for explaining the conventional method of embedding information into synthesized speech as disclosed in the First Patent Reference. In a speech synthesis apparatus 12, a synthesized speech signal outputted from a sentence speech synthesis processing unit 13 is inputted to a synthesized speech identification information adding unit 17. The synthesized speech identification information adding unit 17 then adds, to the synthesized speech signal, identification information indicating that the signal is different from a speech signal generated by human speech, and outputs the result as a synthesized speech signal 18. On the other hand, in a synthesized speech identifying apparatus 20, an identifying unit 21 detects whether or not the input speech signal contains identification information. When the identifying unit 21 detects identification information, the input speech signal is identified as the synthesized speech signal 18, and the identification result is displayed on an identification result displaying unit 22.
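The band-power approach described above can be illustrated with a minimal sketch. It is not the method of the First Patent Reference itself: the sample rate, the marker frequency, the added-tone scheme, the detection threshold, and all function names are assumptions chosen only to show how a change in signal power in a band outside the main speech band might be added and later detected.

```python
import math

SAMPLE_RATE = 16000   # Hz (assumed for this sketch)
MARKER_FREQ = 7500.0  # Hz (assumed band above the main speech energy)


def band_power(samples, freq, sample_rate=SAMPLE_RATE):
    """Estimate normalised signal power at `freq` with the Goertzel algorithm."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / sample_rate)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    power = s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2
    return power / (len(samples) ** 2)


def add_identification(samples, amplitude=0.01):
    """Raise the power in the marker band by adding a faint tone (assumed scheme)."""
    step = 2.0 * math.pi * MARKER_FREQ / SAMPLE_RATE
    return [x + amplitude * math.sin(step * n) for n, x in enumerate(samples)]


def has_identification(samples, threshold=1e-5):
    """Report whether the marker band carries more power than the assumed threshold."""
    return band_power(samples, MARKER_FREQ) > threshold


# Stand-in for a synthesized speech signal: 0.1 s of a 200 Hz tone.
speech = [0.5 * math.sin(2.0 * math.pi * 200.0 * n / SAMPLE_RATE) for n in range(1600)]
marked = add_identification(speech)
```

Here `add_identification` plays the role of the identification information adding unit 17 and `has_identification` that of the identifying unit 21. Because the marker depends only on power in one band, any processing that attenuates that band would also remove it.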
Further, in addition to the method of using signal power in a specific frequency band, there is a method applicable to speech synthesis in which one-period waveforms are synchronized to pitch marks and connected to synthesize speech. In this method, information is added to the speech by slightly modifying the waveform for a specific period at the time the waveforms are connected (e.g. refer to Second Patent Reference: Japanese Patent Publication No. 2003-295878). The modification of the waveform consists of setting the amplitude of the waveform for a specific period to a value different from the one specified by the original prosody information, switching the waveform for the specific period to a waveform whose phase is inverted, or shifting the waveform for the specific period by a very small amount of time from the pitch mark to which it should be synchronized.
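Of the three modifications listed above, phase inversion is the simplest to sketch. The following example is an illustration under stated assumptions, not the procedure of the Second Patent Reference: the representation of pitch periods as lists of samples, the `stride` parameter that selects which periods carry a bit, and the correlation-based detector are all hypothetical.

```python
def embed_bits(periods, bits, stride=4):
    """Invert the polarity of selected one-period waveforms to encode bits.

    Bit i is carried by period (i + 1) * stride - 1; a 1 inverts that period.
    `stride` must be at least 2 so the preceding period stays unmodified.
    """
    out = [list(p) for p in periods]
    for i, bit in enumerate(bits):
        idx = (i + 1) * stride - 1
        if idx >= len(out):
            break  # not enough periods to carry the remaining bits
        if bit:
            out[idx] = [-x for x in out[idx]]
    return out


def extract_bits(periods, n_bits, stride=4):
    """Recover bits by correlating each carrier period with the one before it."""
    bits = []
    for i in range(n_bits):
        idx = (i + 1) * stride - 1
        if idx >= len(periods):
            break
        corr = sum(a * b for a, b in zip(periods[idx - 1], periods[idx]))
        bits.append(1 if corr < 0 else 0)
    return bits


def connect(periods):
    """Concatenate the one-period waveforms into a single signal."""
    return [x for p in periods for x in p]


# Sixteen identical pitch periods stand in for a sustained vowel.
pulse = [0.0, 1.0, 0.5, -0.3, -0.1]
source_periods = [list(pulse) for _ in range(16)]
payload = [1, 0, 1, 1]
marked_periods = embed_bits(source_periods, payload)
```

The detector relies on neighbouring pitch periods of a voiced sound being similar, so a negative correlation suggests an inverted period. Real speech varies from period to period, and a practical detector would need a more robust decision rule.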
On the other hand, as a conventional speech synthesis apparatus intended to improve the clarity and naturalness of speech, there is a speech synthesis apparatus which generates the fine temporal structure, called micro-prosody, of the fundamental frequency or speech intensity within a phoneme that is found in natural human speech (e.g. refer to Third Patent Reference: Japanese Patent Publication No. 09-244678, and Fourth Patent Reference: Japanese Patent Publication No. 2000-10581). Micro-prosody can be observed within a range of 10 milliseconds to 50 milliseconds (at least 2 pitches or more) before or after phoneme boundaries. It is known from research papers and the like that differences within this range are very difficult to perceive. It is also known that micro-prosody hardly affects the characteristics of a phoneme. As a practical observation range of micro-prosody, a range of 20 milliseconds to 50 milliseconds is considered. The maximum value is set to 50 milliseconds because experience shows that a length greater than 50 milliseconds may exceed the length of a vowel.
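The relationship between these time ranges and pitch periods can be made concrete with a small calculation. The sketch below is illustrative only: the function names are hypothetical, and it assumes that "2 pitches" refers to two complete pitch periods at a given fundamental frequency.

```python
def pitch_periods_in_window(f0_hz, window_ms):
    """Count the complete pitch periods that fit in a window of `window_ms`."""
    return int(window_ms * f0_hz // 1000)


def micro_prosody_span(boundary_ms, width_ms, side="after"):
    """Return (start_ms, end_ms) of a micro-prosody window at a phoneme boundary.

    The width is limited to the 10 ms to 50 ms range stated in the text.
    """
    if not 10 <= width_ms <= 50:
        raise ValueError("width_ms must be between 10 and 50 milliseconds")
    if side == "before":
        return (boundary_ms - width_ms, boundary_ms)
    if side == "after":
        return (boundary_ms, boundary_ms + width_ms)
    raise ValueError("side must be 'before' or 'after'")
```

For example, at a fundamental frequency of 100 Hz one pitch period lasts 10 milliseconds, so the practical range of 20 to 50 milliseconds covers 2 to 5 pitch periods.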