Generally, a voice synthesizer reads a text-format file composed of a character row including inputted characters, sentences, marks, figures and the like, and converts the read character row into a voice by referring to a dictionary in which a plurality of voice waveform data are held as a library. Such a voice synthesizer is used, for example, in a software application of a personal computer. In addition, in order to obtain an aurally natural voice, a voice emphasizing method for emphasizing a specific word in a sentence is known.
FIG. 13 is a block diagram of a voice synthesizer that does not use a prominence (emphasis of a specific part). The voice synthesizer 100 shown in FIG. 13 includes a pattern element analyzing unit 11, a word dictionary 12, a parameter generating unit 13, a waveform dictionary 14, and a pitch clipping and superimposing unit 15.
The pattern element analyzing unit 11 analyzes the pattern elements (a pattern element being the minimum language unit composing a sentence, or the minimum unit having a meaning in the sentence) of the inputted kana-kanji mixed sentence (type-of-character mixed sentence) with reference to the word dictionary 12; decides the type of each word (its part of speech), the reading of the word, and the accent or intonation; and outputs phonetic symbols with rhythm marks (an intermediate language). The text-format file inputted to this pattern element analyzing unit 11 is a kana-kanji mixed character row in the case of Japanese, and an alphabet string in the case of English.
As is well known, a generation model of a voiced sound (particularly, a vowel) is composed of a voice source (the vocal cords), an articulation system (the vocal tract) and a radiation opening (the lips); a voice source signal is generated when the vocal cords are oscillated by air from the lungs. The vocal tract is the part from the vocal cords to the throat. The shape of the vocal tract is changed by making the diameter of the throat larger or smaller, and various vowels are generated when the voice source signal resonates with a specific shape of the vocal tract. On the basis of this generation model, properties such as the pitch period described below are defined.
In this case, the pitch period represents the oscillation period of the vocal cords, and the pitch frequency (also referred to as the fundamental frequency, or merely as the pitch) represents the oscillation frequency of the vocal cords and is a property relating to the tone of a voice. In addition, the accent represents the temporal change of the pitch frequency within a word, and the intonation represents the temporal change of the pitch frequency over the entire sentence. The accent and the intonation are thus physically and closely related to the temporal pattern of the pitch frequency. Specifically, the pitch frequency becomes higher at an accent position, and also becomes higher when the intonation is raised.
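The relation between the pitch period and the pitch frequency described above can be illustrated by the following minimal sketch, which assumes only that the pitch frequency is the reciprocal of the pitch period. The function name `pitch_frequency_hz` and the sample period values are hypothetical and are not taken from any disclosed apparatus.

```python
def pitch_frequency_hz(pitch_period_s: float) -> float:
    """Return the pitch (fundamental) frequency in Hz for a pitch period in seconds."""
    if pitch_period_s <= 0:
        raise ValueError("pitch period must be positive")
    return 1.0 / pitch_period_s


# Hypothetical pitch periods (in milliseconds) for three successive voiced frames.
# A shorter period means a higher pitch frequency, as at an accent position.
periods_ms = [8.0, 5.0, 10.0]
contour_hz = [pitch_frequency_hz(p / 1000.0) for p in periods_ms]
```

In this sketch, the second frame has the shortest pitch period and therefore the highest pitch frequency, which corresponds to the rise of the pitch frequency at an accent position.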
In many cases, a voice synthesized at, for example, a predetermined pitch frequency without using information such as the accent is read in a monotone; in other words, the voice sounds aurally unnatural, as if read by a robot. Therefore, the voice synthesizer 100 outputs the phonetic symbols with rhythm marks so that a natural pitch change can be generated at a succeeding stage of the processing. An example of an original character row and its intermediate language (the phonetic symbols with rhythm marks) is as follows.
A character row: “akusentowapicchinojikantekihenkatokanrengaaru”.
An intermediate language: “a'ku%sentowa pi'cchino jikanteki he'nkato kanrenga&a'ru.”
Here, “'” represents an accent position, “%” represents an unvoiced consonant, “&” represents a nasal sonant, and “.” represents the sentence boundary of an assertive sentence.
Further, “(full-width space)” represents a clause boundary.
In other words, the intermediate language is outputted as a character row annotated with the accent, the intonation, the phoneme duration, the pause duration and the like.
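As an illustration of how such an intermediate language might be consumed by a succeeding stage, the following sketch splits a string into clauses and records the rhythm marks. It is a minimal sketch under stated assumptions: the names `Clause`, `MARKS` and `parse_intermediate` are hypothetical, and the convention of recording each mark as the number of phoneme characters preceding it within the clause is chosen here only for illustration.

```python
from dataclasses import dataclass, field

# Rhythm marks described in the text, mapped to illustrative labels.
MARKS = {"'": "accent", "%": "unvoiced", "&": "nasal"}


@dataclass
class Clause:
    phonemes: str                               # clause text with marks removed
    marks: list = field(default_factory=list)   # (label, offset) pairs


def parse_intermediate(text: str):
    """Split an intermediate-language string into clauses and collect marks.

    Returns (clauses, assertive), where `assertive` is True when the string
    ends with "." (the boundary of an assertive sentence).
    """
    assertive = text.endswith(".")
    body = text[:-1] if assertive else text
    clauses = []
    # str.split() with no argument splits on any whitespace, including the
    # full-width space (U+3000) used as the clause boundary.
    for raw in body.split():
        phonemes = []
        marks = []
        for ch in raw:
            if ch in MARKS:
                # Assumed convention: offset = phoneme characters seen so far.
                marks.append((MARKS[ch], len(phonemes)))
            else:
                phonemes.append(ch)
        clauses.append(Clause("".join(phonemes), marks))
    return clauses, assertive


# Two clauses excerpted from the example above.
clauses, assertive = parse_intermediate("a'ku%sentowa kanrenga&a'ru.")
```

For this excerpt, the first clause yields the phonemes `akusentowa` with an accent mark at offset 1 and an unvoiced mark at offset 3.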
The word dictionary 12 stores (holds, accumulates or memorizes) the type of each word, the reading of the word, the position of the accent and the like in association with one another.
The waveform dictionary 14 stores the voice waveform data of the voice itself (the phoneme waveform or the phoneme piece), a phoneme label showing which phoneme a specific part of the voice indicates, and a pitch mark indicating the pitch period with respect to the voiced sound.
The parameter generating unit 13 generates, provides or sets parameters such as the pattern of the pitch frequency, the position of each phoneme, the phoneme duration, the pause duration and the intensity of the voice (the sound pressure) with respect to the character row. In addition, the parameter generating unit 13 decides which part of the voice waveform data stored in the waveform dictionary 14 is to be used. These parameters determine the pitch period, the positions of the phonemes and the like, so that a natural voice, as if a person were reading the sentence, can be obtained.
The pitch clipping and superimposing unit 15 clips the voice waveform data stored in the waveform dictionary 14 and multiplies the clipped voice waveform data by a window function or the like to obtain processed voice waveform data. It then synthesizes the voice by superimposing (overlapping) and adding this processed voice waveform data and a part of second voice waveform data belonging to the waveform sections preceding and succeeding the waveform section to which the processed voice waveform data belongs. As the processing method of the pitch clipping and superimposing unit 15, for example, a PSOLA (Pitch-Synchronous Overlap-Add: a pitch conversion method based on overlapping and adding waveforms) method is used (refer to “Diphone Synthesis Using an Overlap-Add Technique for Speech Waveforms Concatenation”, ICASSP '86, pp. 2015-2018, 1986).
FIG. 15A to FIG. 15D each illustrate the method of superimposing and adding waveforms. As shown in FIG. 15A, the PSOLA method clips voice waveform data of two periods from the waveform dictionary 14 on the basis of the generated parameters, and then, as shown in FIG. 15B, the clipped voice waveform data is multiplied by the window function (for example, a Hanning window) to generate processed voice waveform data. Then, as shown in FIG. 15C, the pitch clipping and superimposing unit 15 superimposes and adds the last half of the preceding section and the first half of the present section, and likewise the last half of the present section and the first half of the succeeding section, whereby a waveform of one period is synthesized (refer to FIG. 15D).
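The windowing and overlap-add steps described with reference to FIG. 15A to FIG. 15D can be sketched as follows. This is a simplified illustration, not the PSOLA method of the cited reference in full: it assumes that all clipped segments have the same length of two periods, that they are placed at a constant spacing of one period, and that a periodic Hanning window is used. The names `hann` and `overlap_add` and the parameter `hop` are hypothetical.

```python
import math


def hann(n: int) -> list:
    """Periodic Hanning window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def overlap_add(segments: list, hop: int) -> list:
    """Window each clipped segment and add it into the output every `hop` samples.

    Each segment is assumed to span two pitch periods, and `hop` is assumed to
    be the desired pitch period in samples, so that the last half of one
    windowed segment overlaps the first half of the next.
    """
    if not segments:
        return []
    seg_len = len(segments[0])
    window = hann(seg_len)
    out = [0.0] * (hop * (len(segments) - 1) + seg_len)
    for k, seg in enumerate(segments):
        if len(seg) != seg_len:
            raise ValueError("all segments must have the same length")
        start = k * hop
        for i, (sample, w) in enumerate(zip(seg, window)):
            out[start + i] += sample * w
    return out


# Three constant segments of two periods each (period = 4 samples).
synth = overlap_add([[1.0] * 8 for _ in range(3)], hop=4)
```

With the periodic Hanning window and a spacing of half the segment length, the overlapping window halves sum to one, so the fully overlapped part of the output reproduces the amplitude of the constant input. Choosing a spacing different from the original pitch period is what would change the pitch in this simplified model.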
The above description relates to synthesis when the prominence is not used.
Next, with reference to FIG. 14, synthesis when the prominence is used will be described.
Various voice synthesizers that use the prominence to emphasize a specific part, such as a word designated by a user, have been proposed (for example, Japanese Patent Laid-Open HEI5-224689, hereinafter referred to as publicly known document 1).
FIG. 14 is a block diagram of a voice synthesizer using a prominence; here, the prominence is manually inputted. The voice synthesizer 101 shown in FIG. 14 differs from the voice synthesizer 100 shown in FIG. 13 in that an emphasized word manual inputting unit 26, by which setting data indicating which part of the inputted sentence is to be emphasized and to what degree is designated by manual input, is provided at the input and output side of the pattern element analyzing unit 11. Except for the emphasized word manual inputting unit 26, the parts having the same reference numerals as those described above have the same functions.
Then, for the part designated by the emphasized word manual inputting unit 26, a parameter generating unit 23 shown in FIG. 14 sets a higher pitch and a longer phoneme length than for the voice part that is not emphasized, and thereby generates parameters to emphasize a specific word. In addition, the parameter generating unit 23 makes the amplitude larger at the voice part to be emphasized, or generates a parameter such as one that places a pause before or after the voice part.
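The kind of parameter adjustment described above can be illustrated by the following sketch. All names (`ProsodyParams`, `emphasize`) and all scale factors are hypothetical values chosen for illustration; they are not taken from publicly known document 1. For brevity, the sketch applies the larger amplitude and the pause together, although the description presents them as alternatives.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ProsodyParams:
    pitch_hz: float          # pitch frequency of the voice part
    duration_ms: float       # phoneme length
    amplitude: float         # relative amplitude
    pause_before_ms: float = 0.0
    pause_after_ms: float = 0.0


def emphasize(params: ProsodyParams,
              pitch_scale: float = 1.2,       # assumed factor
              duration_scale: float = 1.3,    # assumed factor
              amplitude_scale: float = 1.5,   # assumed factor
              pause_before_ms: float = 100.0  # assumed pause length
              ) -> ProsodyParams:
    """Return parameters for an emphasized voice part.

    The emphasized part gets a higher pitch, a longer phoneme length, a larger
    amplitude, and a pause placed before it.
    """
    return replace(
        params,
        pitch_hz=params.pitch_hz * pitch_scale,
        duration_ms=params.duration_ms * duration_scale,
        amplitude=params.amplitude * amplitude_scale,
        pause_before_ms=params.pause_before_ms + pause_before_ms,
    )


normal = ProsodyParams(pitch_hz=120.0, duration_ms=80.0, amplitude=1.0)
emphasized = emphasize(normal)
```

Because the parameters of the non-emphasized part are left unchanged and a new object is returned, the same base parameters can be reused for every part that the user does not designate.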
Further, many voice emphasizing methods have conventionally been proposed.
For example, another voice synthesizing method using the prominence is disclosed in JP-A-5-80791 or the like.
Further, Japanese Patent Laid-Open HEI5-27792 (hereinafter referred to as publicly known document 2) discloses a voice emphasizing apparatus that emphasizes a specific key word by providing a key word dictionary (a level-of-importance dictionary) separate from the reading of the text sentence. The voice emphasizing apparatus disclosed in publicly known document 2 receives a voice as input and performs key word detection by extracting a characteristic amount of the voice, such as a spectrum, on the basis of digital voice waveform data.
However, when the voice emphasizing method disclosed in publicly known document 1 is used, the user has to input the prominence manually each time a part to be emphasized appears, which involves the problem that the operation becomes complicated.
Further, the voice emphasizing apparatus disclosed in publicly known document 2 does not change the emphasizing level in multiple stages, but merely extracts the key word on the basis of the voice waveform data. Accordingly, its operability may also be insufficient.