1. Field of Technology
The present invention relates to a speech synthesis method and apparatus, and in particular to a speech synthesis method and apparatus whereby words, phrases or short sentences can be generated as natural-sounding synthesized speech having accurate rhythm and intonation characteristics, for such applications as vehicle navigation systems, personal computers, etc.
2. Prior Art
In generating synthesized speech from input data representing a speech item such as a word, phrase or sentence, the essential requirements for obtaining natural-sounding synthesized speech are that the rhythm and intonation be as close as possible to those of that speech item when spoken by a person. The rhythm of an enunciated speech item, and the average speed of enunciating its syllables, are defined by the respective durations of the sequence of morae of that speech item. Although the term “morae” is generally applied only to the Japanese language, the term will be used herein with a more general meaning, as signifying “rhythm intervals”, i.e., durations for which respective syllables of a speech item are enunciated.
The classification of respective sounds as “syllables” depends upon the particular language in which speech synthesis is being performed. For example, English does not have a syllable that is directly equivalent to the Japanese syllable “N” (the syllabic nasal), which is considered to occupy one mora in spoken Japanese. Furthermore, the term “accent” or “accented syllable” as used herein is to be understood as signifying, in the case of Japanese, a syllable which exhibits an abrupt drop in pitch. In the case of English, however, the term “accented” is to be understood as applying to a syllable or word which is stressed, i.e., for which there is an abrupt increase in speech power. Thus, although the speech item examples used in the following description of embodiments of the invention are generally in Japanese, the invention is not limited in its application to that language.
One prior art system which is concerned with the problem of determining the rhythm of synthesized speech is described in Japanese patent HEI 6-274195 (Japanese Language Speech Synthesis System forming Normalized Vowel Lengths and Consonant Lengths Between Vowel Center-of-Gravity Points). With that prior art system, as shown in FIG. 21, a rule-based method is utilized, whereby the time interval between the vowel energy center-of-gravity points of the respective vowels of two mutually adjacent morae formed of a leading syllable 11 and a trailing syllable 12 is taken as being the morae interval between these syllables, and the value of that morae interval is determined by using as parameters the consonant which is located between the two morae and the pronunciation speed. The respective durations of each of the vowels of the two morae are then inferred, by using as parameters the vowel energy center-of-gravity interval and the consonant durations.
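The vowel energy center-of-gravity point referred to above is not defined numerically in this description. As an illustration only, the following sketch assumes a common definition, namely the energy-weighted mean time of the samples of a vowel, with energy taken as the squared sample value. The function name, the sample representation and the choice of squared amplitude as the energy measure are all assumptions made for this example.

```python
def energy_center_of_gravity(samples, sample_rate, start_index=0):
    """Return the time (seconds) of the energy-weighted centroid of a vowel.

    Assumption: energy of a sample is its squared amplitude, and the
    center of gravity is the energy-weighted mean sample position.
    `start_index` is the position of the vowel's first sample within
    the whole utterance.
    """
    energies = [s * s for s in samples]
    total = sum(energies)
    if total == 0:
        raise ValueError("vowel segment contains no energy")
    centroid_index = sum(i * e for i, e in enumerate(energies)) / total
    return (start_index + centroid_index) / sample_rate


def morae_interval(cog_leading, cog_trailing):
    """Interval between the centers of gravity of two adjacent vowels."""
    return cog_trailing - cog_leading
```

Under this assumed definition, the interval between two adjacent morae is simply the difference between the two centroid times.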
Another example of prior art systems for synthesized speech is described in Japanese patent HEI 7-261778 (Method and Apparatus for Speech Information Processing), whereby respective pitch patterns can be generated for words which are to be speech-synthesized. Such a pitch pattern defines, for each phoneme of a word, the phoneme duration and the form of variation of pitch in that phoneme. With the first embodiment of that invention, a pitch pattern is generated for a word by a process of:
(a) predetermining the respective durations of the phonemes of the word,
(b) determining the number of morae and the position of any accented syllable (i.e., the accent type) of the word,
(c) predetermining certain characteristic amounts, i.e., values such as reference values of pitch and speech power, for the word,
(d) for each vowel of the word, looking up a pitch pattern table to obtain respective values for pitch at each of a plurality of successive time points within the vowel (these pitch values for a vowel being obtained from the pitch pattern table in accordance with the number of morae of the word, the mora position of that vowel and the position of any accented syllable in the word), and
(e) within each vowel of the word, deriving interpolated values of pitch by using the set of pitch values obtained for that vowel from the pitch pattern table.
Interpolation from the vowel pitch values can also be applied to obtain the pitch values of any consonants in the word.
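Steps (d) and (e) above can be illustrated by a short sketch. The description states only that interpolated pitch values are derived from the values read out of the pitch pattern table; the use of piecewise-linear interpolation here, the function name, and the representation of time points in milliseconds are assumptions made for this example and are not taken from the cited reference.

```python
def interpolate_pitch(anchor_times, anchor_pitches, query_times):
    """Piecewise-linear interpolation of pitch between table values.

    anchor_times   : ascending time points (e.g. ms) within a vowel
    anchor_pitches : pitch values looked up for those time points
    query_times    : time points at which pitch values are required
    Queries outside the anchor range take the nearest anchor value.
    """
    if len(anchor_times) != len(anchor_pitches) or not anchor_times:
        raise ValueError("need equally many anchor times and pitches")
    result = []
    for t in query_times:
        if t <= anchor_times[0]:
            result.append(anchor_pitches[0])
            continue
        if t >= anchor_times[-1]:
            result.append(anchor_pitches[-1])
            continue
        for i in range(len(anchor_times) - 1):
            t0, t1 = anchor_times[i], anchor_times[i + 1]
            if t0 <= t <= t1:
                w = (t - t0) / (t1 - t0)
                p0, p1 = anchor_pitches[i], anchor_pitches[i + 1]
                result.append(p0 + w * (p1 - p0))
                break
    return result


# Three table values for one vowel, queried midway between them.
print(interpolate_pitch([0, 50, 100], [120.0, 140.0, 130.0], [25, 75]))
# prints [130.0, 135.0]
```

The same routine could be applied across a consonant lying between two vowels by using the last pitch value of the preceding vowel and the first pitch value of the following vowel as anchors.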
As shown in FIG. 22, that system includes a speech file 21 having stored therein a speech database of words which are expressed in a form whereby the morae number and accent type can be determined, with each word being assigned a file number. A word which is to be speech-synthesized is first supplied to a features extraction section 22, a label attachment section 23 and a phoneme list generating section 14. The label attachment section 23 determines the starting and ending time points for audibly generating each of the phonemes constituting the word. This operation is executed manually, or under the control of a program. The phoneme list generating section 14 determines the morae number and accent type of the word, and the information thus obtained by the label attachment section 23 and phoneme list generating section 14, labelled with the file number of the word, are combined to form entries for the respective phonemes of the word in a table that is held in a label file 16.
A characteristic amounts file 25 specifies such characteristic quantities as center values of fundamental frequency and speech power which are to be used for the selected word. The data which have been set into the characteristic amounts file 25 and label file 16 for the selected word are supplied to a statistical processing section 27, which contains the aforementioned pitch pattern table. The aforementioned respective sets of frequency values for each vowel of the word are thereby obtained from the pitch pattern table, in accordance with the environmental conditions (number of morae in word, mora position of that vowel, accent type of the word) affecting that vowel, and are supplied to a pitch pattern generating section 28. The pitch pattern generating section 28 executes the aforementioned interpolative processing to obtain the requisite pitch pattern for the word.
FIG. 23 graphically illustrates a pitch pattern which might be derived by the system of FIG. 22, for the case of a word “azi”. The respective durations which have been determined for the three phonemes of this word are indicated as L1, L2, L3, and it is assumed that three pitch values are obtained by the statistical processing section 27 for each vowel, these being indicated as f1, f2, f3 for the leading vowel “a”, with all other pitch values being derived by interpolation.
It will be apparent that it is necessary to derive the sets of values to be utilized in the pitch pattern table of the statistical processing section 27 by statistical analysis of large amounts of speech patterns, and the need to process such large amounts of data in order to obtain sufficient accuracy of results is a disadvantage of this method. Furthermore, although the resultant information will specify average forms of pitch variation, such an average form of pitch variation may not necessarily correspond to the actual intonation of a specific word in natural speech.
With the prior art method of FIG. 21, on the other hand, the rhythm of the resultant synthesized speech, i.e., the rhythm within a word or sentence, is determined only on the basis of assumed timing relationships between each of respective pairs of adjacent morae, irrespective of the actual rhythm which the word or sentence would have in natural speech. Hence it will be impossible to generate synthesized speech having a rhythm which is close to that of natural speech.
There is therefore a requirement for a speech synthesis system whereby the resultant synthesized speech is substantially close to natural speech in its rhythm and intonation characteristics, but which does not require the acquisition, processing and storage of large amounts of data to achieve such results and therefore would be suited to small-scale types of application such as vehicle navigation systems, personal computers, etc.
It is an objective of the present invention to overcome the disadvantages of the prior art described above by providing a method and apparatus for speech synthesis whereby synthesized speech can be reliably generated in which the rhythm, speech power variations and pitch variations are close to those of natural speech, without requirements for executing complex processing operations on large amounts of data or for storing large amounts of data.
The basis of the present invention lies in the use of prosodic templates, each consisting of three sets of data which respectively express specific rhythm, pitch variation, and speech power variation characteristics. Each prosodic template is generated by a human operator, who first enunciates into a microphone a sample speech item (or listens to the item being enunciated), then enunciates a series of repetitions of a single syllable, referred to herein as the reference syllable, with these enunciations being as close as possible in rhythm, pitch variations and speech power variations to those of the sample speech item. The resultant acoustic waveform is analyzed to extract data expressing the rhythm, the pitch variation, and the speech power variation characteristics of that sequence of enunciations, to constitute in combination a prosodic template. In addition, the number of morae and accent type of the sequence of enunciations of the reference syllable are determined.
To achieve the above objective, the basic features of the present invention are as follows:
(1) Generating and storing in memory beforehand a plurality of such prosodic templates, derived for respectively different sample speech items, and classified in accordance with number of morae and accent type,
(2) Thereafter, converting a set of primary data which express an object speech item in the form of text or a rhythm alias into an acoustic waveform expressing speech, by successive steps of:
(a) judging the number of morae and the accent type of the speech item,
(b) selecting one of the stored prosodic templates which has an identical number of morae and accent type to the speech item,
(c) generating a sequence of acoustic waveform segments which express the sequence of syllables constituting the object speech item,
(d) shaping these acoustic waveform segments such as to bring the rhythm of the object speech item close to that of the selected prosodic template,
(e) shaping the resultant acoustic waveform segments such as to bring the pitch variation and speech power variation characteristics of the object speech item close to those of the selected prosodic template, and
(f) linking the resultant shaped acoustic waveform segments into a continuous waveform.
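Feature (1) and steps (a) and (b) above amount to storing the templates under a two-part classification and retrieving one by exact match. The following is a minimal sketch of such a store, given for illustration only. The class name, the field names, the representation of the accent type as an integer, and the use of an in-memory dictionary are all assumptions made for this example; the three data sets are shown as plain tuples of numbers.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class ProsodicTemplate:
    """Hypothetical container for one stored prosodic template."""
    mora_count: int            # number of morae of the sample speech item
    accent_type: int           # assumed integer code for the accent type
    rhythm: Tuple[float, ...]  # rhythm data set
    pitch: Tuple[float, ...]   # pitch variation data set
    power: Tuple[float, ...]   # speech power variation data set


TemplateIndex = Dict[Tuple[int, int], ProsodicTemplate]


def build_index(templates: Iterable[ProsodicTemplate]) -> TemplateIndex:
    """Classify templates by (number of morae, accent type)."""
    return {(t.mora_count, t.accent_type): t for t in templates}


def select_template(index: TemplateIndex, mora_count: int,
                    accent_type: int) -> ProsodicTemplate:
    """Step (b): pick the template matching the object speech item."""
    key = (mora_count, accent_type)
    if key not in index:
        raise LookupError(f"no template for morae={mora_count}, "
                          f"accent type={accent_type}")
    return index[key]
```

In this sketch a speech item for which no matching template has been stored raises an error; how such a case would be handled in practice is not addressed here.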
Preferably, the invention should be applied to speech items having no more than nine morae.
The invention provides various ways in which the rhythm of an object speech item can be matched to that of a selected prosodic template. For example, the rhythm data of a stored prosodic template may express only the respective durations of the vowel portions of each of the reference syllable repetitions. In that case, each portion of the acoustic waveform segments which expresses a vowel of the object speech item is subjected to waveform shaping to make the duration of that vowel substantially identical to that of the corresponding vowel expressed in the selected prosodic template.
Alternatively, the rhythm data set of each stored prosodic template may express only the respective intervals between adjacent pairs of reference time points which are successively defined within the sequence of enunciations of the reference syllable. Each of these reference time points can for example be the vowel energy center-of-gravity point of a syllable, or the starting point of a syllable, or the auditory perceptual timing point (described hereinafter) of that syllable. In that case, the acoustic waveform segments which express the object speech item are subjected to waveform shaping such as to make the duration of each interval between a pair of adjacent ones of these reference time points substantially identical to the duration of the corresponding interval which is specified in the selected prosodic template.
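Both forms of rhythm data described above lead to the same kind of computation: for each vowel, or for each interval between adjacent reference time points, a ratio is obtained by which the corresponding portion of the object speech item must be lengthened or shortened. The sketch below illustrates only that ratio computation, not the waveform shaping itself. The function names and the use of milliseconds are assumptions made for this example.

```python
def intervals_between(reference_points):
    """Durations between adjacent reference time points."""
    return [b - a for a, b in zip(reference_points, reference_points[1:])]


def stretch_factors(object_durations, template_durations):
    """Ratio by which each object duration must be scaled so that it
    becomes identical to the corresponding template duration."""
    if len(object_durations) != len(template_durations):
        raise ValueError("object item and template must have the same "
                         "number of durations")
    factors = []
    for obj, tmpl in zip(object_durations, template_durations):
        if obj <= 0:
            raise ValueError("durations must be positive")
        factors.append(tmpl / obj)
    return factors


# Reference time points (ms) of the object item and of the template.
object_points = [0, 100, 300]
template_points = [0, 150, 250]
print(stretch_factors(intervals_between(object_points),
                      intervals_between(template_points)))
# prints [1.5, 0.5]
```

A factor greater than 1 indicates that the portion concerned is to be lengthened, and a factor less than 1 that it is to be shortened.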
The data expressing a speech power variation characteristic, in each stored prosodic template, can consist of data which specifies the respective peak values of each sequence of pitch waveform cycles constituting a vowel portion of a syllable. In that case, the speech power characteristic of the object speech item is brought close to that of the selected prosodic template by executing waveform shaping of each of the pitch waveform cycles constituting each vowel portion expressed by the acoustic waveform segments, such as to make each peak value of a pitch waveform cycle match the peak value of the corresponding pitch waveform cycle in the corresponding vowel as expressed by the speech power data of the selected prosodic template.
Alternatively, the data expressing the speech power variation characteristic expressed in a prosodic template can consist of data which specifies the respective average peak values of each set of pitch waveform cycles constituting a vowel portion of an enunciation of the reference syllable. In that case, the speech power characteristic of the object speech item is brought close to that of the selected prosodic template by executing waveform shaping of the pitch waveform cycles constituting each vowel expressed by the acoustic waveform segments, such as to make each peak value substantially identical to the average peak value of the corresponding vowel portion that is expressed by the speech power data of the prosodic template.
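The two forms of speech power matching described above differ only in the target applied to each pitch waveform cycle: either the peak value of the corresponding cycle in the template, or a single average peak value for the whole vowel. The following sketch illustrates both under the assumption that shaping is done by uniform amplitude scaling of each cycle, which is one simple possibility and is not stated in this description. The function names and the representation of a cycle as a list of samples are likewise assumptions.

```python
def scale_cycle_to_peak(cycle, target_peak):
    """Scale one pitch waveform cycle so that its peak absolute value
    equals target_peak. A silent cycle is returned unchanged."""
    peak = max((abs(s) for s in cycle), default=0.0)
    if peak == 0:
        return list(cycle)
    gain = target_peak / peak
    return [s * gain for s in cycle]


def match_power_per_cycle(cycles, template_peaks):
    """Each cycle takes the peak value of the corresponding template cycle."""
    if len(cycles) != len(template_peaks):
        raise ValueError("cycle counts differ; align the vowel first")
    return [scale_cycle_to_peak(c, p) for c, p in zip(cycles, template_peaks)]


def match_power_to_average(cycles, average_peak):
    """Every cycle of the vowel takes the same average peak value."""
    return [scale_cycle_to_peak(c, average_peak) for c in cycles]


vowel = [[0.0, 0.5, -0.25], [0.0, 0.25, -0.125]]
print(match_power_per_cycle(vowel, [1.0, 0.5]))
# prints [[0.0, 1.0, -0.5], [0.0, 0.5, -0.25]]
```

The per-cycle form assumes that the object vowel and the template vowel contain the same number of pitch waveform cycles, as would be the case once durations and pitch periods have been matched.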
In addition, the data expressing a pitch variation characteristic of each stored prosodic template can consist of data which specifies the respective pitch periods of each set of pitch waveform cycles constituting a vowel portion of an enunciation of the reference syllable. In that case, the pitch characteristic of the object speech item is brought close to that of the selected prosodic template by executing waveform shaping of each of the pitch waveform cycles constituting each vowel portion expressed by the acoustic waveform segments, such as to make each pitch period substantially identical to that of the corresponding pitch waveform cycle in the corresponding vowel portion which is expressed by the pitch data of the selected prosodic template.
Furthermore, in addition to adjustment of the pitch of vowels of the object speech item, it is also possible to adjust the pitch of each voiced consonant of the object speech item to match that of the corresponding portion of the selected prosodic template.
As a further alternative, each vowel portion of a syllable expressed by a prosodic template is divided into a plurality of sections, such as three or four sections, and respective average values of pitch period and average values of peak value are derived for each of these sections. The pitch period average values are stored as the pitch data of a prosodic template, while the peak value average values are stored as the speech power data of the template. In that case, the pitch characteristic of an object speech item is brought close to that of the selected prosodic template by dividing each vowel into the aforementioned plurality of sections and executing waveform shaping of each of the pitch waveform cycles constituting each section, as expressed by the aforementioned acoustic waveform segments, to make the pitch period in each of these vowel sections substantially identical to the average pitch period of the corresponding section of the corresponding vowel portion as expressed by pitch data of the selected prosodic template.
Similarly, the speech power characteristic of the object speech item is brought close to that of the selected prosodic template by executing waveform shaping of each of the pitch waveform cycles constituting each section of each vowel expressed by the acoustic waveform segments such as to make the peak value throughout each of these vowel sections substantially identical to the average peak value of the corresponding section of the corresponding vowel portion as expressed by the speech power data of the selected prosodic template.
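The sectioned form of template data described above can be illustrated as follows. The sketch takes the per-cycle values measured for one vowel (pitch periods or peak values), divides them into a given number of contiguous sections, and returns the average for each section. The description does not state how the section boundaries are chosen; dividing by cycle count into sections of nearly equal size is an assumption made for this example, as is the function name.

```python
def section_averages(per_cycle_values, n_sections):
    """Average of per-cycle values over n contiguous sections of a vowel.

    per_cycle_values : one value per pitch waveform cycle, e.g. the
                       pitch period or the peak value of that cycle
    n_sections       : number of sections, e.g. 3 or 4
    Assumption: sections hold nearly equal numbers of cycles.
    """
    n = len(per_cycle_values)
    if n_sections < 1 or n_sections > n:
        raise ValueError("need between 1 and len(values) sections")
    averages = []
    for k in range(n_sections):
        start = k * n // n_sections
        end = (k + 1) * n // n_sections
        section = per_cycle_values[start:end]
        averages.append(sum(section) / len(section))
    return averages


# Pitch periods (ms) of six cycles of one vowel, three sections.
print(section_averages([10, 10, 12, 12, 14, 14], 3))
# prints [10.0, 12.0, 14.0]
```

The same routine serves for both data sets: applied to pitch periods it yields the pitch data of a template, and applied to peak values it yields the speech power data.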