This invention relates to a speech synthesizing apparatus for selecting and connecting speech segments to synthesize speech, on the basis of phonetic information to be subjected to speech synthesis, and also to a recording medium that stores a text-to-speech conversion program and can be read mechanically.
Attempts to make a computer recognize patterns or understand and express a natural language are now under way. For example, a speech synthesizing apparatus is one means by which a computer produces speech, and it can realize communication between computers and human beings.
Speech synthesizing apparatuses of this type use various speech output methods, such as a waveform encoding method and a parameter expression method. A rule-based synthesizing apparatus is a typical example: it subdivides a sound into sound components, accumulates them, and combines them into an arbitrary sound.
Referring now to FIG. 1, a conventional example of the rule-based synthesizing apparatus will be described.
FIG. 1 is a block diagram illustrating the conventional rule-based synthesizing apparatus. This apparatus performs text-to-speech conversion (hereinafter referred to as "TTS"), in which input text data (hereinafter referred to simply as a "text") is converted into a phonetic symbol string that consists of phoneme information (information concerning pronunciation) and prosodic information (information concerning the syntactic structure, lexical accent, etc. of a sentence), and speech is then created from the phonetic symbol string. A TTS processing mechanism employed in the rule-based synthesizing apparatus of FIG. 1 comprises a linguistic processing section 32 for analyzing the language of a text 31, and a speech synthesizing section 33 for performing speech synthesizing processing on the basis of the output of the linguistic processing section 32.
For example, rule-based synthesis of Japanese is generally executed as follows:
First, the linguistic processing section 32 performs morphological analysis, in which a text (including Chinese characters and Japanese syllabaries) input from a text file 31 is dissected into morphemes, and then performs linguistic processing such as syntactic structure analysis. After that, the linguistic processing section 32 determines the "type of accent" of each morpheme based on "phoneme information" and the position of the accent. Subsequently, the linguistic processing section 32 determines the "accent type" of each phrase that serves as a pause during vocalization (hereinafter referred to as an "accent phrase").
The text data processed by the linguistic processing section 32 is supplied to the speech synthesizing section 33.
In the speech synthesizing section 33, first, a phoneme duration determining/processing section 34 determines the duration of each phoneme included in the above "phoneme information".
Subsequently, a phonetic parameter generating section 36 reads necessary speech segments from a speech segment storage 35 that stores a great number of pre-created speech segments, on the basis of the above "phoneme information". The section 36 then connects the read speech segments while expanding and contracting them along the time axis, thereby generating a characteristic parameter series for to-be-synthesized speech.
Further, in the speech synthesizing section 33, a pitch pattern creating section 37 sets point pitches on the basis of each accent type and performs linear interpolation between each pair of adjacent point pitches, thereby creating the accent component of the pitch. Moreover, the pitch pattern creating section 37 creates a pitch pattern by superposing the accent component on an intonation component, which represents a gradual lowering of pitch.
Finally, a synthesizing filter section 38 synthesizes desired speech by filtering.
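By way of illustration only, the pitch pattern creation performed by the section 37 may be sketched as follows. This is a minimal sketch under stated assumptions and is not the implementation of the conventional apparatus: the function names, the frame-based time axis, the representation of point pitches as offsets in Hz, and the superposition by simple addition are all hypothetical choices made for the example.

```python
def accent_component(points, n_frames):
    """Linearly interpolate between adjacent point pitches.

    `points` is a list of (frame_index, pitch_offset_hz) pairs with strictly
    increasing frame indices (an assumption of this sketch). Frames outside
    the first/last point hold the nearest point's value.
    """
    contour = []
    for frame in range(n_frames):
        if frame <= points[0][0]:
            contour.append(points[0][1])
        elif frame >= points[-1][0]:
            contour.append(points[-1][1])
        else:
            # Find the pair of adjacent point pitches enclosing this frame.
            for (f0, p0), (f1, p1) in zip(points, points[1:]):
                if f0 <= frame <= f1:
                    t = (frame - f0) / (f1 - f0)
                    contour.append(p0 + t * (p1 - p0))
                    break
    return contour


def intonation_component(n_frames, start_hz, end_hz):
    """Gradual lowering of pitch, modeled here as a straight line (assumption)."""
    if n_frames == 1:
        return [start_hz]
    step = (end_hz - start_hz) / (n_frames - 1)
    return [start_hz + step * i for i in range(n_frames)]


def pitch_pattern(points, n_frames, start_hz, end_hz):
    """Superpose the accent component on the intonation component by addition."""
    accent = accent_component(points, n_frames)
    intonation = intonation_component(n_frames, start_hz, end_hz)
    return [a + b for a, b in zip(accent, intonation)]


# Three point pitches over five frames, on a baseline falling from 120 Hz to 100 Hz.
pattern = pitch_pattern([(0, 0.0), (2, 30.0), (4, 0.0)], 5, 120.0, 100.0)
print(pattern)
```

In this sketch the accent component rises to its peak at the middle point pitch and falls again, while the intonation component lowers the whole contour gradually toward the end of the utterance.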
In general, when a person speaks, he or she intentionally or unintentionally vocalizes a particular portion of the speech so as to make it easier to hear than other portions. The particular portion is, for example, a portion where a word that plays an important role in conveying the meaning of the speech is vocalized, where a certain word is vocalized for the first time in the speech, or where a word that is not familiar to the speaker or to the listener is vocalized. It is also a portion where a word is vocalized while another word with a similar pronunciation exists in the speech, so that the listener may mistake the meaning of the word. On the other hand, at portions of the speech other than the above, a person sometimes vocalizes a word in a manner that is not so easy to hear, or that is rather ambiguous. This is because the listener will easily understand the word even if it is vocalized rather ambiguously.
However, the conventional speech synthesizing apparatus represented by the above-described rule-based synthesizing apparatus has only one type of speech segment for each type of synthesis unit, and hence speech synthesis is always executed using speech segments that have the same degree of "intelligibility". Accordingly, the conventional speech synthesizing apparatus cannot adjust the degree of "intelligibility" of synthesized sounds. Therefore, if only speech segments that have an average degree of hearing easiness are used, it is difficult for the listener to hear those portions where a word should be vocalized in a manner easy to hear, as mentioned above. On the other hand, if only speech segments that have a high degree of hearing easiness are used, all portions of all sentences are vocalized with clear pronunciation, which means that the synthesized sounds do not sound smooth to the listener.
In addition, there exists another type of conventional speech synthesizing apparatus, in which a plurality of speech segments are prepared for one type of synthesis unit. However, it also has the above-described drawback, since different speech segments are used for each type of synthesis unit in accordance with the phonetic or prosodic context, irrespective of any adjustment of "intelligibility".
The present invention has been developed in light of the above, and is aimed at providing a speech synthesizing apparatus, in which a plurality of speech segments of different degrees of intelligibility for each type of unit are prepared, and are changed from one to another in the TTS processing in accordance with the state of vocalization, so that speech is synthesized in a manner in which the listener can easily hear it and does not tire even after hearing it for a long time. The invention is also aimed at providing a mechanically readable recording medium that stores a text-to-speech conversion program.
According to an aspect of the invention, there is provided a speech synthesizing apparatus comprising: text analyzing means for dissecting and analyzing text data, subjected to speech synthesis, into to-be-synthesized units and analyzing each to-be-synthesized unit, thereby obtaining a text analysis result; a speech segment dictionary that stores speech segments prepared for each of a plurality of ranks of intelligibility; determining means for determining in which rank a present degree of intelligibility is included, on the basis of the text analysis result; and synthesized-speech generating means for selecting speech segments stored in the speech segment dictionary and each included in a rank corresponding to the determined rank, and then connecting the speech segments to generate synthetic speech.
According to another aspect of the invention, there is provided a mechanically readable recording medium storing a text-to-speech conversion program for causing a computer to execute the steps of: dissecting text data, to be subjected to speech synthesis, into to-be-synthesized units, and analyzing the units to obtain a text analysis result; determining, on the basis of the text analysis result, a degree of intelligibility of each to-be-synthesized unit; and selecting, on the basis of the determination result, speech segments of a degree corresponding to each of the to-be-synthesized units, from a speech segment dictionary in which speech segments of a plurality of degrees of intelligibility are stored, and connecting the speech segments to obtain synthetic speech.
According to a further aspect of the invention, there is provided a mechanically readable recording medium storing a text-to-speech conversion program for causing a computer to execute the steps of: dissecting text data, to be subjected to speech synthesis, into to-be-synthesized units, and analyzing the to-be-synthesized units to obtain a text analysis result for each to-be-synthesized unit, the text analysis result including at least one of information items concerning grammar, meaning, familiarity and pronunciation; determining a degree of intelligibility of each to-be-synthesized unit, on the basis of the at least one of the information items concerning the grammar, meaning, familiarity and pronunciation; and selecting, on the basis of the determination result, speech segments of a degree corresponding to each of the to-be-synthesized units, from a speech segment dictionary that stores speech segments of a plurality of degrees of intelligibility for each to-be-synthesized unit, and connecting the speech segments to obtain synthetic speech.
In the above structure, the degree of intelligibility of a to-be-synthesized text is determined for each to-be-synthesized unit on the basis of a text analysis result obtained by text analysis, and speech segments of a degree corresponding to the determination result, which can be synthesized, are selected and connected, thereby creating corresponding speech. Accordingly, the contents of synthesized speech can be made easily understandable by using speech segments of a degree corresponding to a high intelligibility, for the portion of a text indicated by the text data, which is considered important for the users to estimate the meaning of the text, and using speech segments of a degree corresponding to a low intelligibility for other portions of the text.
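By way of illustration only, the determination of an intelligibility rank and the selection of speech segments in the above structure may be sketched as follows. This is a minimal sketch under stated assumptions: the class `AnalyzedUnit`, the scoring rule in `determine_rank`, the use of three ranks, and the placeholder strings standing in for stored speech segments are all hypothetical and are not taken from the embodiments.

```python
from dataclasses import dataclass


@dataclass
class AnalyzedUnit:
    """Hypothetical text analysis result for one to-be-synthesized unit."""
    label: str                              # synthesis unit label, e.g. "ka"
    is_important: bool = False              # meaning: important to the sentence
    first_occurrence: bool = False          # first appearance in the text
    unfamiliar: bool = False                # familiarity: rare or unfamiliar word
    has_similar_sounding_word: bool = False # pronunciation: confusable word exists


def determine_rank(unit):
    """Map the analysis result to a rank (0 = highest intelligibility).

    The scoring rule is an assumption of this sketch: each cue counts one point.
    """
    score = sum([unit.is_important, unit.first_occurrence,
                 unit.unfamiliar, unit.has_similar_sounding_word])
    if score >= 2:
        return 0
    if score == 1:
        return 1
    return 2


# Speech segment dictionary: one set of segments per rank of intelligibility.
# The strings are placeholders for stored segment data.
SEGMENT_DICTIONARY = {
    0: {"ka": "ka_clear", "wa": "wa_clear"},
    1: {"ka": "ka_normal", "wa": "wa_normal"},
    2: {"ka": "ka_relaxed", "wa": "wa_relaxed"},
}


def select_segments(units):
    """Select, for each unit, the segment stored under its determined rank."""
    return [SEGMENT_DICTIONARY[determine_rank(u)][u.label] for u in units]


units = [
    AnalyzedUnit("ka", is_important=True, first_occurrence=True),
    AnalyzedUnit("wa"),
]
print(select_segments(units))
```

In this sketch, a unit judged important to understanding the text receives a clearly articulated segment, while the remaining units receive more relaxed segments; the selected segments would then be connected to generate the synthetic speech.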
Additional objects and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objects and advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out hereinafter.