1. Field of the Invention
The present invention relates to a record sentence generation method, and more particularly, to a method for automatically generating a record sentence that is a subject of speech corpus building.
2. Description of the Related Art
Speech synthesis is the conversion of a visually recognizable sentence of text into an acoustically recognizable sentence of speech. Speech synthesis is generally used in automatic response systems, mobile phone number retrieval, and automatic announcement systems in public places.
A conventional speech synthesis apparatus extracts text information from a sentence of text, selects the most appropriate prerecorded vocal elements according to the extracted text information, and combines the selected vocal elements to generate a sentence of speech. Here, a speech unit obtained by dividing prerecorded speech into parts of a predetermined size is referred to as a candidate synthesis unit.
A synthesis unit database is built from a database referred to as a speech corpus. The speech corpus is established by prerecording sentences drawn from common sources or sentences that are frequently used. For example, the sources may be novels, news articles, and academic publications. A speech synthesis method based on this type of speech corpus is referred to as corpus-based speech synthesis (CSS).
The quality of speech synthesized by CSS depends on the method of establishing the speech corpus and the amount of speech stored in the speech corpus. However, since it is impossible to store all possible sentences of speech in a speech corpus, there is inevitably quality degradation due to an unseen unit in a synthesized sentence. Specifically, when a speech synthesizer cannot obtain a speech unit of satisfactory quality from the candidate synthesis units extracted from a speech corpus, a less-than-satisfactory candidate synthesis unit is selected as the synthesis unit, and such a unit is referred to as an “unseen unit”.
The unseen unit is a major cause of quality degradation of a synthesized sentence of speech. To solve the unseen unit problem, U.S. Pat. No. 6,505,158 suggests a likely unit replacement method and Korean Patent Application No. 2001-95385 suggests a method using a multi-stage synthesis unit.
In the likely unit replacement method, the most likely candidate synthesis unit is selected and used for replacement according to the likeness between a current phoneme and its preceding and succeeding phonemes. In the method using a multi-stage synthesis unit, when there is no desired candidate synthesis unit, a smaller synthesis unit is selected and used for replacement.
However, in the likely unit replacement method, even when the likeness is high, phoneme transitions and the like may cause phonemes to have totally different sound values, so the method cannot prevent degradation of speech quality. In the method using a multi-stage synthesis unit, the smaller the unit used in synthesis, the larger the probability of errors occurring at the connection parts. In both methods, when the replacement unit is also an unseen unit, replacement itself becomes impossible.
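The multi-stage fallback and its failure case can be illustrated with a minimal sketch. It is not taken from either cited reference; the function name `choose_stage`, the string unit labels, and the representation of the inventory as a set are assumptions made only for illustration. Each target is given as a list of decompositions ordered from the largest unit to the smallest, and the sketch returns the first stage whose units are all present, together with the number of connection points that stage requires.

```python
from typing import List, Optional, Set, Tuple


def choose_stage(
    decompositions: List[List[str]],  # hypothetical: largest units first
    inventory: Set[str],              # hypothetical: labels of recorded units
) -> Optional[Tuple[int, List[str], int]]:
    """Return (stage index, units, number of joins) for the first usable stage.

    Returns None when every stage contains at least one unit missing from
    the inventory, i.e. when the replacement units are themselves unseen.
    """
    for stage, units in enumerate(decompositions):
        if all(unit in inventory for unit in units):
            # Smaller units mean more joins, hence more connection points
            # at which concatenation errors can occur.
            return stage, units, len(units) - 1
    return None


# Illustrative data only: one target decomposed at three stages.
target = [["hello"], ["he", "llo"], ["h", "e", "l", "l", "o"]]
result = choose_stage(target, {"he", "llo", "h"})
no_result = choose_stage(target, {"h"})
```

In this sketch `result` selects the second stage with one join, whereas `no_result` is `None` because no stage is fully covered, which corresponds to the case in which replacement itself becomes impossible.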
Accordingly, the most basic method for solving the unseen unit problem is to maximize the efficiency of a speech corpus. The efficiency of a speech corpus may be increased by building the speech corpus such that a relatively small number of sentences of speech can cover a large number of unseen units. Thus, a script to be read by a voice actor, that is, record sentences, must be selected appropriately such that a small number of record sentences cover a large number of unseen units.
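The goal of covering many unseen units with few record sentences resembles the classical set cover problem. The following is a minimal sketch of a greedy heuristic for that problem, offered only as an illustration and not as the method of the present invention. The function name `select_record_sentences`, the mapping from each candidate sentence to the set of units it contains, and the `max_sentences` limit are all assumptions.

```python
from typing import Dict, List, Set


def select_record_sentences(
    candidates: Dict[str, Set[str]],  # hypothetical: sentence -> units it contains
    unseen_units: Set[str],           # units the current corpus cannot supply
    max_sentences: int,               # assumed recording budget
) -> List[str]:
    """Greedily pick sentences that cover the most still-uncovered unseen units."""
    remaining = set(unseen_units)
    selected: List[str] = []
    while remaining and len(selected) < max_sentences:
        # Iterate in sorted order so that ties are broken deterministically.
        best = max(sorted(candidates), key=lambda s: len(candidates[s] & remaining))
        gain = candidates[best] & remaining
        if not gain:
            break  # no candidate covers any remaining unseen unit
        selected.append(best)
        remaining -= gain
    return selected


# Illustrative data only.
example_candidates = {
    "a": {"u1", "u2"},
    "b": {"u2", "u3", "u4"},
    "c": {"u1"},
}
chosen = select_record_sentences(example_candidates, {"u1", "u2", "u3", "u4"}, 2)
```

Here two sentences suffice to cover all four unseen units. A greedy choice does not guarantee the smallest possible script, but it is a common and simple approximation.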
FIG. 1 is a diagram showing a conventional method of establishing a speech corpus.
A text database 110 having sentences of text extracted from various books and publications is established. The text database 110 includes sentences of text and additional information including syntax and morpheme information on the sentences of text. A sentence extracted from the text database 110 is converted into a sentence of speech with a speech signal waveform by being spoken by a voice actor and recorded. The converted sentences of speech and related information form a speech corpus 100. The established speech corpus 100 includes information on a sentence of text underlying a sentence of speech, additional information on the sentence of text, a signal waveform indicating the sentence of speech, mapping information between the sentence of speech and the sentence of text, and the label of a phoneme included in the sentence of speech.
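The contents of one entry of the speech corpus 100 listed above can be pictured as a simple record. The sketch below is only one possible layout under stated assumptions: the class and field names, the representation of the waveform as a list of samples, the sample rate field, and the time-stamped form of the mapping and the phoneme labels are all hypothetical and are not specified in the description above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PhonemeLabel:
    phoneme: str      # label of a phoneme in the sentence of speech
    start_sec: float  # assumed: start time within the waveform
    end_sec: float    # assumed: end time within the waveform


@dataclass
class CorpusEntry:
    text: str               # sentence of text underlying the sentence of speech
    syntax_info: str        # additional information: syntax (format assumed)
    morphemes: List[str]    # additional information: morphemes
    waveform: List[float]   # signal waveform of the recorded sentence of speech
    sample_rate_hz: int     # assumed: needed to interpret the waveform
    # Assumed mapping form: (text start index, text end index, start sec, end sec)
    text_speech_map: List[Tuple[int, int, float, float]] = field(default_factory=list)
    phoneme_labels: List[PhonemeLabel] = field(default_factory=list)

    def duration_sec(self) -> float:
        """Length of the recorded sentence of speech in seconds."""
        return len(self.waveform) / self.sample_rate_hz


# Illustrative data only: one second of silence at 16 kHz.
entry = CorpusEntry(
    text="hello",
    syntax_info="interjection",
    morphemes=["hello"],
    waveform=[0.0] * 16000,
    sample_rate_hz=16000,
    phoneme_labels=[PhonemeLabel("h", 0.0, 0.1)],
)
```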
The established speech corpus 100 is used to build a synthesis database 120 which is used in a variety of speech synthesis fields. The synthesis database 120 is included inside a speech synthesizer, and is formed with information extracted from the speech corpus and processed appropriately for a particular application field.
However, the conventional method for establishing a speech corpus has a unidirectional structure. The steps of establishing the text database 110, selecting appropriate record sentences from the text database 110, recording and storing the selected record sentences to form the speech corpus 100, and using the speech corpus 100 to form the synthesis database 120 are performed in only one direction. Accordingly, unseen unit problems caused by new speech synthesis performed after the speech corpus 100 is built cannot be solved.