1. Field of the Invention
The present invention generally relates to a method and system for providing an improved ability to create a cohesive script for generating a speech corpus (e.g., voice database) for concatenative Text-To-Speech synthesis (“concatenative TTS”), and more particularly, for providing improved quality of that speech corpus resulting from greater fluency and more-natural prosody in the recordings based on the cohesive script.
For purposes of this disclosure, “phoneme” means the smallest unit of speech used in linguistic analysis. For example, the sound represented by “s” is a phoneme. However, for generality, where “phoneme” appears below it can refer to shorter units, such as fractions of a phoneme (e.g., the “burst portion of t” or the “first ⅓ of s”), or longer units, such as syllables.
Also, the sounds represented by “sh” or “k” are examples of phonemes which have unambiguous pronunciations. It is noted that the number of phonemes in a word is not equivalent to its number of letters. That is, two letters (e.g., “sh”) can make one phoneme, and one letter (e.g., “x”) can make two phonemes, “k” and “s”.
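The letter/phoneme distinction above can be illustrated with a small sketch. The word-to-phoneme entries below are hand-written for illustration only and do not follow any particular standardized phoneme inventory:

```python
# Illustrative word-to-phoneme entries showing that letter count and
# phoneme count differ. The mappings are hand-written examples.
PHONEMES = {
    "ship": ["sh", "i", "p"],      # 4 letters, 3 phonemes ("sh" is one phoneme)
    "box":  ["b", "o", "k", "s"],  # 3 letters, 4 phonemes ("x" -> "k" + "s")
}

for word, phones in PHONEMES.items():
    print(f"{word}: {len(word)} letters, {len(phones)} phonemes")
```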
As another example, English speakers generally have a repertoire of about 40 phonemes and utter about 10 phonemes per second. However, the ordinarily skilled artisan would understand that the present invention is not limited to any particular language (e.g., English) or number of phonemes (e.g., 40). The features described herein with reference to the English language are provided for exemplary purposes only.
For purposes of this disclosure, “concatenative” means joining together sequences of recordings of phonemes. Here, “phonemes” are linguistic units; for example, there is one phoneme “k”. However, a concatenative system will employ many recordings of “k”, such as one from the beginning of “kook” and another from “keep”, which sound considerably different.
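The idea of keeping several context-dependent recordings of one phoneme can be sketched as follows. The unit inventory, its context labels, and the file names are hypothetical, used only to illustrate the concatenative selection step:

```python
# Minimal sketch of concatenative unit selection: the inventory may hold
# several recordings of the same phoneme, distinguished by context.
# All keys and file names here are hypothetical.
inventory = {
    ("k", "before_u"): "k_from_kook.wav",
    ("k", "before_i"): "k_from_keep.wav",
    ("a", "default"):  "a_generic.wav",
}

def select_units(phoneme_sequence):
    """Pick one recording per (phoneme, context) pair, falling back to a
    default-context unit when no context-specific recording exists."""
    chosen = []
    for phoneme, context in phoneme_sequence:
        unit = inventory.get((phoneme, context)) or inventory.get((phoneme, "default"))
        chosen.append(unit)
    return chosen  # the selected recordings would then be joined in order
```

For instance, `select_units([("k", "before_i")])` selects the recording taken from “keep” rather than the one from “kook”.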
Also, for purposes of this disclosure, a “text database” means any collection of text, for example, a collection of existing sentences, phrases, words, etc., or combinations thereof. A “script” generally means a written text document, or collection of words, sentences, etc., which can be read by a professional speaker to generate a speech database, or a speech corpus (or corpora). A “speech corpus” (or “speech corpora”) generally means a collection of speech recordings or audio recordings (e.g., which are generated by reading a script).
2. Description of the Conventional Art
Conventional systems have been developed to perform concatenative TTS. Generally, in conventional methods and systems, the first step in creating a speech corpus for concatenative TTS software is recording a professional speaker reading a very large “script”. Such scripts typically can include about 10,000 sentences. Thus, this first step can take two to three weeks to complete.
The conventional script generally is made up largely of words and phrases that are chosen for their diverse phoneme content, to ensure ample representation of most or all of the English phoneme sequences.
A conventional method of generating the script (i.e., gathering these phonemically-rich sentences), is by data mining. For purposes of this disclosure, “data mining” generally includes, for example, searching through a very large text database to find words or word sequences containing the required phoneme sequences.
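The data-mining step described above can be sketched as a substring search over phonetic transcriptions. The `transcribe` function below is a hypothetical stand-in (treating letters as phonemes) for a real grapheme-to-phoneme front end:

```python
# Sketch of the data-mining step: scan a text database for sentences
# whose phonetic transcription contains a required phoneme sequence.
def transcribe(sentence):
    """Hypothetical stand-in for a grapheme-to-phoneme front end:
    treats each letter as one 'phoneme'."""
    return [c for c in sentence.lower() if c.isalpha()]

def mine(text_database, required_sequence):
    """Return every sentence whose transcription contains required_sequence."""
    hits = []
    for sentence in text_database:
        phones = transcribe(sentence)
        n = len(required_sequence)
        if any(phones[i:i + n] == required_sequence
               for i in range(len(phones) - n + 1)):
            hits.append(sentence)
    return hits
```

In a real system, the database scanned by `mine` could contain millions of sentences, which is what makes the superfluous-word problem described below so acute.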
The conventional methods, however, have several drawbacks or disadvantages. For example:
1) A database sufficiently large to deliver the required phonemic content generally may contain many sentences with grammatical errors, poor writing, non-English words, and other impediments to smooth oral delivery by the speaker.
2) The conventional systems and methods generally are extremely inefficient.
For example, a rare phoneme sequence may be found embedded in a 20-word sentence. Incorporating this 20-word sentence into the script thus provides one useful word but also drags 19 superfluous words along with it. As a result, the length of the script is undesirably increased. Omitting the superfluous words, however, would preclude smooth reading of the sentences.
Scripts that are generated by conventional methods and systems contain numerous examples of this problem. That is, a script is generated by conventional means to include a long difficult sentence solely for the purpose of providing one essential word (or phrase, etc.).
3) In conventional methods and systems, because sentences are chosen independently of each other, it follows that they can be (and generally are) very dissimilar in subject matter, writing quality, word count, sentence structure, etc. Such dissimilarities provide the speaker with a very difficult reading task.
For example, rather than one sentence flowing sensibly into another, as ordinary prose generally does, a script developed according to the conventional methods and systems can read more like a hodgepodge of often awkward sentences that are stripped of their original context. Thus, professional speakers who are called upon to read these conventional scripts, for example, for three hours or more in a single stretch of time, usually consider the task to be an onerous one, which can affect the quality of the reading.
4) In conventional methods and systems, it generally is difficult for the speaker to read the generated script well.
For example, there generally is no overarching or overall meaning, so it can be difficult for the speaker to know what to emphasize or how to give natural prosody to the script. Such dissimilar material lends itself to an inconsistent reading style, which creates inconsistencies in the corpus (e.g., the speech corpus generated by reading the script) that harm TTS quality.
Also, since the speaker's reading prosody will be analyzed and ultimately incorporated into the product, this lack of natural reading prosody has a deleterious effect on the final TTS output.
Applicants have recognized that, as the focus of advancement of TTS technology progresses from segmental quality to prosody and expression, such awkward material generated by the conventional methods and systems becomes a greater and greater hindrance to the improvement of the art.
The conventional methods and systems have not addressed or provided any acceptable solutions to this problem other than, for example, merely minimizing the problem (instead of solving the problem) using stopgap measures such as editing the script by hand. Applicants have recognized that such conventional methods and systems, for example, using stopgap measures, increasingly are impractical because computer memory and computation power continually enable datasets to expand.
5) Moreover, Applicants have recognized that, even if the speaker were to overcome the onerous-reading problem, the conventional hodgepodge of often awkward sentences also makes it difficult to gather a speech corpus which provides examples of the prosody unique to longer coherent passages, such as paragraph-level phenomena, de-accenting of repeated words as a function of how recently they had appeared, etc.
While a search could be made to gather paragraphs instead of sentences, the problem remains: incorporating a paragraph (or paragraphs) into the script to provide one example of paragraph-level phenomena would drag superfluous words and/or sentences along with it. Thus, the length of the script undesirably would be increased, thereby exacerbating the problem described above with respect to dragging superfluous words into the script.
Practically, one approach used to address this problem is to have separate text database sections—one focused on phonemic coverage, and another on longer-passage fluency. However, this approach is undesirable, for example, because it is inefficient, in that neither of the separate text database sections contributes to the measured coverage of the other.
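The notion of measured coverage mentioned above can be sketched as the fraction of required phoneme sequences that appear somewhere in a script section's transcription. The counting scheme below is a hypothetical simplification:

```python
# Sketch of measuring phoneme-sequence coverage: the fraction of the
# required sequences that occur somewhere in a script section.
def coverage(script_phones, required_sequences):
    """script_phones: flat list of phonemes for one script section.
    required_sequences: list of phoneme sequences that must be covered."""
    covered = 0
    for seq in required_sequences:
        n = len(seq)
        if any(script_phones[i:i + n] == seq
               for i in range(len(script_phones) - n + 1)):
            covered += 1
    return covered / len(required_sequences)
```

Under this view, the inefficiency of the two-section approach is that each section's phonemes are excluded from the `script_phones` argument used to score the other section, so sequences it happens to contain are counted as uncovered.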