Generally speaking, a “speech synthesizer” is a computer device or system for generating audible speech from written text. That is, a written form of a string or sequence of characters (e.g., a sentence) is provided as input, and the speech synthesizer generates the spoken equivalent or audible characterization of the input. The generated speech output is not merely a literal reading of each input character, but a language dependent, in-context verbalization of the input. If the input was the phone number (508) 691-1234 given in response to a prior question of “What is your phone number?”, the speech synthesizer does not produce the reading “parenthesis, five hundred eight, close parenthesis, six hundred ninety-one . . .” Instead, the speech synthesizer recognizes the context and supporting punctuation and produces the spoken equivalent “five (pause) zero (pause) eight (pause) six . . .” just as an English-speaking person normally pronounces a phone number.
Historically the first speech synthesizers were formed of a dictionary, engine and digital vocalizer. The dictionary served as a look-up table. That is, the dictionary cross referenced the text or visual form of a character string (e.g., word or other unit) and the phonetic pronunciation of the character string/word. In linguistic terms the visual form of a character string unit (e.g., word) is called a “grapheme” and the corresponding phonetic pronunciation is termed a “phoneme”. The phonetic pronunciation or phoneme of character string units is indicated by symbols from a predetermined set of phonetic symbols. To date, there is little standardization of phoneme symbol sets and usage of the same in speech synthesizers.
The engine is the working or processing member that searches the dictionary for a character string unit (or combinations thereof) matching the input text. In basic terms, the engine performs pattern matching between the sequence of characters in the input text and the sequence of characters in “words” (character string units) listed in the dictionary. Upon finding a match, the engine obtains from the dictionary entry (or combination of entries) of the matching word (or combination of words), the corresponding phoneme or combination of phonemes. To that end, the purpose of the engine is thought of as translating a grapheme (input text) to a corresponding phoneme (the corresponding symbols indicating pronunciation of the input text).
Typically the engine employs a binary search through the dictionary for the input text. The dictionary is loaded into the computer processor physical memory space (RAM) along with the speech synthesizer program. The memory footprint, i.e., the physical memory space in RAM needed while running the speech synthesizer program, thus must be large enough to hold the dictionary. Because the dictionary portion of today's speech synthesizers continues to grow in size, the memory footprint is problematic due to the limited available memory (RAM and ROM) in some applications.
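By way of illustration only, the binary-search dictionary look-up described above can be sketched in Python as follows. The dictionary contents, the phoneme symbols and the names DICTIONARY, GRAPHEMES and lookup are hypothetical and are not taken from any particular synthesizer.

```python
from bisect import bisect_left

# Hypothetical toy dictionary: (grapheme, phoneme string) pairs sorted by grapheme.
# The phoneme symbols are illustrative only, not from any standard symbol set.
DICTIONARY = sorted([
    ("cat", "k ae t"),
    ("dog", "d ao g"),
    ("phone", "f ow n"),
])
GRAPHEMES = [grapheme for grapheme, _ in DICTIONARY]

def lookup(word):
    """Binary-search the sorted graphemes; return the phoneme string or None."""
    i = bisect_left(GRAPHEMES, word)
    if i < len(GRAPHEMES) and GRAPHEMES[i] == word:
        return DICTIONARY[i][1]
    return None  # word is not listed in the dictionary
```

The whole table must be resident in memory for the search to run, which is the footprint concern noted above.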
The digital vocalizer receives the phoneme data generated by the engine. Based on the phoneme data together with timing and stress data, the digital vocalizer generates sound signals for “reading” or “speaking” the input text. Typically, the digital vocalizer employs a sound and speaker system for producing the audible characterization of the input text.
To improve on memory requirements of speech synthesizers, another design was developed. In that design, the dictionary is replaced by a rule set. Alternatively, the rule set is used in combination with the dictionary instead of completely substituting therefor. At any rate, the rule set is a group of statements in the form
IF (condition)-then-(phonemic result)
Each such statement determines the phoneme for a grapheme that matches the IF condition. Examples of rule-based speech synthesizers are DECtalk by Digital Equipment Corporation of Maynard, Massachusetts and TrueVoice by Centigram Communications of San Jose, Calif.
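As a minimal sketch of how statements of this form might be applied, the following Python example treats each rule as a pair in which the condition is "the text at the current position begins with this grapheme". The rule contents, the left-to-right scan and the name apply_rules are assumptions made for illustration; they do not describe how DECtalk or TrueVoice operate.

```python
# Hypothetical rules: IF the text at the current position starts with the
# grapheme, THEN emit the phonemic result. Symbols are illustrative only.
# Longer graphemes are listed first so that "ph" is tried before "p".
RULES = [
    ("ph", "f"),
    ("sh", "sh"),
    ("a", "ae"),
    ("t", "t"),
    ("p", "p"),
    ("h", "hh"),
    ("s", "s"),
]

def apply_rules(text):
    """Translate text to phonemes using the first matching rule at each position."""
    phonemes = []
    i = 0
    while i < len(text):
        for grapheme, phoneme in RULES:
            if text.startswith(grapheme, i):
                phonemes.append(phoneme)
                i += len(grapheme)
                break
        else:
            # No rule matched; a real system might consult a dictionary here.
            phonemes.append("?")
            i += 1
    return phonemes
```

Under these assumed rules, apply_rules("phat") yields the phonemes f, ae, t, while apply_rules("hat") yields hh, ae, t.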
Each rule (IF-then statement) is the result of years of linguistic study and is largely “manually” generated. Thus the formation of a rule set is a very time-consuming and language-dependent process. Further, there are few standards in this area.
These and other problems exist in speech synthesizer technology. New solutions have been attempted but with little success. As a result, highly accurate speech synthesizers are yet to come.
In particular, typically a speech-synthesizer developer starts with a very large dictionary. A human linguist specializing in language and speech analysis examines words and their corresponding pronunciation in view of respective part of speech, spelling and other linguistic factors. As such, the linguist manually extracts rules from the dictionary or uses his knowledge of the language to create some rules. Next the speech-synthesizer developer removes from the dictionary the words that can be synthesized from the newly created rules. The more words able to be removed from the dictionary due to a created rule, the better.
The task of manually analyzing words of a language for grapheme-to-phoneme patterns is a laborious and painstaking one. It may take several months or years of manual human effort for a linguist to analyze a language and produce a grapheme-to-phoneme rule set for that language. Not only is this process lengthy and complicated, it is also prone to error where the created rules are manually typed into a rule file. Some of the typographical errors may be caught in compiling the rule file. Those that are not caught typically result in rules which will never be matched and hence never utilized during text-to-speech processing.
The foregoing problems of manual grapheme-to-phoneme rule generation are overcome by the present invention. The present invention provides a computer method and system for automated rule and rule set generation. Instead of having a human linguist manually analyze dictionary entries to determine grapheme-to-phoneme patterns and manually type the rules into a rule set, the present invention provides automatic digital processing means and a statistical approach to create the best possible rule set. In turn, the present invention enables creation of the smallest possible dictionary associated with a reasonably sized set of rules to substitute for a single huge dictionary of the prior art which cannot be used for many embedded applications. Even in the future, if large memory becomes inexpensive and available, the small memory footprint of the present invention rule set and resulting dictionary will still be an advantage since one may use the extra available memory to store other information such as a domain dictionary, abbreviations, a phrase dictionary, etc.
In a preferred embodiment, each dictionary entry in an input dictionary is formed of (i) a sequence of one or more characters indicative of a subject character string, and (ii) a corresponding phonemic (phoneme string) representation formed of phonemic data parts. There is a different phonemic data part for each different subsequence of characters in the character string.
For each of the different subsequences of characters in the character string of a dictionary entry, the present invention (a) determines respective corresponding phonemic data parts found throughout the input dictionary for the subsequence of characters, and (b) from the determined respective corresponding phonemic data parts for the certain subsequence of characters, forms a grapheme-to-phoneme rule for indicating transformation from the certain subsequence of characters to at least one of the respective corresponding phonemic data parts. To that end the present invention generates grapheme-to-phoneme rules from the input dictionary. As such, the present invention effectively provides a rule generator.
By using the rule generator of the present invention, errors in rule creation are minimized, and a more accurate (less redundant and with fewer exceptions) set of rules is created. Also, rules may be generated by a computer according to the invention in far less time than manual human analysis of a dictionary.
In accordance with one aspect of the present invention, the input dictionary is formatted into a linked list. That is, each dictionary entry character string is linked to another entry character string to form a dictionary linked list.
Further, the determination of respective corresponding phonemic data parts of a subsequence of characters includes the steps of:
(a) for each character string entry in the dictionary linked list, comparing the character string entry to each of the succeeding character string entries in the dictionary linked list;
(b) for each comparison between a character string entry and a succeeding character string entry, determining a longest common subsequence of characters (preferably of three or more characters with one vowel) having a same respective location within the character string entries, the location being one of prefix, infix and suffix positions of a character string entry;
(c) storing in a linked list fashion, each determined longest common subsequence of characters and corresponding indication of location within the character string entries, each determined longest common subsequence of characters and its corresponding indication of location being a subgraph entry, such that a subgraph linked list is formed; and
(d) sorting the subgraph entries of the formed subgraph linked list such that the subgraph entry having the longest common subsequence of characters is first in the subgraph linked list, and any subgraph entry repeating another subgraph entry is omitted.
In the preferred embodiment, the step of sorting further includes, for subgraph entries having subsequences of a same length, sorting the subsequences alphabetically.
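The comparison and sorting steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the claimed implementation: ordinary Python lists and sets stand in for the linked lists, the common subsequence is taken to be a contiguous run of characters, "infix" is read as lying strictly inside both words, and the names qualifies, longest_common_infix and subgraphs are hypothetical.

```python
from itertools import combinations
from os.path import commonprefix  # character-wise common prefix of strings

VOWELS = set("aeiou")
MIN_LEN = 3  # the text prefers three or more characters with one vowel

def qualifies(s):
    return len(s) >= MIN_LEN and any(c in VOWELS for c in s)

def longest_common_infix(a, b):
    """Longest run of characters strictly inside both a and b (assumed reading)."""
    inner_a, inner_b = a[1:-1], b[1:-1]
    best = ""
    for i in range(len(inner_a)):
        # Only try runs longer than the best found so far.
        for j in range(i + len(best) + 1, len(inner_a) + 1):
            if inner_a[i:j] in inner_b:
                best = inner_a[i:j]
            else:
                break
    return best

def subgraphs(words):
    """Return (subsequence, location) entries, longest first, then alphabetical."""
    found = set()  # a set drops repeated subgraph entries
    for a, b in combinations(words, 2):  # each entry vs. each succeeding entry
        candidates = [
            (commonprefix([a, b]), "prefix"),
            (commonprefix([a[::-1], b[::-1]])[::-1], "suffix"),
            (longest_common_infix(a, b), "infix"),
        ]
        found.update((s, loc) for s, loc in candidates if qualifies(s))
    return sorted(found, key=lambda e: (-len(e[0]), e[0], e[1]))
```

For the three words "nation", "station" and "national", this sketch produces the entries ("nation", "prefix"), ("ation", "suffix") and ("atio", "infix"), in that order.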
In addition, for each subgraph entry in the subgraph linked list, the invention (a) determines which character string entries from the input dictionary have the subsequence of characters in the corresponding location of the subgraph entry; (b) for each determined character string entry, forms a word match entry, including indicating the corresponding phoneme of the determined character string entry; and (c) links the formed word match entries to each other and to the subgraph entry, such that a word match linked list is formed for and coupled to the subgraph entry.
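A minimal Python sketch of forming the word match entries for one subgraph entry follows. A plain list stands in for the word match linked list, "infix" is again assumed to mean strictly inside the word, and the sample dictionary, its phoneme symbols and the names has_subgraph and word_matches are hypothetical.

```python
def has_subgraph(word, sub, location):
    """True if sub occurs in word at the given location."""
    if location == "prefix":
        return word.startswith(sub)
    if location == "suffix":
        return word.endswith(sub)
    # infix: assumed to mean strictly inside the word
    return sub in word[1:-1]

def word_matches(dictionary, sub, location):
    """dictionary: list of (word, phoneme list) pairs.

    Returns the entries whose word contains sub at location, each paired
    with its phoneme so later steps can compare pronunciations.
    """
    return [(w, p) for w, p in dictionary if has_subgraph(w, sub, location)]

# Hypothetical input dictionary; phoneme symbols are illustrative only.
SAMPLE = [
    ("nation", ["n", "ey", "sh", "ax", "n"]),
    ("station", ["s", "t", "ey", "sh", "ax", "n"]),
    ("cat", ["k", "ae", "t"]),
]
```

With this sample, word_matches(SAMPLE, "ation", "suffix") returns the entries for "nation" and "station" and omits "cat".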
In the preferred embodiment, the invention method and system further include the steps of:
(i) for each word match entry in the word match linked list of the subgraph entry, comparing the phoneme indicated in the word match entry to phonemes indicated in succeeding word match entries, and finding a largest common phonemic data part of a same relative location in the phonemes;
(ii) for each found largest common phonemic data part, determining an occurrence count of the number of word match entries in which the phonemic data part occurs;
(iii) for each found largest common phonemic data part, forming a subphone entry indicating (a) the found largest common phonemic data part, (b) its corresponding location in the phonemes in terms of prefix, infix and suffix positions, and (c) the determined occurrence count; and
(iv) linking the formed subphone entries to each other and to the subgraph entry, such that a subphone linked list is formed for and coupled to the subgraph entry.
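Steps (i) through (iv) can be illustrated with the following Python sketch. For brevity it considers only prefix and suffix positions in the phonemes and omits the infix case; a sorted list stands in for the subphone linked list; and the names Subphone, common_prefix, occurs and subphones are hypothetical.

```python
from collections import namedtuple
from itertools import combinations

Subphone = namedtuple("Subphone", "phonemes location count")

def common_prefix(a, b):
    """Common leading run of two phoneme sequences, as a tuple."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return tuple(a[:n])

def occurs(phonemes, part, location):
    """True if part sits at the given end of the phoneme sequence."""
    if location == "prefix":
        return tuple(phonemes[:len(part)]) == part
    return tuple(phonemes[len(phonemes) - len(part):]) == part

def subphones(matches):
    """matches: list of (word, phoneme list) pairs sharing one subgraph entry."""
    parts = set()
    for (_, p), (_, q) in combinations(matches, 2):  # entry vs. succeeding entries
        pre = common_prefix(p, q)
        suf = common_prefix(p[::-1], q[::-1])[::-1]
        if pre:
            parts.add((pre, "prefix"))
        if suf:
            parts.add((suf, "suffix"))
    entries = [
        # Occurrence count: number of word match entries containing the part.
        Subphone(part, loc, sum(occurs(p, part, loc) for _, p in matches))
        for part, loc in parts
    ]
    return sorted(entries, key=lambda e: (-e.count, -len(e.phonemes), e.phonemes))
```

Sorting by descending occurrence count is a convenience of this sketch so that the most frequent phonemic data part appears first.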
In addition, for each word match entry in the word match linked list of a subgraph entry, the preferred embodiment
(a) selects from the subphone linked list of the subgraph entry, a subphone entry having phonemic data parts matching the phonemic data parts of the phoneme indicated in the word match entry and having a same corresponding location as the subgraph entry; and
(b) generates a grapheme-to-phoneme rule using the selected subphone entry, such that the rule indicates that the subsequence of characters in the subgraph entry occurring at its corresponding location within a character string, has a phonemic translation of the phonemic data parts of the selected subphone entry.
In accordance with another feature, the step of selecting a subphone entry further includes the steps of:
(a) if the corresponding location indicated in the subphone entry is prefix, verifying that a last phonemic data part of the subphone entry is a possible phonemic data part for a last character of the subgraph entry;
(b) if the corresponding location indicated in the subphone entry is suffix, verifying that a first phonemic data part of the subphone entry is a possible phonemic data part for a first character of the subgraph entry;
(c) if the corresponding location indicated in the subphone entry is infix, verifying that a last phonemic data part of the subphone entry is a possible phonemic data part for a last character of the subgraph entry, and that a first phonemic data part of the subphone entry is a possible phonemic data part for a first character of the subgraph entry;
(d) determining the subphone entry having a highest occurrence count; and
(e) verifying that length of the phonemic data parts of the subphone entry is greater than length of the sequence of characters in the subgraph entry plus or minus a predetermined amount, depending on the language of the dictionary.
As such, the preferred embodiment forms a grapheme-to-phoneme rule for indicating transformation from the subsequence of characters to the phonemic data part most frequently corresponding to the subsequence of characters throughout the input dictionary.
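The selection and rule-forming steps can be sketched in Python as follows, under several assumptions. The table POSSIBLE of phonemes that a letter may produce is hypothetical, as are the names boundary_ok and make_rule. The length verification is interpreted here as requiring the number of phonemic data parts to exceed the number of characters plus a signed, language-dependent offset; the default offset of -3 is an arbitrary illustrative value. The sketch also selects one subphone entry per subgraph entry rather than per word match entry.

```python
# Hypothetical table of phonemes each letter may plausibly produce.
POSSIBLE = {
    "a": {"ae", "ey", "ax"},
    "n": {"n"},
}

def boundary_ok(sub, phonemes, location):
    """Check first/last phonemic parts against first/last characters of sub."""
    first_ok = phonemes[0] in POSSIBLE.get(sub[0], set())
    last_ok = phonemes[-1] in POSSIBLE.get(sub[-1], set())
    if location == "prefix":
        return last_ok
    if location == "suffix":
        return first_ok
    return first_ok and last_ok  # infix: both ends must be plausible

def make_rule(sub, location, subphone_entries, offset=-3):
    """subphone_entries: iterable of (phonemes, location, count) tuples.

    Returns (sub, location, phonemes) for the most frequent acceptable
    subphone entry, or None if no entry passes the checks.
    """
    candidates = [
        (count, phonemes)
        for phonemes, loc, count in subphone_entries
        if loc == location
        and boundary_ok(sub, phonemes, location)
        and len(phonemes) > len(sub) + offset  # assumed form of the length check
    ]
    if not candidates:
        return None
    _, phonemes = max(candidates, key=lambda c: c[0])  # highest occurrence count
    return (sub, location, phonemes)
```

Given the suffix "ation" and subphone entries (sh, ax, n) with count 3 and (ey, sh, ax, n) with count 2, the sketch rejects the more frequent entry because "sh" is not a plausible phoneme for the letter "a", and returns the rule mapping the suffix "ation" to ey, sh, ax, n.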