The present invention relates generally to text-to-speech (TTS) reading systems. More particularly, the invention relates to a concatenative reading system that produces high quality, naturally articulated speech by taking into account the prosodic environment of the words to be concatenated and also the phonological features of adjacent words to provide natural-sounding intonation. The system is particularly useful in reading numbers in tables, spreadsheets and the like.
In the process of data entry into a computer from written records, proofreading is a tiring and time-consuming task. The data entry operator must constantly shift the eyes between the computer screen and the paper originals. Sometimes, if two people are available, they can share the proofreading task: one person reading the data out loud from the paper originals and the other checking the entry on the computer screen.
This process of proofreading data entry can be facilitated through use of a speech synthesis system. Such a system allows the operator to keep the eyes on the paper originals while listening to what has been entered. The operator does not need another person to read the data from the paper originals because the speech synthesis system handles this aspect. Thus the operator can work alone. However, current speech synthesis systems are fatiguing to use, because speech quality is poor, lacking natural-sounding phrasing and intonation. User fatigue leads to errors. Hence current speech synthesis systems have proven deficient for critical proofreading applications. User fatigue is particularly prevalent in number reading systems, where a monotonous tone and poor phrasing lead to many errors.
The present invention provides a reading system that has a very natural voice with which the data entry operator can work without fatigue. The reading system employs a concatenative technique whereby digitally recorded speech samples are concatenated or joined together to produce the speech output. The invention achieves a more life-like output by incorporating two variables of natural speech: (1) prosodic or intonational variation and (2) variation due to coarticulation of each word's initial and final phonemes with the final and initial phonemes of adjacent words. For each use of a word, a set of prosodic and segmental environment rules is applied to select a contextually appropriate digital sample. The result is much more natural-sounding synthesized speech that does not induce fatigue. Operators using the system thus enjoy a much lower error rate.
The system of the invention captures what a human speaker does while proofreading. It reads numbers in a column or row, using a nonfinal intonation for all but the last entry. This intonation gives the listener a cue that the current number is not the final one in the column or row. This contextual cue is extremely helpful in proofreading, as the user is cued when the final number in the column or row is reached. This information is very valuable in detecting insertion and deletion errors, where the text on the computer screen and the text on the paper originals do not have the same number of entries due to data entry error.
The invention comprises a high-quality concatenative reading system for converting an input string into a sequence for subsequent audible synthesis. The invention includes a dictionary of words stored in a computer-readable storage medium and a word list generator coupled to the dictionary. The word list generator is receptive of the input string for building and storing entries in a word list within the computer's memory. The word list generator builds the word list from words stored in the dictionary to correspond to the input string. The generator has a set of stored rules for adding numeric placeholder words that correspond to integers in the input string. Thus the word list generator will insert the appropriate numeric placeholders so that the integer number "1,243" will be pronounced "one thousand, two hundred forty-three."
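By way of illustration only, the insertion of numeric placeholder words may be sketched as follows. The function names, the word tables and the supported range (0 to 999,999,999) are assumptions made for this example; they are not the stored rules of the word list generator itself.

```python
# Illustrative sketch only: names, tables and range are assumptions.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = {2: "twenty", 3: "thirty", 4: "forty", 5: "fifty",
        6: "sixty", 7: "seventy", 8: "eighty", 9: "ninety"}


def words_under_1000(n):
    """Return the word list for 0 < n < 1000 (empty list for 0)."""
    words = []
    if n >= 100:
        # "hundred" is a placeholder word not visible in the digit string.
        words += [ONES[n // 100], "hundred"]
        n %= 100
    if n >= 20:
        words.append(TENS[n // 10])
        n %= 10
    if n > 0:
        words.append(ONES[n])
    return words


def integer_to_word_list(text):
    """Convert an integer string such as "1,243" into a list of words."""
    n = int(text.replace(",", ""))
    if not 0 <= n < 1_000_000_000:
        raise ValueError("this sketch handles 0 through 999,999,999 only")
    if n == 0:
        return ["zero"]
    words = []
    for scale_value, scale_word in ((1_000_000, "million"), (1_000, "thousand")):
        if n >= scale_value:
            # Insert the placeholder word after the group it scales.
            words += words_under_1000(n // scale_value) + [scale_word]
            n %= scale_value
    return words + words_under_1000(n)


print(integer_to_word_list("1,243"))
```

In this sketch the words "thousand" and "hundred" are the placeholder words: they correspond to digit positions rather than to any character in the input string.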
The word list generator further includes a list of prosodic environment tokens that represent a plurality of intonation types. The word list generator assigns at least one of the prosodic environment tokens to at least some of the word list entries. The preferred embodiment assigns a prosodic environment token to each of the words in the word list.
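The assignment of prosodic environment tokens may be sketched, under simplifying assumptions, as follows. The token names "NONFINAL" and "FINAL" and the function name are hypothetical, and the sketch gives every word of an entry the same token; the actual list of intonation types may be richer.

```python
# Illustrative sketch only: token names and the tagging rule are assumptions.
def assign_prosodic_tokens(entries):
    """Tag each word of each entry in a column or row with a prosodic token.

    `entries` is a list of word lists, one word list per table entry.
    Every entry except the last receives the nonfinal token.
    """
    tagged = []
    last_index = len(entries) - 1
    for index, words in enumerate(entries):
        token = "FINAL" if index == last_index else "NONFINAL"
        tagged.append([(word, token) for word in words])
    return tagged


column = [["forty", "two"], ["seven"], ["nineteen"]]
for entry in assign_prosodic_tokens(column):
    print(entry)
```

Only the last entry of the column carries the final token, which is the cue that allows a listener to notice a missing or extra entry.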
The reading system also includes a database of speech samples stored in computer-readable memory. A phonological feature analyzer analyzes the word entries in the word list to determine the phonological environment of those words. Specifically, the preferred embodiment consults a phonological feature table to determine what each word begins with and ends with. These features are compared with those of adjacent words to determine the phonological environment of each word. In natural speech, phonemes are pronounced differently in different phonological contexts. The adjacent phonemes affect how a phoneme will sound when spoken. In this case, the invention concentrates on the beginning and ending phonemes, altering the pronunciation based on the words that precede and follow each word entry.
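A minimal sketch of the adjacent-word comparison is given below. The feature table, its feature labels (such as "stop" and "fricative") and the use of "silence" at the edges of the list are assumptions made for illustration and are not taken from the phonological feature table of the preferred embodiment.

```python
# Illustrative sketch only: the table contents and labels are assumptions.
# Each word maps to (feature it begins with, feature it ends with).
FEATURE_TABLE = {
    "one": ("glide", "nasal"),
    "two": ("stop", "vowel"),
    "six": ("fricative", "fricative"),
    "eight": ("vowel", "stop"),
}


def phonological_environments(words, table=FEATURE_TABLE):
    """Return (word, left_context, right_context) for each word.

    The left context is the ending feature of the preceding word and the
    right context is the beginning feature of the following word.
    """
    environments = []
    for i, word in enumerate(words):
        left = table[words[i - 1]][1] if i > 0 else "silence"
        right = table[words[i + 1]][0] if i + 1 < len(words) else "silence"
        environments.append((word, left, right))
    return environments


for item in phonological_environments(["two", "eight", "six"]):
    print(item)
```

Here the word "eight" is found to follow a vowel and precede a fricative, so a recording of "eight" made in a matching context could be preferred.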
Using the word list constructed by the word list generator, together with the prosodic environment information and phonological feature information, the reading system constructs a sample list from the database of speech samples. The sample list represents the actual sampled data that are concatenated to supply the sequence for audible synthesis. The sample list may be output through a digital-to-analog converter to produce an audible signal that may be amplified and played through a suitable speaker system.
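The construction of the sample list may be sketched as a lookup keyed on the word, its prosodic environment token and its phonological context. The key layout, the token and feature names, and the short integer lists standing in for digitized audio are all assumptions made for this example; a practical database would hold recorded waveforms and would need a fallback for contexts that were never recorded.

```python
# Illustrative sketch only: key layout and sample data are assumptions.
# Keys are (word, prosodic token, left context, right context); values are
# short integer lists standing in for digitized speech samples.
SAMPLE_DATABASE = {
    ("two", "NONFINAL", "silence", "vowel"): [10, 11],
    ("eight", "FINAL", "vowel", "silence"): [20, 21, 22],
}


def build_sample_list(annotated_words, database=SAMPLE_DATABASE):
    """Select one stored sample per annotated word, in order."""
    return [database[key] for key in annotated_words]


def concatenate(sample_list):
    """Join the selected samples into one sequence for playback."""
    output = []
    for sample in sample_list:
        output.extend(sample)
    return output


annotated = [
    ("two", "NONFINAL", "silence", "vowel"),
    ("eight", "FINAL", "vowel", "silence"),
]
print(concatenate(build_sample_list(annotated)))
```

The concatenated sequence is what would be passed to the digital-to-analog converter; the sketch stops short of audio output.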
For a more complete understanding of the invention, its objects and advantages, reference may be had to the following specification and to the accompanying drawings.