1. Field of the Invention
The present invention relates to speech synthesis systems and more particularly to algorithms and methods used to produce a viable speech rendition of text.
2. Description of the Prior Art
Phonology involves the study of speech sounds and the rule system for combining speech sounds into meaningful words. One must perceive and produce speech sounds and acquire the rules of the language used in one's environment. In American English, a blend of two consonants such as “s” and “t” is permissible at the beginning of a word, but blending the two consonants “k” and “b” is not; “ng” is not produced at the beginning of words; and “w” is not produced at the end of words (words may end in the letter “w” but not the sound “w”). Marketing experts demonstrate their knowledge of phonology when they coin words for new products; product names, if chosen correctly using phonological rules, are recognizable to the public as legitimate words. Slang also follows these rules. For example, the word “nerd” is recognizable as an acceptably formed noun.
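By way of illustration only, the sample rules above can be sketched as a small check over a sequence of sounds. The rule sets, the function name, and the informal sound symbols are hypothetical and cover only the examples given in the text; they are not a complete statement of English phonology.

```python
# Hypothetical, deliberately tiny rule sets: only the examples from the text.
FORBIDDEN_INITIAL_BLENDS = {"kb"}   # "k" + "b" may not begin a word
FORBIDDEN_INITIAL_SOUNDS = {"ng"}   # "ng" is not produced word-initially
FORBIDDEN_FINAL_SOUNDS = {"w"}      # the sound "w" is not produced word-finally


def violates_sample_rules(sounds):
    """Return True if the list of sound symbols breaks one of the sample rules."""
    if not sounds:
        return False
    if sounds[0] in FORBIDDEN_INITIAL_SOUNDS:
        return True
    if sounds[-1] in FORBIDDEN_FINAL_SOUNDS:
        return True
    if len(sounds) >= 2 and sounds[0] + sounds[1] in FORBIDDEN_INITIAL_BLENDS:
        return True
    return False


# The permitted blend "s" + "t" passes; the "k" + "b" blend does not.
print(violates_sample_rules(["s", "t", "o", "p"]))
print(violates_sample_rules(["k", "b", "a"]))
```

The check operates on sounds rather than letters, consistent with the observation that a word may end in the letter “w” but not in the sound “w”.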
Articulation usually refers to the actual movements of the speech organs that occur during the production of various speech sounds. Successful articulation requires (1) neurological integrity, (2) normal respiration, (3) normal action of the larynx (voice box or Adam's apple), (4) normal movement of the articulators, which include the tongue, teeth, hard palate, soft palate, lips, and mandible (lower jaw), and (5) adequate hearing.
Phonics involves interdependence between the three cuing systems: semantics, syntax, and grapho-phonics. In order to program words and use phonics as the tool for doing that, one has to be familiar with these relationships. Semantic cues (context: what makes sense) and syntactic cues (structure and grammar: what sounds right grammatically) are strategies the reader needs to be using already in order for phonics (letter-sound relationships: what looks right visually and sounds right phonetically) to make sense. Phonics proficiency by itself cannot elicit comprehension of text. While phonics is integral to the reading process, it is subordinate to semantics and syntax.
There are many types of letter combinations that need to be understood in order to fully understand how programming a phonics dictionary would work. In simple terms, the following letter-sound relationships need to be developed: beginning consonants, ending consonants, consonant digraphs (“sh,” “th,” “ch,” “wh”), medial consonants, consonant blends, long vowels and short vowels.
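As a minimal sketch of one such relationship, a spelled word can be divided into letter units in which the consonant digraphs named above are kept together as single units. The function name and the greedy left-to-right matching are assumptions made for illustration; the sketch does not handle words in which the two letters belong to different syllables.

```python
# The four consonant digraphs named in the text.
CONSONANT_DIGRAPHS = ("sh", "th", "ch", "wh")


def split_letter_units(word):
    """Split a word into letter units, keeping each listed digraph whole."""
    word = word.lower()
    units = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in CONSONANT_DIGRAPHS:
            units.append(pair)   # two letters, one unit
            i += 2
        else:
            units.append(word[i])
            i += 1
    return units


print(split_letter_units("ship"))
print(split_letter_units("which"))
```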
Speech and language pathologists generally call a speech sound a “phoneme”. Technically, it is the smallest sound segment in a word that we can hear and that, when changed, modifies the meaning of a word. For example, the words “bit” and “bid” have different meanings, yet they differ in their respective sounds by only the last sound in each word (i.e., “t” and “d”). These two sounds would be considered phonemes because they are capable of changing meaning. Speech sounds or phonemes are classified as vowels and consonants. The number of letters in a word and the number of sounds in a word do not always have a one-to-one correspondence. For example, in the word “squirrel”, there are eight letters, but there are only five sounds: “s”-“k”-“w”-“r”-“l.”
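The “bit”/“bid” comparison can be expressed, purely as an illustrative sketch, as a test of whether two sound sequences differ in exactly one position. The function name and the informal one-symbol-per-sound representation are assumptions and are not drawn from any particular phonetic alphabet.

```python
def differ_by_one_sound(first, second):
    """Return True if two equal-length sound sequences differ in exactly one position."""
    if len(first) != len(second):
        return False
    differences = sum(1 for a, b in zip(first, second) if a != b)
    return differences == 1


bit = ["b", "i", "t"]
bid = ["b", "i", "d"]

# "bit" and "bid" differ only in their last sound.
print(differ_by_one_sound(bit, bid))
```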
A “diphthong” is the sound that results when the articulators move from one vowel to another within the same syllable. Each one of these vowels and diphthongs is called a speech sound or phoneme. The vowel letters are a, e, i, o, u, and sometimes y; however, when words are broken up into sounds, these five or six vowel letters represent approximately 17 distinct vowel sounds. One should note that there are some variations in vowel usage due to regional or dialectal differences.
Speech-language pathologists often describe consonants by their place of articulation and manner of articulation as well as the presence or absence of voicing. Many consonant sounds are produced alike, except for the voicing factor. For instance, “p” and “b” are both bilabial stops. That is, the sounds are made with both lips and the flow of air in the vocal tract is completely stopped and then released at the place of articulation. It is important to note, however, that one type of consonant sound is produced with voicing (the vocal folds are vibrating) and the other type of consonant sound is produced without voicing (the vocal folds are not vibrating).
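One possible way to represent this classification, offered only as a sketch, is a small table recording place of articulation, manner of articulation, and voicing for each consonant. The table and the names below are hypothetical; the entries for “t” and “d” (alveolar stops) are added for illustration alongside the “p”/“b” pair described above.

```python
from typing import NamedTuple


class Consonant(NamedTuple):
    place: str     # place of articulation
    manner: str    # manner of articulation
    voiced: bool   # True if the vocal folds vibrate


# Hypothetical, partial table; only "p" and "b" come from the text.
CONSONANTS = {
    "p": Consonant("bilabial", "stop", False),
    "b": Consonant("bilabial", "stop", True),
    "t": Consonant("alveolar", "stop", False),
    "d": Consonant("alveolar", "stop", True),
}


def differ_only_in_voicing(first, second):
    """Return True if two consonants share place and manner but not voicing."""
    a, b = CONSONANTS[first], CONSONANTS[second]
    return a.place == b.place and a.manner == b.manner and a.voiced != b.voiced


print(differ_only_in_voicing("p", "b"))
print(differ_only_in_voicing("p", "d"))
```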
The concepts described above must be taken into account in order to enable a computer to generate speech which is understandable to humans. While computer-generated speech is known in the art, it often lacks the accuracy needed to render speech that is reliably understandable, or it relies on cumbersome implementations of the pronunciation rules of English (or of any other language). Other implementations require human annotation of the input text message to facilitate accurate pronunciation. The present invention has neither of these limitations.