1. Field
In linguistics, prosody is concerned with those elements of speech that are not individual phonetic segments (vowels and consonants) but are properties of syllables and larger units of speech. Such elements contribute to linguistic functions such as intonation, tone, stress, and rhythm. Prosody may reflect various features of the speaker or the utterance: the emotional state of the speaker; the form of the utterance (statement, question, or command); the presence of irony or sarcasm; emphasis, contrast, and focus; or other elements of language that may not be encoded by grammar or by choice of vocabulary. Prosody is neither completely universal nor automatic, but rather is expressed through the prosodic structure of each language.
2. Description of Related Art
Automatic speech recognition (ASR) can be defined as the independent, computer-driven transcription of spoken language into readable text in real time. In other words, ASR is technology that allows a computer to identify the words that a person speaks and convert the identified words to text.
Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech computer or speech synthesizer, and can be implemented in software or hardware. A text-to-speech (TTS) system converts normal language text into speech.
Synthesized speech can be created by concatenating pieces of recorded speech stored in a database. Systems differ in the size of the stored speech units. A system that stores phones or diphones provides the largest output range, but may lack clarity. For specific usage domains, the storage of entire words or sentences allows for high-quality output. Alternatively, a synthesizer can incorporate a model of the vocal tract and other human voice characteristics to create a completely “synthetic” voice output.
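By way of illustration only, the concatenative approach described above can be sketched as a lookup of stored units followed by joining their samples. The sketch below is a minimal example under stated assumptions, not a description of any particular synthesizer: the unit database `UNIT_DB`, its toy sample values, the `_` silence marker, and the function names `to_diphones` and `synthesize` are all hypothetical, and a practical system would typically also smooth the joins between units.

```python
from typing import Dict, List, Sequence

# Hypothetical unit database: diphone label -> recorded samples.
# The sample values are toy numbers, not real audio.
UNIT_DB: Dict[str, List[float]] = {
    "_-h": [0.0, 0.1],
    "h-ai": [0.2, 0.3, 0.4],
    "ai-_": [0.3, 0.1, 0.0],
}


def to_diphones(phones: Sequence[str]) -> List[str]:
    """Pad with silence ('_') and pair adjacent phones into diphone labels."""
    padded = ["_", *phones, "_"]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


def synthesize(phones: Sequence[str], db: Dict[str, List[float]]) -> List[float]:
    """Concatenate the stored samples for each diphone, in order."""
    samples: List[float] = []
    for label in to_diphones(phones):
        if label not in db:
            # A small database limits the output range: unseen units fail.
            raise KeyError(f"no recorded unit for diphone {label!r}")
        samples.extend(db[label])
    return samples


# Example: synthesize the phone sequence for a word like "hi".
waveform = synthesize(["h", "ai"], UNIT_DB)
```

Storing smaller units such as diphones lets a modest database cover many phone sequences, whereas storing whole words or sentences would replace the lookup key with the word or sentence itself and restrict output to the recorded inventory.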