Our voice is so much more than words: it carries intonations, intentions, and patterns, and these make us truly unique in our communication. Unfortunately, all of this is completely lost when we interact through today's speech recognition tools.
The term “Prosody” refers to the sound of syllables, words, phrases, and sentences produced by pitch variation in the human voice. Prosody may reflect various features of the speaker or the utterance: the emotional state of the speaker; the form of the utterance (statement, question, or command); the presence of irony or sarcasm; emphasis, contrast, and focus; or other elements of language that may not be encoded by the grammar or choice of vocabulary. In terms of acoustic attributes, the prosody of oral languages involves variation in syllable length, loudness, and pitch. In sign languages, prosody involves the rhythm, length, and tension of gestures, along with mouthing and facial expressions. Prosody is typically absent in writing, which can occasionally result in reader misunderstanding. Orthographic conventions to mark or substitute for prosody include punctuation (commas, exclamation marks, question marks, scare quotes, and ellipses), and typographic styling for emphasis (italic, bold, and underlined text).
Prosody features involve the magnitude, duration, and time-varying characteristics of the acoustic parameters of the spoken voice, such as: tempo (fast or slow), timbre or harmonics (few or many), pitch level and in particular pitch variations (high or low), envelope (sharp or round), pitch contour (up or down), amplitude and amplitude variations (small or large), tonality mode (major or minor), and rhythmic or non-rhythmic behavior.
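As an illustration only, two of the acoustic parameters listed above, amplitude and pitch level, can be estimated from a short frame of samples. The following is a minimal sketch and not the method of any reference cited here: it assumes a mono frame of floating-point samples, uses root-mean-square energy for amplitude and a plain autocorrelation peak for pitch, and the function names, the 75 Hz to 400 Hz search range, and the synthetic test tone are all hypothetical choices made for the example.

```python
import math

def rms_amplitude(frame):
    """Root-mean-square amplitude of one frame of samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def estimate_pitch_hz(frame, sample_rate, fmin=75.0, fmax=400.0):
    """Estimate pitch from the autocorrelation peak within [fmin, fmax].

    Returns 0.0 when no positive autocorrelation peak is found
    (for example, for a silent frame).
    """
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_val = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation of the frame at this lag.
        val = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    return sample_rate / best_lag if best_lag else 0.0

# Hypothetical input: 50 ms of a 200 Hz tone sampled at 8 kHz.
sample_rate = 8000
frame = [0.5 * math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(400)]

pitch = estimate_pitch_hz(frame, sample_rate)
level = rms_amplitude(frame)
```

Tracking these two values frame by frame would give the pitch contour and amplitude variations mentioned above. Real speech would normally need windowing, voicing detection, and a more robust pitch tracker than this sketch provides.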
Existing technologies already make use of prosodic information.
EP 1927979 B1 discloses a voice recognition apparatus for recognizing an input voice on the basis of prosody characteristics of the voice. Prosodic information is used as an additional feature, alongside the spectrum, for pattern matching.
U.S. Pat. No. 8,566,092 B2 discloses a method and apparatus for extracting a prosodic feature and applying the prosodic feature by combining with a traditional acoustic feature.
Dynamic time warping (DTW) is an approach that was historically used for speech recognition but has now largely been displaced by the more successful HMM-based approach. Dynamic time warping is an algorithm for measuring similarity between two sequences that may vary in time or speed. For instance, similarities in walking patterns would be detected even if in one video the person was walking slowly and in another he or she was walking more quickly, or even if there were accelerations and decelerations during the course of one observation. DTW has been applied to video, audio, and graphics; indeed, any data that can be turned into a linear representation can be analyzed with DTW.
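The idea of comparing sequences that differ in speed can be sketched with the textbook dynamic-programming form of DTW. This is a minimal illustration, assuming one-dimensional numeric sequences and an absolute-difference local cost; the function name and the sample sequences are hypothetical, and practical recognizers typically compare multi-dimensional feature vectors and add path constraints.

```python
def dtw_distance(a, b):
    """Cumulative DTW alignment cost between two numeric sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = local + min(cost[i - 1][j],      # a advances alone
                                     cost[i][j - 1],      # b advances alone
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]

# The same rising pattern produced slowly and quickly, plus a falling one.
slow = [0, 0, 1, 1, 2, 2, 3, 3]
fast = [0, 1, 2, 3]
falling = [3, 2, 1, 0]

same_shape = dtw_distance(slow, fast)
different_shape = dtw_distance(fast, falling)
```

Because the warping path may repeat elements of either sequence, the slow and fast versions of the same pattern align at zero cost, while the falling pattern does not.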
The article entitled: “Speech segmentation without speech recognition” by Dong Wang, Lie Lu and Hong-Jiang Zhang, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003, Volume: 1, presents a semantic speech segmentation approach, in particular sentence segmentation, without speech recognition. In order to get phoneme-level information without word recognition information, a novel vowel/consonant/pause (V/C/P) classification is proposed. An adaptive pause detection method is also presented to adapt to various backgrounds and environments. Three feature sets, namely pause, rate of speech, and prosody, are used to discriminate the sentence boundary.
The article entitled: “Cross-words Reference Template for DTW-based Speech Recognition Systems” by Waleed H. Abdulla, David Chow, and Gary Sin of the Electrical and Electronic Engineering Department of ‘The University of Auckland’, Auckland, New Zealand, published in TENCON 2003 Conference on Convergent Technologies for the Asia-Pacific Region, Volume: 4, discloses a simple novel technique for preparing reliable reference templates to improve the recognition rate score. The developed technique produces templates called crosswords reference templates (CWRTs).
Various extraction techniques may be used for extracting features from a voice. Using Mel Frequency Cepstral Coefficients (MFCC) and Dynamic Time Warping (DTW) for feature extraction as part of a voice recognition algorithm is described in the article entitled “Voice Recognition Algorithm using Mel Frequency Cepstral Coefficient (MFCC) and Dynamic Time Warping (DTW) Techniques” by Lindasa Muda, Mumtaj Begham, and I. Elamvazuthi, published in Journal of Computing, Vol. 2, Issue 3, March 2010.
In consideration of the foregoing, it would be an advancement in the art to provide a system and method that support prosody feature extraction and use the extracted prosody features in an integrated pattern-classification and prosody-detection engine, putting constraints on the recognizer search space and thereby improving speech or voice recognition.