A prosodic feature, also called a “prosodeme feature” or “supra-segmental feature”, such as the tone, intonation, stress, length, intensity, pitch, locution, accent or the like of a speaker, is a feature at the supra-segmental level. At present, there is extensive research on prosodic features in philology and speech synthesis, focused mainly on stress, length and pitch, which are typically described in terms of fundamental frequency and duration.
For example, in “The Influence of Correspondence between Accentuation and Information Structure on Discourse Comprehension” by LI Xiaoqing, et al., Acta Psychologica Sinica, Issue 1, 2005 and “Studies on Speech Prosody” by YANG Yufang, et al., Advances in Psychological Science, Volume 14, Issue 4, 2006, a series of studies on Chinese prosodic features is carried out in terms of perception, cognition and corpus. For perception, the prosodic hierarchies that can be distinguished perceptually and the relevant acoustic cues are analyzed with experimental psychology and a perception-labeling corpus analysis method; the results show that the prosodic boundaries in discourse that can be distinguished perceptually are those of clauses, sentences and paragraphs, and identify the perceptually relevant acoustic cues. For cognition, the role of the prosodic feature in discourse comprehension is studied, and the influence of prosody on information integration and pronoun comprehension in discourse is studied using the experimental psychology method and electroencephalogram indices, thereby disclosing the cognitive and neural mechanisms of that role. For the corpus, based on a labeled corpus, the regular distribution of stress in sentences and the relation between information structure and stress in discourse are studied using conventional statistical methods, and rules for determining prosodic phrase boundaries and focuses from text information are studied using a decision tree method. This research therefore demonstrates the influence of the prosodic feature at the perceptual level. However, since the research is conducted from the viewpoint of philological grammar analysis, it is limited by the language studied, and it does not describe how to extract the prosodic feature.
Furthermore, in “Study of Data-Driven Hierarchical Prosody Generation Model for Chinese Sentence Utterance” by Tian Lan, et al., Control and Decision, Volume 18, Issue 6, 2003, with respect to the characteristics of Chinese pronunciation, a large amount of fundamental-frequency profile data of natural Chinese sentences is analyzed statistically from the viewpoint of fundamental frequency, and, in combination with the parameters of duration and gain, prosody information of Chinese in terms of mood, phrase rhythm, tone of prosodic words and stress is studied. In this research, various parameters can be trained and labeled in accordance with the classification of language knowledge. However, it is difficult to combine the obtained information on rhythm, stress, mood and the like effectively with the acoustic features currently predominant in speech signal processing, such as MFCC (Mel Frequency Cepstral Coefficient), LPCC (Linear Prediction Cepstrum Coefficient), LSF (Line Spectrum Frequency) and so on.
Additionally, in “Study on Calculability of Chinese Prosodic Feature” by Cai Lianhong, et al., The Proceeding of 5th National Conference on Modern Phonetics, 2001, the quantitative representation of fundamental frequency is studied, and perception experiments on the average value and the pitch range of the fundamental frequency are carried out. The results show that a change in pitch range has a less significant influence on auditory perception than a change in the average value. Meanwhile, the fundamental frequency, duration and pitch range are used as basic parameters to evaluate one syllable, and the stress is studied intensively. Although this research attempts to study the calculability of prosody, the experiment is still based on philological analysis, and analyzes the stress using only the fundamental frequency, duration and signal amplitude. Therefore, such stress characterization requires manually labeled data; it can neither be generated automatically nor be applied in combination with acoustic features such as MFCC, LPCC and LSF.
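By way of illustration only, the three basic parameters mentioned above for evaluating one syllable (fundamental frequency, duration and pitch range) may be computed from a frame-wise fundamental-frequency contour as sketched below. The function name `syllable_prosody`, the 10 ms frame shift, the use of 0 to mark unvoiced frames, and the expression of the pitch range in semitones are assumptions made for this sketch; they are not taken from the cited work.

```python
import math


def syllable_prosody(f0_hz, frame_shift_s=0.01):
    """Summarize one syllable from its frame-wise F0 contour.

    f0_hz: F0 values in Hz, one per frame; 0 marks an unvoiced frame
           (an assumption of this sketch).
    frame_shift_s: time between consecutive frames, in seconds (assumed 10 ms).
    """
    voiced = [f for f in f0_hz if f > 0]
    if not voiced:
        raise ValueError("syllable contains no voiced frames")

    # Average value of the fundamental frequency over voiced frames.
    mean_f0_hz = sum(voiced) / len(voiced)

    # Pitch range, expressed here in semitones between lowest and highest F0.
    pitch_range_st = 12.0 * math.log2(max(voiced) / min(voiced))

    # Duration of the whole syllable, voiced and unvoiced frames included.
    duration_s = len(f0_hz) * frame_shift_s

    return {
        "mean_f0_hz": mean_f0_hz,
        "pitch_range_st": pitch_range_st,
        "duration_s": duration_s,
    }
```

Such a summary yields only a few scalar values per syllable and still depends on syllable boundaries supplied from elsewhere, which is consistent with the limitation noted above regarding manually labeled data.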
At present, how to characterize and automatically calculate the prosodic feature, and how to combine it effectively with the acoustic features predominant in speech signal processing, such as MFCC, LPCC, LSF and the like, is a challenge in prosody research and an urgent problem to be solved.