A recent sophistication and downsizing of a computer allows the speech synthesizing technology to be installed and used in various devices such as a car navigation device, a mobile phone, a PC (Personal computer), a robot, etc. Widespread use of this technology in various devices finds applications in a variety of environments where a speech synthesizing device is used.
In a conventional, commonly-used speech synthesizing device, the processing result of prosody (for example, pitch frequency pattern, amplitude, duration time length) generation, unit waveform (for example, waveform having the length of about pitch length or syllabic sound time length extracted from a natural speech) selection, and waveform generation is basically determined uniquely for a phonetic symbol sequence (text analysis result including reading, syntax/part-of-speech information, accent type, etc.). That is, a speech synthesizing device always performs speech synthesizing in the same utterance form (volume, phonation speed, prosody, and voice tone of a voice) in any situation or environment.
However, the actual observation of human's phonation indicates that, even when the same text is spoken, the utterance form is controlled by the speaker's situation, emotion, or intention. Therefore, a conventional speech synthesizing device, which always uses the same utterance form, does not necessarily make the best use of the characteristics of a speech that is one of communication media.
To solve the problem with a speech synthesizing device like this, an attempt is made to generate a synthesized speech suited for the user environment and to improve the user's usability by dynamically changing the prosody generation and the unit waveform selection according to the user environment (situation and environment of the place where the user of the speech synthesizing device is present). For example, Patent Document 1 discloses the configuration of a speech synthesizing system that selects the control rule for the prosody and phoneme according to the information indicating the light level of the user environment or the user's position.
Patent Document 2 discloses the configuration of a speech synthesizing device that controls the consonant power, pitch frequency, and sampling frequency based on the power spectrum and frequency distribution information on the ambient noises.
In addition, Patent Document 3 discloses the configuration of a speech synthesizing device that controls the phonation speed, pitch frequency, sound volume, and voice quality based on various types of clocking information including the time of day, date, and day of week.
Non-Patent Documents 1-3 that disclose the music signal analysis and search method, which constitute the background technology of the present invention, are given below. Non-Patent Document 1 discloses a genre estimation method that analyzes the short-time amplitude spectrum and the discrete wavelet conversion coefficients of music signals to find musical characteristics (instrument configuration, rhythm structure) for estimating the musical genre.
Non-Patent Document 2 discloses a genre estimation method that estimates the musical genre from the mel-frequency cepstrum coefficients of the music signal using the tree-structured vector quantization method.
Non-Patent Document 3 discloses a method that calculates the similarity using the spectrum histograms for retrieving the musical signal.
Patent Document 1:
Japanese Patent No. 3595041
Patent Document 2:
Japanese Patent Publication Kokai JP-A-11-15495
Patent Document 3:
Japanese Patent Kokai Publication JP-A-11-161298
Non-Patent Document 1:
Tzanetakis, Essl, Cook: “Automatic Musical Genre Classification of Audio Signals”, Proceedings of ISMIR 2001, pp. 205-210, 2001.
Non-Patent Document 2:
Hoashi, Matsumoto, Inoue: “Personalization of User Profiles for Content-based Music Retrieval Based on Relevance Feedback”, Proceedings of ACM Multimedia 2003, pp. 110-119, 2003.
Non-Patent Document 3:
Kimura et al.: “High-Speed Retrieval of Audio and Video In Which Global Branch Removal Is Introduced”, Journal of The Institute of Electronics, Information and Communication Engineers, D-II, Vol. J85-D-II, No. 10, pp. 1552-1562, October, 2002