In recent years, advances in speech synthesis technology have made it possible to generate synthetic speech with very high sound quality.
However, conventional applications of synthetic speech have been mostly limited to uses such as reading news text aloud in a broadcaster-like voice.
Meanwhile, in mobile telephone services and the like, speech having a distinctive feature (synthetic speech that closely reproduces an individual's voice, or synthetic speech whose prosody and voice quality have characteristic features such as a high-school-girl manner of speaking or a western Japan dialect) has begun to be distributed as content. For example, a service is provided in which a message spoken by a famous person is used instead of a ring tone. To make communication between individuals more entertaining in this way, demand for generating speech having such a feature and presenting it to a listener is expected to grow.
Methods of synthesizing speech are broadly classified into the following two types: a waveform connection speech synthesis method, which selects appropriate speech elements from prepared speech element databases and connects the selected speech elements to synthesize speech; and an analytic-synthetic speech synthesis method, which analyzes speech and synthesizes speech based on parameters generated by the analysis.
To vary the voice quality of synthetic speech as described above, the waveform connection speech synthesis method needs a speech element database for each required voice quality and must connect speech elements while switching among these databases. Generating synthetic speech with various voice qualities in this way is therefore very costly.
On the other hand, the analytic-synthetic speech synthesis method can convert the voice quality of synthetic speech into another voice quality by converting the analyzed speech parameters.
There is also a method of converting voice quality using a speaker adaptation technology. In this method, voice quality conversion is achieved by preparing voice features of other speakers and adapting the analyzed voice parameters to those features.
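As a rough illustration of converting voice quality by modifying analyzed parameters, the sketch below shifts every analyzed frame so that the frame average matches a target speaker's average. This mean shift is only a simplified stand-in chosen for illustration and is not the speaker adaptation technology referred to above. The function names, the two-dimensional frames, and the `strength` parameter are all assumptions.

```python
from statistics import fmean


def mean_vector(frames):
    """Per-dimension mean of a list of equal-length parameter frames."""
    return [fmean(dim) for dim in zip(*frames)]


def adapt_frames(source_frames, target_mean, strength=1.0):
    """Shift every source frame toward the target speaker's mean.

    strength=0.0 leaves the frames unchanged; strength=1.0 makes the
    mean of the converted frames equal to target_mean (hypothetical knob).
    """
    source_mean = mean_vector(source_frames)
    offset = [strength * (t - s) for t, s in zip(target_mean, source_mean)]
    return [[x + o for x, o in zip(frame, offset)] for frame in source_frames]


# Hypothetical analyzed parameters: two frames, two dimensions each.
source_frames = [[1.0, 2.0], [3.0, 4.0]]
target_mean = [4.0, 1.0]  # assumed voice feature of another speaker

converted = adapt_frames(source_frames, target_mean)
print(converted)  # [[3.0, 0.0], [5.0, 2.0]]
```

A mean shift preserves the frame-to-frame variation of the original speech and changes only its average, which is why it serves here as the simplest possible example of parameter conversion.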
To change the voice quality of a voice, the user must designate, by some method, the desired voice quality to which the original voice is to be converted. One such method is to have the user designate the desired voice quality using a plurality of sense-axis sliders as shown in FIG. 1. However, it is difficult for a user without sufficient background knowledge of phonetics to designate the desired voice quality by adjusting such sliders, because the user has difficulty verbalizing the desired voice quality in sense words. For example, in the example of FIG. 1, the user needs to adjust each slider axis with the desired voice quality in mind, for instance, “about 30 years old, very feminine, but rather gloomy and emotionless, . . . ”, and such adjustment is difficult for a user without that background knowledge. In addition, it is also difficult to predict the voice quality that a given state of the sliders will produce.
Meanwhile, in everyday life, when a user hears a voice of unfamiliar voice quality, the user usually describes it in terms of specific individuals the user knows, for example, “similar to Mr./Ms. X's voice, but a bit like Mr./Ms. Y's voice”, where X and Y are individuals the user actually knows. This suggests that the user can intuitively designate a desired voice quality by combining voice qualities of specific individuals (namely, voice qualities of individuals having certain features).
If the user edits voice quality by combining voice qualities of specific individuals previously held in a system as described above, it is vital to present the held voice qualities in an easily understandable manner. Voice quality conversion based on a speaker adaptation technology is then performed using the voice features of the edited voice, thereby generating synthetic speech having the user's desired voice quality.
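The idea of combining voice qualities of specific individuals can be pictured as a weighted mix of stored voice features. The following is a minimal sketch under stated assumptions: each held voice quality is represented by a short numeric feature vector, and the mix is a normalized weighted average. The names `voices`, `mix_voice_features`, and the feature values are hypothetical and do not come from the text above.

```python
def mix_voice_features(voices, weights):
    """Weighted average of the voice features named in `weights`.

    voices  : dict mapping a speaker name to a feature vector (list of floats)
    weights : dict mapping a speaker name to a non-negative mixing weight
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    dim = len(next(iter(voices.values())))
    mixed = [0.0] * dim
    for name, weight in weights.items():
        for i, value in enumerate(voices[name]):
            mixed[i] += (weight / total) * value
    return mixed


# Hypothetical held voice qualities (e.g. average pitch in Hz, a breathiness score).
voices = {"X": [100.0, 0.5], "Y": [200.0, 1.5]}

# "Similar to X's voice, but a bit like Y's voice": mostly X, some Y.
mixed = mix_voice_features(voices, {"X": 3, "Y": 1})
print(mixed)  # [125.0, 0.75]
```

The mixed vector would then serve as the target voice feature for whatever voice quality conversion is used downstream.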
Here, Patent Reference 1 discloses a method of presenting a user with sound information registered in a database and having the user select an item from it. Specifically, Patent Reference 1 discloses a method of having a user select a desired sound effect from various sound effects. In the method of Patent Reference 1, the registered sound effects are arranged in an acoustical space based on acoustic features and sense information, and icons each associated with the acoustic feature of the corresponding sound effect are presented.
FIG. 2 is a block diagram of a structure of an acoustic browsing device disclosed in Patent Reference 1.
The acoustic browsing device includes an acoustic data storage unit 1, an acoustical space coordinate data generation unit 2, an acoustical space coordinate data storage unit 3, an icon image generation unit 4, an acoustic data display unit 5, an acoustical space coordinate receiving unit 6, a stereophony reproduction processing unit 7, and an acoustic data reproduction unit 8.
The acoustic data storage unit 1 stores, as a set, the acoustic data itself, an icon image to be used in displaying the acoustic data on a screen, and an acoustic feature of the acoustic data. The acoustical space coordinate data generation unit 2 generates coordinate data of the acoustic data in an acoustical space to be displayed on the screen, based on the acoustic feature stored in the acoustic data storage unit 1. That is, the acoustical space coordinate data generation unit 2 calculates the position at which the acoustic data is to be displayed in the acoustical space.
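The text above does not state how the coordinates are computed from the acoustic features. Purely as an illustration, the sketch below assumes that each sound has two numeric features and maps them to screen coordinates by min-max scaling. The feature choice, the screen size, and the function name `to_screen_coords` are all assumptions, not details of Patent Reference 1.

```python
def to_screen_coords(features, width=800, height=600):
    """Map two-dimensional acoustic features to screen positions.

    features : dict mapping a sound name to a (feature_1, feature_2) tuple
    Returns a dict mapping each sound name to an (x, y) position.
    """
    xs = [f[0] for f in features.values()]
    ys = [f[1] for f in features.values()]

    def scale(value, low, high, size):
        # Place everything in the middle when all values are identical.
        if high == low:
            return size / 2
        return (value - low) / (high - low) * size

    return {
        name: (
            scale(f[0], min(xs), max(xs), width),
            scale(f[1], min(ys), max(ys), height),
        )
        for name, f in features.items()
    }


# Hypothetical features: (spectral centroid in Hz, a "sharpness" sense score).
sounds = {"bell": (2000.0, 0.9), "thud": (100.0, 0.1), "click": (1050.0, 0.5)}
coords = to_screen_coords(sounds)
print(coords["bell"], coords["thud"])  # (800.0, 600.0) (0.0, 0.0)
```

Because the mapping depends only on the stored features, every user sees the same layout, which is the limitation discussed for this kind of standardized arrangement.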
The icon image to be displayed on the screen is generated by the icon image generation unit 4 based on the acoustic feature. In more detail, the icon image is generated based on the spectrum distribution and a sense parameter of the sound effect.
In Patent Reference 1, arranging the respective sound effects in a space in this way makes it easy for the user to designate a desired sound effect. However, the coordinates at which the sound effects are presented are determined by the acoustical space coordinate data generation unit 2, and the determined coordinates are therefore standardized. This means that the acoustical space does not always match the user's sense.
On the other hand, in the field of data display processing systems, Patent Reference 2 discloses a method of modifying an importance degree of information depending on a user's input. The data display processing system disclosed in Patent Reference 2 displays information held in the system while changing the display size of the information depending on its importance degree. The data display processing system receives a modified importance degree from a user, and then modifies, based on the modified information, a weight to be used to calculate the importance degree.
FIG. 3 is a block diagram of a structure of the data display processing system of Patent Reference 2. As shown in FIG. 3, an edit processing unit 11 is a processing unit that performs edit processing on a set of data elements, each of which is a meaningful unit of data to be displayed. An edit data storage unit 14 is a storage device that stores the documents and illustration data to be edited and displayed. A weighting factor storage unit 15 is a storage device that stores a plurality of predetermined weighting factors to be used in combining basic importance degree functions. An importance degree calculation unit 16 is a processing unit that calculates an importance degree of each data element to be displayed by applying a function generated by combining the basic importance degree functions based on the weighting factors. A weighting draw processing unit 17 is a processing unit that decides a display size, or whether display is permitted or prohibited, for each of the data elements according to the calculated importance degrees, then lays out the data elements for display, and finally generates display data. A display control unit 18 controls a display device 20 to display the display data generated by the weighting draw processing unit 17. The edit processing unit 11 includes a weighting factor change unit 12 that changes, based on an input from an input device 19, the weighting factor that is stored in the weighting factor storage unit 15 and is associated with a corresponding basic importance degree function. The data display processing system also includes a machine-learning processing unit 13. The machine-learning processing unit 13 automatically changes, by learning, the weighting factors stored in the weighting factor storage unit 15, based on operation information that is notified from the edit processing unit 11 and includes display size changes and the like instructed by a user.
Depending on the importance degrees of the data elements, the weighting draw processing unit 17 performs visible weighting draw processing, binary size weighting draw processing, proportion size weighting draw processing, or any combination of these.
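The importance degree calculation and the three kinds of weighting draw processing can be pictured with the sketch below. It assumes that the basic importance degree functions are combined as a linear weighted sum, and it reads the three draw modes from their names as show/hide by threshold, a choice between two sizes, and a size proportional to the importance degree. The combination form, the thresholds, the sizes, and the two example functions are assumptions, not details confirmed for Patent Reference 2.

```python
def importance(element, basic_functions, weights):
    """Combine basic importance degree functions with weighting factors
    (assumed here to be a linear weighted sum)."""
    return sum(w * f(element) for f, w in zip(basic_functions, weights))


def visible_weighting(score, threshold=0.5):
    """Display permission/prohibition: show the element only above a threshold."""
    return score >= threshold


def binary_size(score, threshold=0.5, small=10, large=20):
    """Choose one of two display sizes."""
    return large if score >= threshold else small


def proportional_size(score, base=10, scale=20):
    """Display size that grows in proportion to the importance degree."""
    return base + scale * score


# Hypothetical basic importance degree functions over a data element.
def recency(element):
    return element["recency"]


def edit_activity(element):
    return element["edits"]


element = {"recency": 1.0, "edits": 0.5}
weights = [0.5, 0.5]  # weighting factors, adjustable from user input

score = importance(element, [recency, edit_activity], weights)
print(score, visible_weighting(score), binary_size(score), proportional_size(score))
# 0.75 True 20 25.0
```

Changing the entries of `weights` corresponds to what the weighting factor change unit 12 or the machine-learning processing unit 13 would do: the same element then receives a different importance degree and hence a different display treatment.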
Patent Reference 1: Japanese Unexamined Patent Application Publication No. 2001-5477
Patent Reference 2: Japanese Unexamined Patent Application Publication No. 6-130921