1. Field of the Invention
The present invention generally relates to text-to-speech (TTS) technology. More particularly, the present invention relates to a method and system for performing speech synthesis of textual content at a client.
2. Description of Related Art
A text-to-speech (TTS) system is a widely used technology that lets people access information via speech. A typical application converts textual content that a user accesses via the Internet into speech at a client such as a desktop computer, a laptop computer or a handheld device such as a mobile phone, a personal digital assistant or the like. Thus, the user can get information without reading the text. For such an application, the user needs to load a TTS system onto the client. Increasingly, users download a TTS system via the Internet instead of using a copy recorded on a storage medium.
Currently, most TTS systems perform speech synthesis based on the selection and concatenation of acoustical units. This approach requires a large number of acoustical units in order to produce satisfactory speech. For example, an IBM Chinese TTS system uses a corpus of 25,000 sentences, about 4 GB in size, to synthesize good-quality speech. These acoustical units can be compressed to 200 MB with speech coding algorithms without degrading the speech quality too much. However, 200 MB is still a large amount of speech data for users to download over a network at one time, and users have to wait quite a long time before they can begin to use it.
In view of the problem outlined above, there have been proposals to cut the corpus down as far as possible to obtain a smaller TTS system, e.g. 20 MB, provided that it can still synthesize various textual contents with acceptable speech quality. In this case, users only need to wait a very short time (for example, the time for downloading 20 MB of data) before beginning to use the TTS system. However, since the corpus of the downloaded TTS system is limited, the speech synthesis quality that users obtain while using the TTS system is rather poor. From the standpoint of user psychology, such poor synthesis quality might be acceptable for a short time at the beginning but becomes unsatisfactory after prolonged use.
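To put the corpus sizes quoted above in perspective, the following minimal sketch estimates download times for the 4 GB, 200 MB and 20 MB cases. The link speed is a purely hypothetical assumption (the text does not state one), 1 GB is taken as 1000 MB, and protocol overhead is ignored.

```python
# Corpus sizes taken from the text, in megabytes (1 GB treated as 1000 MB).
SIZES_MB = {
    "uncompressed corpus": 4000,
    "compressed corpus": 200,
    "reduced corpus": 20,
}

# Hypothetical link speed in megabits per second; not stated in the text.
ASSUMED_LINK_MBIT_PER_S = 2.0


def download_seconds(size_mb: float, link_mbit_per_s: float) -> float:
    """Return the ideal transfer time in seconds, ignoring overhead."""
    return size_mb * 8 / link_mbit_per_s


def compression_ratio(raw_mb: float, compressed_mb: float) -> float:
    """Return how many times smaller the compressed data is."""
    return raw_mb / compressed_mb


if __name__ == "__main__":
    for name, size_mb in SIZES_MB.items():
        seconds = download_seconds(size_mb, ASSUMED_LINK_MBIT_PER_S)
        print(f"{name}: {size_mb} MB, about {seconds / 60:.1f} minutes")
    ratio = compression_ratio(SIZES_MB["uncompressed corpus"],
                              SIZES_MB["compressed corpus"])
    print(f"compression ratio: {ratio:.0f}x")
```

Under the assumed link speed, the reduced corpus downloads ten times faster than the compressed one, which is the trade-off against synthesis quality described above.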
European patent application WO06128480A1 discloses a method and system for providing speech synthesis on user terminals over a communication network. In this application, a basic database for speech synthesis is first downloaded to a user terminal, and multiple incremental corpus databases are generated ahead of time on the TTS server side according to possible topics, e.g. economics, sports, comics and so on. When a user accesses a textual content with this TTS system, the system extracts the topic of the textual content, selects the corresponding incremental corpus database according to the topic and adds that incremental corpus database to the basic database on the user client for speech synthesis of the textual content. Compared with the previous solution, this solution enables users to download a smaller TTS system quickly and begin using it soon. With this solution, incremental databases can be added little by little, so that the speech synthesis quality improves continuously and user satisfaction is enhanced.
Under this solution, each client needs to assign one of the existing contexts (topics) (e.g. economics, sports, comics and so on) to the text to be synthesized, select the corresponding incremental corpus database existing on the TTS server side, and then download that incremental corpus database.
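The topic-based flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the method of the cited application: the keyword-counting topic classifier, the names `TOPIC_KEYWORDS`, `SERVER_INCREMENTAL_DBS`, `extract_topic` and `prepare_client`, and the unit identifiers are all hypothetical.

```python
from typing import Optional, Set

# Hypothetical keyword lists used to assign a topic to a text.
TOPIC_KEYWORDS = {
    "economics": {"market", "inflation", "bank"},
    "sports": {"match", "swimming", "basketball"},
}

# Hypothetical incremental corpus databases pre-generated on the server,
# one per topic, each modelled as a set of acoustical-unit identifiers.
SERVER_INCREMENTAL_DBS = {
    "economics": {"unit_e1", "unit_e2"},
    "sports": {"unit_s1", "unit_s2", "unit_s3"},
}


def extract_topic(text: str) -> Optional[str]:
    """Pick the topic whose keywords occur most often; None if none match."""
    words = set(text.lower().split())
    scores = {topic: len(words & kws) for topic, kws in TOPIC_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def prepare_client(text: str, client_units: Set[str]) -> Set[str]:
    """Return the client database after adding the topic's incremental DB."""
    topic = extract_topic(text)
    if topic is None:
        return set(client_units)  # fall back to the basic database only
    return client_units | SERVER_INCREMENTAL_DBS[topic]


if __name__ == "__main__":
    basic_db = {"unit_b1", "unit_b2"}
    updated = prepare_client("The basketball match ended late", basic_db)
    print(sorted(updated))
```

Note that the whole pre-generated database for the topic is downloaded, regardless of which units the particular text actually needs.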
The technical solution disclosed by European patent application WO06128480A1 has some limitations, because during actual speech synthesis, contents with a similar context (topic) might require completely different sets of acoustical units (syllables). For example, a text with the topic of sports might be about swimming or about basketball, yet these two actual contexts differ enormously in the acoustical units needed for speech synthesis. Therefore, assigning a specific context (topic) to the text in order to download a corpus pre-generated for that context is inaccurate for TTS systems based on the selection and concatenation of units. Accordingly, downloading a corpus database according to a topic will not enable a client to effectively improve the speech synthesis quality, and users might still be unsatisfied with the resulting improvement.
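The mismatch between topic and required units can be illustrated with a toy coverage calculation. In this sketch, whitespace-separated words stand in for acoustical units (a simplifying assumption; real systems work with syllables or smaller units), and the two sample sentences and the function names are invented for illustration.

```python
from typing import Set


def units(text: str) -> Set[str]:
    """Approximate the units a text needs by its lowercase words."""
    return set(text.lower().split())


def coverage(required: Set[str], available: Set[str]) -> float:
    """Return the fraction of required units present in the corpus."""
    if not required:
        return 1.0
    return len(required & available) / len(required)


# Two invented texts that would both be labelled with the topic "sports".
BASKETBALL_TEXT = "the guard passed the ball to the center"
SWIMMING_TEXT = "the swimmer finished the freestyle lap"

# Hypothetical "sports" corpus that happens to be built from basketball text.
SPORTS_CORPUS = units(BASKETBALL_TEXT)

if __name__ == "__main__":
    print(coverage(units(BASKETBALL_TEXT), SPORTS_CORPUS))
    print(coverage(units(SWIMMING_TEXT), SPORTS_CORPUS))
```

Although both texts share a topic, the corpus covers all units of the basketball text but only one of the five distinct units of the swimming text, which mirrors the limitation described above.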
Therefore, there is a need for a text-to-speech method and system that not only allow a user to download and begin using a TTS system in a short time but also effectively improve the speech synthesis quality as the user continues to use the system, thereby enhancing the text-to-speech service performance of the system.