There is a known speech synthesis system that includes a server device and a client device (see, for example, JP2003-23386). The server device is configured to store speech element information representing the respective speech elements included in speech uttered by a user (a speech registering user). The client device is configured to execute a speech synthesis process, that is, to generate speech information by converting text into speech based on text information representing the text.
Based on input text information, this client device generates speech element specification information (for example, information representing a phoneme and a prosody) that specifies a speech element. Then, the client device transmits the generated speech element specification information to the server device.
The server device stores speech element information and speech element specification information in advance, in association with each other. The server device transmits, to the client device, the speech element information stored in association with the speech element specification information received from the client device. Then, the client device executes a speech synthesis process based on the speech element information received from the server device.
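The exchange described above can be sketched as follows. This is only an illustrative model and is not taken from JP2003-23386: the names ElementSpec, Server, Client, lookup, text_to_specs, and synthesize are hypothetical, speech element information is modeled as raw bytes, and the text analysis is a toy stand-in that emits one specification per letter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementSpec:
    """Hypothetical speech element specification: a phoneme and a prosody label."""
    phoneme: str
    prosody: str


class Server:
    """Stores speech element information in association with its specification."""

    def __init__(self, store):
        # store maps ElementSpec -> speech element information (bytes here).
        self._store = dict(store)

    def lookup(self, specs):
        # Return the stored element for each received specification,
        # in the same order in which the specifications were received.
        return [self._store[spec] for spec in specs]


class Client:
    """Holds no speech element information of its own."""

    def __init__(self, server):
        self._server = server

    def text_to_specs(self, text):
        # Toy assumption: one specification per letter with a flat prosody.
        return [ElementSpec(ch, "flat") for ch in text.lower() if ch.isalpha()]

    def synthesize(self, text):
        specs = self.text_to_specs(text)         # generate specifications
        elements = self._server.lookup(specs)    # request elements from the server
        return b"".join(elements)                # concatenate into speech information
```

In this sketch the client keeps only the text front end, so all stored speech element information resides on the server, which corresponds to the storage advantage attributed to the known system.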
According to this speech synthesis system, the client device does not need to store speech element information, and it is therefore possible to keep a large storage region available on the client device.
In the abovementioned speech synthesis system, the server device transmits speech element information so that the client device receives it in the same order as the speech elements are arranged in the speech corresponding to the text represented by the text information. Therefore, in a case where part of the text corresponds to part of the speech uttered by the speech registering user, the server device transmits a portion including consecutive speech elements of that speech to the client device with the order of arrangement of the speech elements in the speech maintained.
In such a case, there has been a problem that information transmitted from the server device to the client device is relatively likely to be monitored (fraudulently acquired) by a fraudulent user, who thereby acquires the portion including the consecutive speech elements in the speech uttered by the speech registering user. If the speech is acquired by the fraudulent user, there is a risk that, for example, the acquired speech is used in an authentication process by voice (a voice authentication process) and the fraudulent user is thereby recognized as the speech registering user.
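The exposure described above can be illustrated with a small sketch. It rests on assumptions that are not in the cited system: each speech element is identified by a hypothetical integer giving its position in the speech registering user's recorded utterance, and the function names transmit_in_order and longest_consecutive_run are invented for this example.

```python
def transmit_in_order(element_positions):
    """Model of in-order transmission: elements leave the server in text order."""
    return list(element_positions)


def longest_consecutive_run(observed):
    """Length of the longest run of adjacent recording positions seen on the wire."""
    if not observed:
        return 0
    best = run = 1
    for prev, cur in zip(observed, observed[1:]):
        # A run continues only when the next element directly follows
        # the previous one in the original recording.
        run = run + 1 if cur == prev + 1 else 1
        best = max(best, run)
    return best


# Part of the text matches positions 12 to 15 of the recorded speech.
observed = transmit_in_order([12, 13, 14, 15, 40, 41, 7])
print(longest_consecutive_run(observed))  # prints 4
```

Under these assumptions, anyone monitoring the transmission obtains four consecutive speech elements in their original order, that is, a contiguous fragment of the registering user's speech.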
Accordingly, an object of the present invention is to provide a speech synthesis system capable of solving the aforementioned problem, namely, that a portion including consecutive speech elements in the speech uttered by the speech registering user is acquired by a fraudulent user.