There is a known speech synthesis system that includes a server device and a client device (refer to Patent Document 1). The server device is configured to store speech element information representing the respective speech elements included in speech uttered by a user (a speech registering user). The client device is configured to execute a speech synthesis process, that is, to generate speech information by converting text into speech based on text information representing the text.
Based on input text information, this client device generates speech element specification information (for example, information representing a phoneme and a prosody) that specifies a speech element. The client device then transmits the generated speech element specification information to the server device.
The server device stores, in advance, speech element information and speech element specification information in association with each other. On receiving speech element specification information from the client device, the server device transmits to the client device the speech element information stored in association with it. The client device then executes the speech synthesis process based on the speech element information received from the server device.
According to this speech synthesis system, the client device does not need to store speech element information, so a large storage region remains available for other use by the client device.
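The exchange in this known system can be illustrated with a minimal sketch. It is not taken from Patent Document 1: the names `ElementSpec`, `Server`, and `Client`, the one-element-per-character text analysis, and the use of byte strings as speech element information are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ElementSpec:
    """Hypothetical stand-in for speech element specification information."""
    phoneme: str
    prosody: str


class Server:
    """Holds speech element information keyed by specification information."""

    def __init__(self, store: Dict[ElementSpec, bytes]) -> None:
        self._store = store

    def lookup(self, specs: List[ElementSpec]) -> List[bytes]:
        # Return the element stored in association with each received spec.
        return [self._store[spec] for spec in specs]


class Client:
    """Stores no speech elements; it only derives specs and concatenates."""

    def __init__(self, server: Server) -> None:
        self._server = server

    def specs_for(self, text: str) -> List[ElementSpec]:
        # Toy text analysis (assumption): one phoneme per character, flat prosody.
        return [ElementSpec(phoneme=ch, prosody="flat") for ch in text]

    def synthesize(self, text: str) -> bytes:
        elements = self._server.lookup(self.specs_for(text))
        # Toy synthesis (assumption): concatenate the received elements.
        return b"".join(elements)


store = {
    ElementSpec("a", "flat"): b"\x01",
    ElementSpec("b", "flat"): b"\x02",
}
client = Client(Server(store))
speech = client.synthesize("ab")
```

In this sketch only the server holds the `store` dictionary, which mirrors the point that the client device needs no storage for speech element information.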
[Patent Document 1] JP2003-233386 A
In order to reduce the amount of information transmitted from the client device to the server device, it is considered preferable for the client device to transmit, instead of speech element specification information, speech element identification information representing an integer that identifies a speech element.
In this case, for example, the speech synthesis system is configured so that the client device stores, in advance, speech element specification information and speech element identification information in association with each other, and the server device stores, in advance, speech element identification information and speech element information in association with each other.
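As a rough illustration of this configuration, the two associations can be pictured as two lookup tables. The table contents, the identifier values, and the function names `client_request` and `server_respond` below are hypothetical and serve only to show which side holds which mapping.

```python
from typing import Dict, List, Tuple

# Client side (assumed contents): specification information -> integer identifier.
CLIENT_TABLE: Dict[Tuple[str, str], int] = {
    ("a", "flat"): 17,
    ("i", "rising"): 4,
}

# Server side (assumed contents): integer identifier -> speech element information.
SERVER_TABLE: Dict[int, bytes] = {
    17: b"\x0a",
    4: b"\x0b",
}


def client_request(specs: List[Tuple[str, str]]) -> List[int]:
    """Translate (phoneme, prosody) pairs into the integers that are transmitted."""
    return [CLIENT_TABLE[spec] for spec in specs]


def server_respond(ids: List[int]) -> List[bytes]:
    """Return the speech element stored for each received identifier."""
    return [SERVER_TABLE[i] for i in ids]


ids = client_request([("a", "flat"), ("i", "rising")])
elements = server_respond(ids)
```

Only the short integer list `ids` crosses from client to server here, which is the reduction in transmitted information that motivates the configuration.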
Suppose that, in this configuration, the server device stores speech element identification information and speech element information in association with each other so that the integers represented by the speech element identification information increase by one in the order in which the speech elements are arranged in the speech. Then, when a client device used by a fraudulent user transmits a plurality of integers that increase by one (i.e., consecutive integers), the server device transmits to the client device a portion of the speech that includes consecutive speech elements, with the order of arrangement of the speech elements in the speech maintained.
Accordingly, in such a case, there has been a problem that the fraudulent user is relatively likely to acquire a portion of the speech uttered by the speech registering user that includes consecutive speech elements. If the speech is acquired by the fraudulent user, there is a risk that, for example, the acquired speech is used in an authentication process by voice (a voice authentication process) and the fraudulent user is thereby recognized as the speech registering user.
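The problem described above can be reproduced with a short sketch. The recorded utterance, its segmentation into elements, and the name `server_lookup` are hypothetical; the only property taken from the description is that the identifiers increase by one in the order of the speech elements in the speech.

```python
from typing import Dict, Iterable, List

# Hypothetical recording: speech elements in the order they were uttered.
UTTERANCE: List[str] = ["k", "o", "n", "n", "i", "ch", "i", "w", "a"]

# Server table under the stated assumption: identifiers 0, 1, 2, ... follow
# the order of arrangement of the speech elements in the speech.
server_table: Dict[int, str] = {i: elem for i, elem in enumerate(UTTERANCE)}


def server_lookup(ids: Iterable[int]) -> List[str]:
    """Return the speech element stored for each received identifier."""
    return [server_table[i] for i in ids]


# A fraudulent client simply transmits consecutive integers.
stolen = server_lookup(range(2, 6))

# The response is a contiguous portion of the original speech, in order.
is_contiguous_portion = stolen == UTTERANCE[2:6]
```

Because the identifiers track the utterance order, requesting every identifier from 0 upward would return the entire recorded speech in this sketch.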