Speech recognition is a technology that receives a user's speech, automatically converts the speech into text, and recognizes the text. Speech recognition has recently been adopted in smartphones and TVs as a substitute for keyboard input.
A speech recognition system is divided into a client, which receives a speech signal as input, and an automatic speech recognition (ASR) engine, which performs speech recognition based on the speech signal. The client and the ASR engine may be designed separately. In this case, a smartphone or a TV may be configured as the client, and the ASR engine may be configured as a server.
The speech recognition system generally consists of an acoustic model (AM) and a language model (LM). The AM models the speech signal and is generated by a statistical method applied to a large amount of collected speech data. The LM is a grammatical model of a user's speech and is likewise obtained by a statistical learning method applied to a large amount of collected text data.
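As an illustration of how an AM and an LM can work together, the following minimal sketch picks the candidate transcript with the best combined score. The text above does not state how the two models are combined; adding an acoustic log-probability to a weighted language log-probability is a common convention and is assumed here. The function name `pick_transcript`, the `lm_weight` parameter, and the scores are hypothetical.

```python
def pick_transcript(candidates, lm_weight=1.0):
    """Return the text of the best-scoring candidate.

    candidates: list of (text, am_log_prob, lm_log_prob) tuples, where
    am_log_prob is an assumed acoustic-model score for the audio and
    lm_log_prob is an assumed language-model score for the text.
    """
    # Combined score (assumed convention): AM score + weight * LM score.
    best = max(candidates, key=lambda c: c[1] + lm_weight * c[2])
    return best[0]


# Hypothetical scores for two acoustically similar candidates.
candidates = [
    ("recognize speech", -12.0, -3.0),
    ("wreck a nice beach", -11.5, -7.0),
]

with_lm = pick_transcript(candidates)                    # LM contributes
without_lm = pick_transcript(candidates, lm_weight=0.0)  # AM score only
```

With these made-up numbers, the acoustic scores alone slightly favor the second candidate, while the language model score tips the decision toward the first.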
A large amount of data needs to be collected to ensure the performance of the AM and the LM. A speaker-independent model refers to a model formed from the speech of a plurality of unspecified speakers, whereas a speaker-dependent model refers to a model formed by collecting data from a specified speaker. If a sufficient amount of data can be collected from the specified speaker, the speaker-dependent model may have higher performance than the speaker-independent model.
Since it may be realistically difficult to collect enough data from a specified speaker to ensure performance, a method of efficiently modifying an existing speaker-independent AM by using an appropriate amount of data has been provided. This method is referred to as speaker adaptation for an AM.
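The text does not name a specific adaptation technique. As one well-known possibility, the sketch below shows a MAP-style update of a single Gaussian mean, in which the speaker-independent mean acts as a prior and is pulled toward the speaker's data in proportion to how much data is available. The function name `adapt_mean` and the prior weight `tau` are assumptions made for illustration.

```python
def adapt_mean(si_mean, speaker_frames, tau=10.0):
    """MAP-style interpolation of one Gaussian mean (illustrative only).

    si_mean:        mean vector from the speaker-independent model
    speaker_frames: feature vectors collected from the specified speaker
    tau:            assumed prior weight; larger values trust si_mean more
    """
    n = len(speaker_frames)
    dim = len(si_mean)
    # Per-dimension sum of the speaker's feature vectors.
    sums = [sum(frame[d] for frame in speaker_frames) for d in range(dim)]
    # With n == 0 this returns si_mean unchanged; as n grows, the result
    # approaches the speaker's own sample mean.
    return [(tau * si_mean[d] + sums[d]) / (tau + n) for d in range(dim)]


no_data = adapt_mean([0.0, 0.0], [])
some_data = adapt_mean([0.0, 0.0], [[2.0, 4.0]] * 10)
```

With ten frames and `tau=10.0`, the adapted mean lands halfway between the speaker-independent mean and the speaker's sample mean, which is how a modest amount of data can shift an existing model without replacing it.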
To employ speaker adaptation for an AM, a process of collecting data from a specified speaker is necessary. To this end, a process of collecting data by registering a specified speaker has been employed in the related art.
As an example, a first user may create an account and perform a user registration process in order to use a speech recognition service. In the user registration process, the user may read a predetermined sentence aloud to register for the speech recognition service, and in this case, accurate data may be obtained. However, this may inconvenience the user, since the user is required to complete an additional process before using the service.
As another method, speech recognition may be performed right after a user's account is first created, and data obtained during the speech recognition may be used to perform speaker adaptation. In this case, speaker adaptation begins with the second utterance of the first user, or a speaker adaptation model is applied from the second connection to the speech recognition service onward.
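The timing described in this second method can be sketched as follows: the first utterance is recognized with the speaker-independent model while its data is collected, and later utterances can use the adapted model. The class `SpeakerSession` and its fields are hypothetical, and the actual adaptation step is reduced to a flag for brevity.

```python
class SpeakerSession:
    """Tracks which model a newly created account would use (illustrative)."""

    def __init__(self):
        self.collected = []    # utterances gathered as adaptation data
        self.adapted = False   # True once adaptation data exists

    def recognize(self, utterance):
        # Choose the model before this utterance's data is added, so the
        # very first utterance cannot benefit from adaptation.
        model = "speaker-adapted" if self.adapted else "speaker-independent"
        self.collected.append(utterance)
        self.adapted = True    # stands in for running speaker adaptation
        return model


session = SpeakerSession()
first = session.recognize("first utterance")
second = session.recognize("second utterance")
```

In this sketch the first call returns the speaker-independent label and the second returns the speaker-adapted label, mirroring the one-utterance delay described above.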