Speech is the acoustic expression of language. It is the most natural, effective, and convenient means for humans to exchange information, and it is also a medium for carrying human thought. Automatic Speech Recognition (ASR) usually refers to a process in which a device, such as a computer, converts human speech into corresponding output text or instructions through speech recognition and interpretation. The core framework is statistical: given a characteristic sequence O extracted from a to-be-recognized speech signal, the optimal word sequence W* corresponding to that signal is calculated using the following Bayes decision rule:
W* = argmax_W P(O|W) P(W)
where the maximization is taken over candidate word sequences W, P(O|W) is the probability of observing O given W, and P(W) is the prior probability of W.
In some implementations, the above process of arriving at the optimal word sequence is referred to as a decoding process, and a module that performs the decoding function is usually referred to as a decoder. In other words, the optimal word sequence given by the equation above is found by searching a search space formed from a variety of knowledge sources, such as lexicons, language models, and the like.
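As an illustration of the decision rule, the following is a minimal sketch in Python that applies it to a small, explicitly enumerated set of candidate word sequences. The candidate list, the probability values, and the names ACOUSTIC_PROB, LM_PROB, score, and decode are hypothetical and chosen only for this example; a practical decoder would search a much larger space built from lexicons and language models rather than enumerate candidates.

```python
import math

# Hypothetical toy values for one utterance O. In a real system P(O|W) would
# come from an acoustic model and P(W) from a language model.
ACOUSTIC_PROB = {  # P(O|W)
    ("call", "zhang", "san"): 0.30,
    ("cold", "jang", "sun"): 0.35,
    ("call", "john", "son"): 0.20,
}
LM_PROB = {  # P(W)
    ("call", "zhang", "san"): 0.020,
    ("cold", "jang", "sun"): 0.0001,
    ("call", "john", "son"): 0.010,
}


def score(words):
    """Return log P(O|W) + log P(W); logs avoid underflow on long sequences."""
    return math.log(ACOUSTIC_PROB[words]) + math.log(LM_PROB[words])


def decode(candidates):
    """Return the candidate W* that maximizes P(O|W) * P(W)."""
    return max(candidates, key=score)


best = decode(list(ACOUSTIC_PROB))
print(" ".join(best))  # prints "call zhang san"
```

In this toy example the second candidate has the highest P(O|W), but its very low P(W) means that the product P(O|W)P(W) favors the first candidate.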
With the development of various technologies, hardware computation capabilities and storage capacities have greatly improved. Speech recognition systems have gradually been applied in industry, and various applications that use speech as a human-machine interaction medium have appeared on client devices. For example, a calling application on a smartphone can automatically place a phone call when a user simply gives a speech instruction (e.g., “call Zhang San”).
Existing speech recognition applications typically use one of two modes. The first mode is based on a client and a server: the client collects speech and uploads it via a network to the server, and the server decodes the speech to obtain text and sends the text back to the client. This mode is adopted because the client has relatively weak computation capability and limited memory space, while the server has significant advantages in both respects. If there is no network access when this mode is used, however, the client is unable to complete the speech recognition function. In light of this problem, a second mode of speech recognition application has been developed that depends only on the client. In this mode, the model and the search space that were originally stored on the server are downsized and stored locally on the client device, and the client completes the operations of speech collection and decoding on its own.
In an actual application, when the above general framework is used for speech recognition in either the first mode or the second mode, it is usually impossible to effectively recognize content in a speech signal that is related to local information of a client device, e.g., a contact name in Contacts. This leads to low recognition accuracy, inconveniences the user, and degrades the user experience.