In the field of telecommunication, speech recognition is employed in various communication services, meaning that a user is able to speak voice commands into a User Equipment, UE, for controlling some functionality therein or in a communication network, rather than entering written commands by pressing buttons on a keyboard or the like. In some applications, a speech recognition function in the UE or in the network is able to translate the entered voice command into text, such as a recognizable message or just a single word. A voice command spoken into the UE may also be sent in digitally encoded form to a speech recognition entity where the actual speech recognition is executed by analyzing and translating the speech into corresponding text. Recently, speech recognition has been applied in smartphones, e.g. in the speech-based service called “Siri” developed for Apple iPhones.
FIG. 1 illustrates an example of how conventional speech recognition can be used in a communication network for controlling some service function or apparatus, which could be any voice-controllable device or function such as a teleconference bridge, a banking service, an electronic game, functions in a telephone or computer, control of various home appliances, and so forth. Thus, when a spoken command is entered in a UE 100, shown as an action 1:1, the UE 100 provides a digitized version of the speech as signals to a speech recognition entity 102, shown as another action 1:2. The speech recognition entity 102 then translates the received speech signals into a text version of the speech, in an action 1:3. As mentioned above, the speech recognition entity 102 may be implemented in the network or in the UE 100 itself.
Possibly, the entity 102 may also utilize a function referred to as “Artificial Intelligence”, AI, 104 to make a more or less elaborate interpretation of the spoken command, as shown by a schematic action 1:4. In that case, the AI function 104 basically deduces the meaning of a spoken question or command once it has been converted to text by the speech recognition entity 102. As a result, the speech recognition entity 102 may issue a control message or command corresponding to the entered speech, as shown in an action 1:5, which controls or otherwise interacts with a service function or apparatus 106. The service function or apparatus 106 may then process the control message and operate accordingly, such as by providing a suitable response back to the UE 100, as shown by a final action 1:6.
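The flow of actions 1:3 to 1:6 can be illustrated by the following minimal sketch. It is not taken from any actual implementation: the names ControlMessage, recognize, interpret, service_function and handle_spoken_command are hypothetical, the "speech signal" is simply UTF-8 text standing in for digitized audio, and the interpretation is a toy rule rather than an AI function.

```python
from dataclasses import dataclass


@dataclass
class ControlMessage:
    """Hypothetical control message issued in action 1:5."""
    command: str
    argument: str


def recognize(speech_signal: bytes) -> str:
    """Action 1:3: stand-in for speech-to-text.

    Assumption: the 'signal' already carries UTF-8 text, so decoding
    replaces real acoustic analysis.
    """
    return speech_signal.decode("utf-8").strip().lower()


def interpret(text: str) -> ControlMessage:
    """Action 1:4: toy interpretation, first word is the command."""
    command, _, argument = text.partition(" ")
    return ControlMessage(command=command, argument=argument)


def service_function(message: ControlMessage) -> str:
    """Action 1:6: toy service that only understands 'mute'."""
    if message.command == "mute":
        return f"muted {message.argument}"
    return f"unsupported command: {message.command}"


def handle_spoken_command(speech_signal: bytes) -> str:
    text = recognize(speech_signal)    # action 1:3
    message = interpret(text)          # actions 1:4 and 1:5
    return service_function(message)   # action 1:6
```

In this sketch, handle_spoken_command(b"Mute participant three") passes through all three stages and returns the response that would be sent back to the UE 100.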
In general, the speech recognition services known today include two parts: the actual speech recognition and the interpretation thereof, e.g. by means of an AI function or the like. In typical implementations, both of these parts may reside in the UE, or partly or completely in nodes of the network. In the above-mentioned service Siri for iPhones, a simplified speech analysis and AI analysis is made by the phone, which in parallel may send the speech in text form to an AI function in the network for obtaining a more advanced analysis and for creating a suitable response or other action.
Voice-controlled applications are configured to operate according to received speech input in the form of commands or queries. An example is an electronic game application implemented in a game server in the network, which may receive various spoken lines from game participants for controlling the ongoing game. One or more words in a received speech input are typically significant for the command or query and are therefore often called “keywords” in this field. The one or more keywords in a received speech input must be recognized such that the application is able to act and operate upon the speech input in a proper manner. To support this process, some kind of automatic speech analysis of the speech input needs to be made.
Computer-implemented speech analysis may be executed according to several different techniques. A first example is generally referred to as “speech recognition”, where all speech received in audio form is translated, word by word, into a text version of the entire speech input, thus comprising a chain of words. It is then easy for a computer to identify any keywords occurring in the text.
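Once the entire speech input is available as text, identifying keywords reduces to a simple lookup, as the following sketch suggests. The function name find_keywords and the example keyword set are assumptions made for illustration, and the transcript is assumed to have already been produced by a speech recognizer.

```python
import re


def find_keywords(transcript: str, keywords: set[str]) -> list[tuple[int, str]]:
    """Return (word position, keyword) pairs for keywords found in the text."""
    # Split the transcript into a chain of lower-case words.
    words = re.findall(r"[a-z0-9']+", transcript.lower())
    return [(i, w) for i, w in enumerate(words) if w in keywords]
```

For instance, calling find_keywords on the transcript "Please transfer fifty dollars to my savings account" with the keywords "transfer" and "savings" reports both words together with their positions in the chain of words.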
A second example is referred to as “keyword spotting”, which does not require translation of the entire speech input into text. Instead, the audio is searched only for specific words or phrases by recognizing their sound, at least approximately, and the words or phrases found are then translated into text. In general, keyword spotting requires less computing than speech recognition since only a limited set of words or phrases must be recognized for translation instead of an entire vocabulary.
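One very simplified way to picture keyword spotting is as a sliding-window comparison between the incoming audio and a stored sound template of the keyword. The sketch below is only an illustration under strong assumptions: audio is represented as a list of small numeric feature frames, the function name spot_keyword and the threshold are hypothetical, and practical systems use considerably more robust matching than a plain frame-by-frame distance.

```python
import math


def spot_keyword(frames, template, threshold):
    """Return start indices where the template approximately matches.

    frames and template are sequences of equal-length feature tuples
    (assumed to come from an acoustic front end that is not shown).
    A window matches when its mean per-frame Euclidean distance to the
    template does not exceed the threshold.
    """
    n = len(template)
    if n == 0:
        return []
    hits = []
    for start in range(len(frames) - n + 1):
        window = frames[start:start + n]
        mean_dist = sum(math.dist(a, b) for a, b in zip(window, template)) / n
        if mean_dist <= threshold:
            hits.append(start)
    return hits
```

Only the positions where the keyword's sound is found need to be translated into text, which is why the computing effort stays lower than for full speech recognition.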
A third example is referred to as “phonetic-based search”, which is similar to keyword spotting in that only certain words are searched for and identified in the speech input, although it does not require converting the speech input into text. In phonetic-based search, the process is divided into separate indexing and searching stages. In the indexing stage, the speech input is indexed to produce a phonetic search track, which is a phonetic representation of the speech rather than words in text form. Once the indexing has been completed, the searching stage includes searching the phonetic search track for a keyword in the form of a phoneme sequence, i.e. a sound-based sequence.
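The two stages of phonetic-based search can be sketched as follows. This is a minimal illustration, not a description of any particular product: the function names build_search_track and search_track are hypothetical, the phoneme symbols are ARPAbet-style labels chosen for the example, and the phoneme sequence is assumed to have been produced from the audio by a front end that is not shown.

```python
from collections import defaultdict


def build_search_track(phonemes):
    """Indexing stage: record the phoneme sequence and where each phoneme occurs."""
    positions = defaultdict(list)
    for i, phoneme in enumerate(phonemes):
        positions[phoneme].append(i)
    return {"phonemes": list(phonemes), "positions": dict(positions)}


def search_track(track, keyword_phonemes):
    """Searching stage: return start positions where the keyword's phonemes occur."""
    keyword = list(keyword_phonemes)
    if not keyword:
        return []
    sequence = track["phonemes"]
    # Only positions of the keyword's first phoneme can start a match.
    candidates = track["positions"].get(keyword[0], [])
    return [s for s in candidates if sequence[s:s + len(keyword)] == keyword]
```

With a track built from the phonemes K AO L HH OW M N AW (roughly "call home now"), searching for HH OW M locates the keyword without any word in the speech input having been converted to text. Separating the stages means the track is built once and can then be searched for many different keywords.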
Even though certain significant keywords can be recognized and identified in a received speech input, e.g. using any of the above techniques, some applications may need to act and operate upon received keywords in different ways depending on the current situation. For example, a command may require certain actions when coming from one user and other actions when coming from another user. Further, some keywords may be significant for the application to act upon in one situation while other keywords may be significant for the application in another situation. It is thus a problem in currently known solutions that the use of keywords in speech input for controlling applications is somewhat static or inflexible and not adaptable to different situations.