The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
Speech recognition refers to the function of converting speech input by a user into text according to a specific grammar. For example, in an interactive application, a system first plays the prompt “What do you want? Water, soda, or fruit juice?” to a user. The user may reply by speech, and the reply may include only the pronunciation of the key words “water”, “soda”, “fruit juice” or “nothing”. The system recognizes the speech of the user and then provides the selected item to the user.
In fixed or mobile network applications, a user usually provides input in one of two ways.
Method 1: The user inputs Dual Tone Multi-Frequency (DTMF) digits. For example, in the interactive application described above, pressing “1” indicates that “water” is selected; pressing “2” indicates that “soda” is selected; pressing “3” indicates that “fruit juice” is selected; and pressing any other key indicates that “nothing” is wanted. Such a method has been defined in the H.248 protocol.
Method 2: The user directly inputs speech, and the system may deliver the speech to the other communicating party, record it, or perform speech recognition on it.
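As an illustration only, the key-to-selection mapping in the example above can be sketched as a simple lookup. The names `DTMF_MENU` and `selection_from_dtmf` are hypothetical and are not taken from the H.248 protocol.

```python
# Hypothetical lookup table for the example menu; illustrative only.
DTMF_MENU = {"1": "water", "2": "soda", "3": "fruit juice"}


def selection_from_dtmf(digit: str) -> str:
    """Return the menu item for a pressed key.

    Any key other than "1", "2" or "3" is treated as "nothing",
    as in the example above.
    """
    return DTMF_MENU.get(digit, "nothing")
```

For instance, `selection_from_dtmf("2")` yields “soda”, and `selection_from_dtmf("#")` yields “nothing”.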
A function similar to DTMF input may be accomplished via the speech recognition process, in which the system determines the user selection according to the speech of the user. The advantage of speech recognition is that the user may interact with a system directly by speech, without any auxiliary input device such as a keypad for inputting DTMF, so that the user input mode is simplified. As speech recognition technology improves, it is expected to become the predominant input mode.
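As a minimal sketch of how a selection may be determined from recognized speech, the following assumes that a recognizer (not shown) has already converted the speech into text, and checks that text against the key words of the example grammar. The names `GRAMMAR` and `match_keyword` are hypothetical.

```python
import string
from typing import Optional

# Key words permitted by the example grammar (from the example above).
GRAMMAR = ("water", "soda", "fruit juice", "nothing")


def match_keyword(transcript: str) -> Optional[str]:
    """Return the key word spoken by the user, or None if the reply
    does not match the grammar.

    Assumption: `transcript` is text already produced by a speech
    recognizer; the recognition step itself is outside this sketch.
    """
    # Lower-case, drop punctuation, and collapse runs of whitespace.
    cleaned = transcript.lower().translate(
        str.maketrans("", "", string.punctuation)
    )
    normalized = " ".join(cleaned.split())
    return normalized if normalized in GRAMMAR else None
```

A reply outside the grammar, such as “coffee”, returns `None`, in which case the system may, for example, replay the prompt.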
The H.248 protocol defines a rich set of media resource control methods via packages.
The H.248.9 protocol defines methods via the Advanced Media Server Packages, including the following.
The method of playing a speech segment, in which the location of the speech segment may be indicated by a Uniform Resource Identifier (URI), and parameters such as the number of iterations of the playing of the speech segment, the interval of silence to be inserted between iterative plays, and the volume and speed of each playing may be indicated;
The method of playing a tone and collecting DTMF, in which the prompt tone playing and the DTMF collection are performed interactively; and
The method of audio recording, in which the ID or the storage location of a record file is returned.
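For illustration, the parameters of the speech segment playing method listed above can be sketched as a small data structure, together with a function that expands them into an ordered sequence of play and silence events. The class, field, and function names are hypothetical and are not the parameter names defined in H.248.9; the units of the interval, volume, and speed are likewise assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PlayRequest:
    """Hypothetical container for speech segment playing parameters."""
    segment_uri: str      # location of the speech segment (URI)
    iterations: int = 1   # number of times the segment is played
    interval_ms: int = 0  # silence between plays (milliseconds assumed)
    volume: int = 0       # volume adjustment (unit assumed)
    speed: int = 0        # speed adjustment (unit assumed)


def playback_schedule(req: PlayRequest) -> List[Tuple[str, object]]:
    """Expand a request into ordered ("play", uri) / ("silence", ms) events.

    Silence is inserted only between consecutive plays, never before
    the first play or after the last one.
    """
    events: List[Tuple[str, object]] = []
    for i in range(req.iterations):
        if i > 0 and req.interval_ms > 0:
            events.append(("silence", req.interval_ms))
        events.append(("play", req.segment_uri))
    return events
```

With two iterations and a 500 ms interval, the schedule is one play, one period of silence, and a second play.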
The H.248.7 protocol defines a method for playing a record according to an announcement ID.
The H.248.16 protocol defines a method for a complex DTMF digit collection operation.
However, the H.248 protocol does not define a method for a user to directly input speech, although the speech recognition function is needed in the media resource application environment.