1. Field of the Invention
The invention relates to a method for speech recognition and a communication device, in particular a mobile telephone or a portable computer having a speech recognition entity.
2. Description of the Related Art
Communication devices such as mobile telephones or portable computers have undergone progressive miniaturization in recent years in order to facilitate their use when travelling. As well as providing greater portability, this progressive miniaturization poses significant problems with regard to ease of use. Because the housing surface is smaller than that of earlier communication devices, it can no longer be equipped with a number of keys that corresponds to the functionality of the devices.
Since increasing the multiple assignment of keys is also incompatible with ease of operation, communication devices of greater complexity offer facilities for speech control. This requires speech recognition facilities in the communication device.
Many communication devices therefore offer speaker-independent speech control. In the context of speech control, the user enters command words such as “dial”, “phone book”, “emergency”, “reject” or “accept” in a manner which is similar to the customary speed dialing. The telephone application which is associated with these command words can be used directly in the corresponding manner by a user, without the user personally having to train the system for this purpose beforehand using this stock of words.
For this speaker-independent speech recognition, speech samples of the stock of words of many different people have been collected in a database, the people forming a representative cross section of the user group of the speech recognition system. In order to ensure that a representative cross section is provided, the selection of people takes into consideration different dialects, age groups and genders. With the aid of a so-called “cluster method”, e.g. an iterative algorithm, similar speech samples are combined in groups or so-called clusters. A phoneme, a phoneme cluster, or possibly even a whole word is assigned to the groups or clusters in each case. Within each group or cluster, a plurality of typical representations of the phoneme, of the phoneme cluster, or of a whole word therefore exists in the form of corresponding model vectors. In this way, the speech characteristics of many different people are captured using few representations.
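The grouping of similar speech samples into clusters can be illustrated by the following minimal sketch. It assumes a k-means style iteration, which is only one possible "cluster method"; the source does not name a specific algorithm. The function names, the sample values and the seeding of the centroids with the first k samples are assumptions made for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cluster_samples(samples, k, iterations=10):
    """Iteratively group samples into k clusters and return one centroid
    (a candidate model vector) per cluster."""
    # Assumption: the first k samples seed the centroids, which keeps the
    # sketch deterministic. Real systems typically seed differently.
    centroids = [list(s) for s in samples[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        # Assignment step: each sample joins its nearest centroid.
        for s in samples:
            nearest = min(range(k), key=lambda i: squared_distance(s, centroids[i]))
            groups[nearest].append(s)
        # Update step: each centroid becomes the mean of its group.
        for i, group in enumerate(groups):
            if group:
                centroids[i] = [sum(col) / len(group) for col in zip(*group)]
    return centroids


# Hypothetical two-dimensional speech samples from several speakers.
samples = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2), (4.9, 5.0)]
model_vectors = cluster_samples(samples, k=2)
```

Six samples are thereby reduced to two representative vectors, which mirrors the statement that the speech characteristics of many people are captured using few representations.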
However convenient this factory-predetermined vocabulary of command words and possibly also names may be for the user, it does not replace a user-specific adaptation, e.g. the insertion of new commands. This applies particularly in the case of name selection, i.e. a special speech control wherein specific numbers are dialed when the name is spoken. Devices of greater complexity therefore offer a speaker-dependent speech control in addition to a speaker-independent speech control.
A speaker-dependent speech recognition system is optimized in relation to the speaker concerned, since it must be trained on the voice of the user before its first use. This is known as “say-in” or training. The say-in is used to create a feature vector sequence comprising at least one feature vector.
Such a system, in which speaker-dependent and speaker-independent speech recognition are used in combination, can be seen in operation, i.e. during the speech recognition, in FIG. 1. In the context of feature extraction FE, a speech signal SS is temporally divided into time frames F (framing) and subjected to preprocessing PP. One time frame can coincide with one sound or include a plurality of sounds. Likewise, a plurality of time frames may be required in order to form a sound. Each time frame is subjected to a Fourier transformation as part of the preprocessing PP.
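The framing and the Fourier transformation of the preprocessing PP can be sketched as follows. The frame length, the hop size and the test signal are assumptions chosen for illustration, and a plain discrete Fourier transform stands in for whatever transform and further preprocessing an actual device would use.

```python
import cmath
import math


def split_into_frames(signal, frame_length, hop):
    """Divide the speech signal into (possibly overlapping) time frames."""
    return [signal[start:start + frame_length]
            for start in range(0, len(signal) - frame_length + 1, hop)]


def magnitude_spectrum(frame):
    """Magnitudes of the discrete Fourier transform of one frame
    (bins 0 .. n/2, computed directly rather than with an FFT)."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]


# Hypothetical signal: a sine wave with two cycles per 16 samples.
signal = [math.sin(2 * math.pi * 2 * t / 16) for t in range(64)]
frames = split_into_frames(signal, frame_length=16, hop=8)
spectra = [magnitude_spectrum(f) for f in frames]
```

In this sketch each spectrum peaks at frequency bin 2, reflecting the two cycles contained in every frame.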
The result of this transformation and of further preprocessing is a feature vector F_IS, which is subjected to a dimensional reduction by a linear discriminant analysis LDA-L, thereby producing a dimensionally reduced feature vector F_S. Since the dimensional reduction LDA-L is language-specific, the dimensionally reduced feature vector is likewise language-specific. Taking this dimensionally reduced feature vector F_S as a starting point, a speaker-independent speech recognition HMM-SI is carried out with reference to a monolingual language resource HMM-L. Alternatively, a speaker-dependent speech recognition SD is carried out on the basis of the feature vector F_IS.
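The dimensional reduction LDA-L can be viewed as the multiplication of the feature vector F_IS by a language-specific matrix. The following sketch assumes that such a matrix has already been determined; the matrix values, the vector values and the function name are hypothetical and serve only to show the reduction from four dimensions to two.

```python
def reduce_dimension(feature_vector, lda_matrix):
    """Project a feature vector onto the rows of a precomputed LDA matrix."""
    return [sum(w * x for w, x in zip(row, feature_vector)) for row in lda_matrix]


# Hypothetical language-specific LDA matrix (2 rows x 4 columns).
LDA_L = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, -0.5],
]

f_is = [1.0, 3.0, 2.0, 4.0]          # feature vector F_IS (illustrative values)
f_s = reduce_dimension(f_is, LDA_L)  # dimensionally reduced feature vector F_S
```

Because the matrix is derived from language-specific training data, a different language would use a different matrix and therefore yield a different F_S for the same F_IS.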
For the speaker-independent speech recognition HMM-SI, the distances D are calculated between the relevant dimensionally reduced feature vector F_S and the model vectors which are present in the language resource HMM-L. During operation, this distance calculation provides a basis for determining an assignment to a model vector, or an assignment of a feature vector sequence to a model vector sequence. The information about the permitted model vector sequences is held in the speaker-independent vocabulary VOC-SI-L which was created by the factory or manufacturer. In the case of a search S using a trellis algorithm, for example, the suitable assignment or sequence of model vectors is determined on the basis of the distance calculation with reference to the vocabulary VOC-SI-L. A command word which is assigned to the model vectors is produced as a result R of the search S.
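The interplay of distance calculation D, vocabulary VOC-SI-L and search S can be sketched as follows. This is a simplified illustration, not the HMM-based recognition itself: plain Euclidean distances are accumulated instead of probabilities, each word is assumed to be a left-to-right sequence of model vector indices, and all names and values are hypothetical.

```python
import math


def trellis_cost(features, model_sequence, model_vectors):
    """Minimum accumulated distance when aligning the feature vector
    sequence to a left-to-right sequence of model vectors."""
    inf = float("inf")
    n_states = len(model_sequence)
    # cost[j]: best accumulated distance of a path ending in state j.
    cost = [inf] * n_states
    cost[0] = math.dist(features[0], model_vectors[model_sequence[0]])
    for frame in features[1:]:
        new_cost = [inf] * n_states
        for j in range(n_states):
            # A path either stays in state j or advances from state j - 1.
            best_prev = min(cost[j], cost[j - 1] if j > 0 else inf)
            new_cost[j] = best_prev + math.dist(frame, model_vectors[model_sequence[j]])
        cost = new_cost
    return cost[-1]


def search(features, vocabulary, model_vectors):
    """Return the vocabulary word whose model sequence fits best."""
    return min(vocabulary,
               key=lambda word: trellis_cost(features, vocabulary[word], model_vectors))


# Hypothetical language resource and speaker-independent vocabulary.
HMM_L = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
VOC_SI_L = {"dial": [0, 1], "accept": [0, 2]}

utterance = [(0.1, 0.0), (0.9, 0.1), (1.0, 0.0)]
result = search(utterance, VOC_SI_L, HMM_L)
```

The vocabulary restricts the search to permitted model vector sequences, so only two paths through the trellis are evaluated in this example.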
The speaker-dependent speech recognition SD which is also provided can take place on the basis of “dynamic time warping” DTW or neural networks, for example, i.e. correlation-based methods or sample-comparison methods, or other methods known to a person skilled in the art.
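Dynamic time warping compares a spoken feature vector sequence with stored reference sequences while tolerating differences in speaking rate. The following minimal sketch uses the classic recurrence with Euclidean local distances; the template names and the one-dimensional feature values are assumptions for illustration.

```python
import math


def dtw_distance(seq_a, seq_b):
    """Accumulated distance of the best time alignment of two sequences."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    # d[i][j]: best cost of aligning the first i vectors of seq_a
    # with the first j vectors of seq_b.
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(seq_a[i - 1], seq_b[j - 1])
            d[i][j] = local + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


# Hypothetical templates recorded during say-in.
templates = {
    "anna": [(0.0,), (1.0,), (2.0,)],
    "bob": [(2.0,), (1.0,), (0.0,)],
}

# The same name spoken more slowly: the first sound lasts two frames.
utterance = [(0.0,), (0.0,), (1.0,), (2.0,)]
recognized = min(templates, key=lambda name: dtw_distance(utterance, templates[name]))
```

The stretched utterance still matches its template with zero accumulated distance, since the warping absorbs the repeated frame.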
In this case, it is disadvantageous that speaker-dependent and speaker-independent speech recognition cannot be mixed in the system shown in FIG. 1, i.e. it must be known in advance of the speech recognition whether speaker-dependent or speaker-independent speech recognition is taking place.
In order to allow the mixing, the creation of an additional speaker-dependent vocabulary VOC-SD-L1 has been proposed in the case of speaker-independent speech recognition HMM-SI. This is shown in FIG. 2 in the context of the training or say-in in such a system. Blocks which are already present in FIG. 1 have the same reference characters in FIG. 2.
In this case, on the basis of the dimensionally reduced feature vector F_S, a distance calculation D is performed in relation to model vectors which are present in a language resource HMM-L1. In a conversion step D2I, this distance is converted to an index or assignment information. This assignment information also represents the vocabulary VOC-SD-L1, which in this case is speaker-dependent.
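The conversion D2I can be illustrated by the following sketch, which assumes that the assignment information is simply the index of the nearest model vector for each feature vector. The source does not specify the conversion in this detail; the function names and all values are hypothetical.

```python
import math


def distances_to_index(feature, model_vectors):
    """D2I sketch: compute the distance to every model vector and return
    the index of the nearest one."""
    distances = [math.dist(feature, m) for m in model_vectors]
    return distances.index(min(distances))


def say_in(features, model_vectors):
    """Convert a spoken feature vector sequence into assignment information."""
    return [distances_to_index(f, model_vectors) for f in features]


# Hypothetical language resource for language L1.
HMM_L1 = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# The speaker-dependent vocabulary stores indices, not the vectors themselves.
VOC_SD_L1 = {}
VOC_SD_L1["anna"] = say_in([(0.1, 0.0), (0.9, 0.2), (0.1, 0.8)], HMM_L1)
```

Since the stored indices refer to the model vectors of HMM-L1, the resulting entry is only meaningful together with that particular language resource.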
The system which was shown in training in FIG. 2 is now shown in operation, i.e. during speech recognition, in FIG. 3. Identical blocks are given the same reference characters again in this context.
Since this speaker-dependent vocabulary is also created on the basis of the language resource HMM-L1, i.e. by distance calculation D relative to the model vectors contained therein, speaker-dependent vocabulary VOC-SD-L1 and speaker-independent vocabulary VOC-SI-L1 can be mixed in the language L1 in the case of the system shown in FIG. 3, thereby eliminating the mixing problem which occurred in the case of the system in FIG. 1.
This has the disadvantage that, as a result of using the language resource HMM-L1 for creating the vocabulary, the system is now language-dependent, since the language resource HMM-L1 represents a specific language or language environment in relation to which the assignment information is stored.