As is known, voice-driven applications and complex voice services are based on automatic speech recognition (ASR) systems, which are designed to convert a digital representation of a voice signal conveying speech into a textual sequence of words that hypothesizes the lexical content of the voice signal. The automatic recognition process uses stochastic acoustic models; hence, the result produced, in terms of the sequence of recognized words, may be affected by a non-zero residual error rate. Furthermore, the domain of the formulations recognized by an automatic speech recognition system is in any case tied to a limited vocabulary, formalized by means of a statistical model of the language or by context-free grammars, which can be traced back to finite-state automata (this is the case, for example, of a grammar that describes the way of pronouncing a date or a time).
The most advanced automatic speech recognition systems also enable recognition within flexible vocabularies, which are defined by the user and described by means of appropriate formalisms. To achieve this result, the voice models used for recognition are made up of elementary acoustic-phonetic units (APUs), the sequential composition of which enables representation of any word of a given language.
The mathematical tools used for describing the temporal evolution of the voice are the so-called Hidden Markov Models (HMMs), and each elementary acoustic-phonetic unit is represented by a Hidden Markov Model, which is formed by states that describe the temporal evolution thereof. The words to be recognized, which are described as sequences of elementary acoustic-phonetic units, are obtained by concatenating individual constituent Hidden Markov Models.
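By way of illustration only, the concatenation of unit models into a word model can be sketched as follows. The lexicon, the phoneme symbols, the number of states per unit and the self-loop probability are all hypothetical values chosen for the example and are not taken from any of the cited systems; emission densities are omitted.

```python
# Minimal sketch: a word model obtained by concatenating unit models.
# LEXICON, STATES_PER_UNIT and the 0.5 self-loop value are assumptions.
STATES_PER_UNIT = 3

LEXICON = {
    "date": ["d", "ey", "t"],
    "time": ["t", "ay", "m"],
}


def word_state_sequence(word, lexicon=LEXICON, states_per_unit=STATES_PER_UNIT):
    """Return the ordered state labels of the word model.

    Each acoustic-phonetic unit contributes `states_per_unit` states, and the
    word model is simply the units' states placed one after the other.
    """
    return [
        f"{unit}_{index}"
        for unit in lexicon[word]
        for index in range(states_per_unit)
    ]


def left_to_right_transitions(n_states, self_loop=0.5):
    """Build a left-to-right transition matrix over the concatenated states.

    Every state either stays where it is or advances to the next state;
    the last state keeps all its probability mass on itself in this toy model.
    """
    matrix = [[0.0] * n_states for _ in range(n_states)]
    for state in range(n_states):
        if state + 1 < n_states:
            matrix[state][state] = self_loop
            matrix[state][state + 1] = 1.0 - self_loop
        else:
            matrix[state][state] = 1.0
    return matrix


states = word_state_sequence("date")
transitions = left_to_right_transitions(len(states))
```

In this sketch a three-unit word yields nine states, and the transition structure only allows the temporal evolution to proceed from one unit into the next.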
In addition to describing the temporal evolution of the speech, Hidden Markov Models enable generation of the likelihoods of emission of the acoustic states that form them, given the observation vectors that convey the information of the voice signal. The sequence of the probabilities, together with their temporal evolution, enables the recognition result to be obtained. For a more detailed description of the theory, algorithms and implementation of Hidden Markov Models, reference may be made to Huang X., Acero A., and Hon H. W., Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, Prentice Hall, Chapter 8, pages 377-413, 2001.
The pronunciation of words outside the vocabulary or of formulations that are not covered consequently generates recognition errors. Therefore, automatic speech recognition systems also provide a measure of the reliability of the recognized words, in particular a reliability indicator between 0 and 1, which is known in the literature as the confidence measure and can be applied to the individual recognized words and/or to their sequence. In the event of a recognition error, the confidence measure should assume low values, and in any case values lower than those assumed in the absence of errors. A threshold on the measured confidence values can be fixed so that insufficiently reliable results are not proposed.
A technique that is widely used to compute the confidence measure consists in normalizing either the so-called a posteriori likelihoods, which are quantities derived from the emission likelihoods, or the emission likelihoods directly, that contribute to the computation of the confidence measure. The comparison of two quantities, i.e., the result obtained by applying the recognition constraints and the result obtained by relaxing them, provides information useful for determining the confidence. In fact, if the two quantities have comparable values, the introduction of the recognition constraints has not produced any particular distortion with respect to what would have happened without recognition constraints. The recognition result may therefore be considered reliable, and its confidence should have high values, close to its upper limit. When, instead, the constrained result is considerably worse than the unconstrained result, it may be inferred that the recognition is not reliable, in so far as the automatic speech recognition system would have produced a different result had the constraints not been applied. In this case, the confidence measure ought to produce low values, close to its lower limit.
An example of this technique is proposed in Gillick M. et al., A Probabilistic Approach to Confidence Estimation and Evaluation, Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing, Munich, Germany, pp. 879-882 (May 1997), where the difference between quantities known as the acoustic score and the best score is adopted; the two terms are obtained by averaging, respectively, the acoustic score (with constraints) and the best score (without constraints) produced for each frame by the acoustic hidden Markov models over the time interval corresponding to the words.
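The comparison between averaged constrained and unconstrained scores may be sketched as follows. This is an illustration under stated assumptions, not the method of the cited paper: the per-frame scores are taken to be log-likelihoods, and the exponential mapping of the gap onto the interval between 0 and 1, together with its `scale` parameter, is a hypothetical choice made only so that the example returns a value in the usual confidence range.

```python
import math


def average(values):
    """Arithmetic mean of a non-empty sequence of per-frame scores."""
    return sum(values) / len(values)


def score_gap(constrained_scores, unconstrained_scores):
    """Average unconstrained (best) score minus average constrained (acoustic) score.

    Both sequences must cover the same frames of the word's time interval.
    """
    if not constrained_scores or len(constrained_scores) != len(unconstrained_scores):
        raise ValueError("score sequences must be non-empty and of equal length")
    return average(unconstrained_scores) - average(constrained_scores)


def confidence_from_gap(gap, scale=1.0):
    """Map a score gap to (0, 1]: assumed mapping, not taken from the source.

    A gap of zero (constraints caused no distortion) gives 1.0;
    larger gaps give values that decay towards 0.
    """
    return math.exp(-scale * max(gap, 0.0))


# Toy per-frame log-likelihoods for one recognized word (made-up numbers).
constrained = [-5.0, -6.0, -7.0]
unconstrained = [-4.0, -5.0, -6.0]
gap = score_gap(constrained, unconstrained)
confidence = confidence_from_gap(gap)
```

When the two score sequences coincide the gap is zero and the sketch returns its upper limit, which mirrors the reasoning above that comparable values indicate a reliable result.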
In PCT/EP/0453718, filed on Dec. 28, 2004 in the name of the Applicant, a confidence measure is instead proposed that is based upon differential contributions computed for each frame of an analysis window as the difference between an unconstrained acoustic score and a constrained acoustic score, and averaged over the entire recognition interval. This makes it possible to act on each individual differential contribution of the summation by applying thereto a respective normalization function, which makes the confidence measure homogeneous in terms of rejection capability and invariant with respect to language, vocabulary and grammar and, in general, to the recognition constraints. This greatly facilitates the development of applications in their initial stages, since they do not require any specific calibration for each individual recognition session. The normalization function applied to the individual differential terms is constituted by a family of cumulative distributions, one for each differential contribution to the confidence measure. Each function can be estimated in a simple way from a set of training data and is specific to each state of the elementary acoustic-phonetic unit. Hence, the proposed solution does not require heuristic considerations or a priori assumptions and makes it possible to obtain all the quantities necessary for deriving the differential confidence measure directly from the training data.
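A possible reading of this per-frame normalization is sketched below. It is an assumption-laden illustration rather than the formulation of the cited application: each state's cumulative distribution is estimated empirically from training differences, and the normalized contribution is taken as one minus that distribution, so that small differences yield values near 1. The state labels and all numeric values are hypothetical.

```python
from bisect import bisect_left


def make_normalizer(training_differences):
    """Return a state-specific normalization function estimated from training data.

    The returned function gives the fraction of training differences that are
    greater than or equal to the observed one (one minus an empirical
    cumulative distribution). This choice of direction is an assumption.
    """
    ordered = sorted(training_differences)
    count = len(ordered)

    def normalize(difference):
        # bisect_left counts the training values strictly below `difference`.
        return 1.0 - bisect_left(ordered, difference) / count

    return normalize


def differential_confidence(frames, normalizers):
    """Average the normalized per-frame differential contributions.

    `frames` holds (state_id, constrained_score, unconstrained_score) tuples;
    `normalizers` maps each state_id to its normalization function.
    """
    contributions = [
        normalizers[state](unconstrained - constrained)
        for state, constrained, unconstrained in frames
    ]
    return sum(contributions) / len(contributions)


# Hypothetical training differences for two states of one unit.
normalizers = {
    "a_0": make_normalizer([0.0, 0.5, 1.0, 2.0]),
    "a_1": make_normalizer([0.0, 1.0, 3.0, 4.0]),
}

reliable = differential_confidence(
    [("a_0", -5.0, -5.0), ("a_1", -6.0, -5.0)], normalizers
)
unreliable = differential_confidence(
    [("a_0", -8.0, -5.0), ("a_1", -10.0, -5.0)], normalizers
)
```

Because every contribution is normalized by a distribution learned for its own state, the averaged value stays between 0 and 1 regardless of the scale of the raw scores, which is the property the text attributes to the normalization.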
One of the main problems experienced by the designers of voice-driven applications or voice services based on automatic speech recognition is the correct prediction of the behaviour of the users, a problem that is typically addressed by creating grammars targeted at acquiring information from users without, however, using excessively large vocabularies or creating excessively complex graphs. The risk, in fact, is that improved coverage of formulations from marginal users is paid for with more recognition errors made by the system on canonical formulations, owing to the increase in complexity. On the other hand, for directory assistance or call routing services, it is extremely difficult to predict the manner in which the users will formulate their requests.
A possible solution to this problem, which does not make use of automatic data analysis, consists in producing a first version of the grammars using test data, fielding the service, which will consequently have sub-optimal performance, and, at the same time, collecting data on its use in the field, which typically consist of audio files with the users' requests. Human operators then label the dialogue turns involved in system failures and, once a substantial amount of data has been labeled, statistics can be generated on the causes of failure: recognition errors, possible errors due to system-level reasons, and cases in which the user requests are not catered for by the system are among these. For the last type of error, when frequent, it is possible to extend the grammars utilized, so as to increase coverage, or to use other, more sophisticated strategies. The application designers, for example, could change the voice prompts related to one or more dialogue turns in order to help the users formulate their requests. This solution is, however, extremely expensive, because the data must be analyzed by human operators to establish the exact content of each user request.
Also known in the art are automatic analysis systems based on the use of data collected in the field, relative to interactions with real users, for improving the performance of the grammars and language models utilized in a voice service. In particular, the data are automatically acquired by the recognition systems and, for reasons of excessively high cost, are not checked by operators, with the risk that they contain recognition errors. For example, in U.S. Pat. No. 6,499,011 the recognition results, i.e., the first N-Best hypotheses, with N>1, are used to make adjustments to the language models in order to improve applications in which the amount of material used to train the initial language models is rather scarce. In this case, the performance improvement is focused on improving the modelling of already predicted formulations.
A technology that has been tested for an automatic directory assistance service is described in U.S. Pat. No. 6,185,528 and is based on discrete words, although an item in the recognition vocabulary could also be a sequence of words and not necessarily a single word. With regard to the business subscriber directories, great variability in the way users of the service express their requests has been observed. As the contents of the database for these users are not sufficient to extract information on the linguistic formulations utilized by the callers, it is necessary to perform a complex task to derive the possible pronunciation variants for each database record.
Within this context, an automatic learning system has been developed that utilizes the data collected in the field to determine the linguistic formulations utilized most frequently by the users and not contemplated by the system implementing the automatic directory assistance service. The information regarding calls for which the automatic directory assistance system is not able to satisfy the user's requests has been considered. These calls are handed over to a human operator, who talks with the user to provide the requested number. The available data for each call are represented by saved audio files, which contain the dialogues between the user and the automatic directory assistance system, the unconstrained phonetic transcription of each audio file (the unconstrained phonetic transcription represents the most likely sequence of phonemes and, although imprecise, more often than not represents fairly well what the user has pronounced) and information regarding the telephone number supplied by the human operator. From these pieces of information, those regarding the most frequently requested telephone numbers have been selected.
The Applicant has noted that, given an extremely large set of requests for the same number, there is a high probability of obtaining phonetic strings that are similar to each other. The concept of distance between two strings of phonemes can be introduced by performing a Viterbi alignment, for a detailed description of which reference may be made to the above-referenced Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, chapter 8, and by using the probabilities of deletion, insertion or substitution of phonemes, which are trained on controlled data by aligning the unconstrained phonetic transcriptions with the corresponding correct phonetic transcriptions. The set of phonetic transcriptions, for each frequently requested telephone number, is subsequently clustered into similar subsets using a hierarchical neighbor-search algorithm based on the distance between the phonetic strings. A set of similar phonetic transcriptions is determined by configuring a threshold for the maximum distance between the phonetic strings forming part of the same cluster. Subsets with few elements or having a large difference in distance between the constituent phonetic strings are discarded. For clusters characterized by high cardinality and low dispersion of the constituent phonetic strings, the central element (representative element), defined as the phonetic string with the lowest sum of distances in relation to the other elements of the set, is selected.
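The distance, clustering and representative-element steps can be illustrated with the following simplified sketch. It departs from the description above in ways that should be noted: fixed insertion, deletion and substitution costs stand in for the trained probabilities, a plain dynamic-programming alignment stands in for the Viterbi alignment, and a greedy grouping that enforces the maximum-distance threshold stands in for the hierarchical neighbor-search algorithm. The phoneme symbols, the threshold and the minimum cluster size are hypothetical.

```python
def phoneme_distance(first, second, substitution_cost=1.0, indel_cost=1.0):
    """Alignment cost between two phoneme sequences (weighted edit distance)."""
    previous = [column * indel_cost for column in range(len(second) + 1)]
    for row, phoneme_a in enumerate(first, start=1):
        current = [row * indel_cost]
        for column, phoneme_b in enumerate(second, start=1):
            match_cost = 0.0 if phoneme_a == phoneme_b else substitution_cost
            current.append(min(
                previous[column] + indel_cost,       # deletion of phoneme_a
                current[column - 1] + indel_cost,    # insertion of phoneme_b
                previous[column - 1] + match_cost,   # match or substitution
            ))
        previous = current
    return previous[-1]


def cluster_transcriptions(transcriptions, max_distance):
    """Greedy grouping: a string joins the first cluster in which it lies
    within `max_distance` of every member, otherwise it starts a new one."""
    clusters = []
    for candidate in transcriptions:
        for members in clusters:
            if all(phoneme_distance(candidate, m) <= max_distance for m in members):
                members.append(candidate)
                break
        else:
            clusters.append([candidate])
    return clusters


def representative(members):
    """Central element: lowest sum of distances to the other members."""
    return min(members, key=lambda s: sum(phoneme_distance(s, o) for o in members))


def learn_variants(transcriptions, max_distance, min_size):
    """Keep only sufficiently large clusters and return their representatives."""
    return [
        representative(members)
        for members in cluster_transcriptions(transcriptions, max_distance)
        if len(members) >= min_size
    ]


# Hypothetical unconstrained transcriptions for one requested number.
requests = [
    ("p", "i", "t", "s", "a"),
    ("p", "i", "t", "s", "a"),
    ("p", "i", "d", "s", "a"),
    ("b", "i", "t", "s", "a"),
    ("o", "t", "e", "l"),
]
variants = learn_variants(requests, max_distance=2.0, min_size=3)
```

In this toy data the four similar strings form one cluster whose representative is the most frequent transcription, while the isolated string forms a cluster that is discarded for having too few elements.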
It is worthwhile observing that when the number of elements in a cluster is sufficiently high, the representative element provides a good phonetic transcription of the requested entry. The entire architecture of the automatic learning system, the results of the trials carried out and the improvements in terms of automation (increase in percentage of calls satisfactorily handled by the automatic directory assistance system) are described in detail in:
Adorno M., P. Laface, C. Popovici, L. Fissore, C. Vair, Towards Automatic Adaptation of the Acoustic Models and of Formulation Variants in a Directory Assistance Application, Proceedings of ISCA ITR-Workshop, pp. 175-178, Sophia-Antipolis (France), 2001; and
Popovici C., P. Laface, M. Adorno, L. Fissore, M. Nigra, C. Vair, Learning New User Formulation in Automatic Directory Assistance, Proceedings of ICASSP, pp. 1448-1451, Orlando (USA), 2002.
However, the automatic learning developed in this context requires voice service data, such as user confirmations and telephone numbers provided by the operator for handover calls, to identify the individual phonetic strings that will subsequently be utilized. In addition, the representative phonetic strings that are found are added as items to a discrete word vocabulary.
In this connection, the Applicant has noted that the commonly used continuous speech recognition systems, which are based on grammars or language models, work on the entire phrase that is pronounced and do not identify only the portion of the users' formulations that is outside the recognition domain. The local identification of words that are not contemplated by the recognition grammars or language models within a repetition would be particularly advantageous: firstly, it would allow the results of the phonetic learning algorithms to be exploited even when the amount of data is not particularly abundant; secondly, it would allow uncovered words to be detected even if other, covered words within the sequence change (as long as these are permitted by the grammatical constraints), whereas a system that works on discrete words is not capable of working as efficiently in such cases.