The present invention relates to a learning apparatus and a learning method, and particularly to a learning apparatus, a learning method, a recognition apparatus, a recognition method, and a recording medium which enable recognition of a signal including a nonlinear time component, such as speech or the like, without considering the time component.
Also, the present invention relates particularly to a learning apparatus, a learning method, a recognition apparatus, a recognition method, and a recording medium which are capable of improving a recognition rate by providing models capable of sufficiently expressing, for example, a transition of a state or the like.
Further, the present invention relates to a learning apparatus, a learning method, a recognition apparatus, a recognition method, and a recording medium which are capable of dealing with parameters concerning speech and images by using equal weights, for example, where speech recognition is carried out based on a speech and an image of lips when the speech is pronounced.
For example, with respect to speech, the length of a word extends or contracts nonlinearly every time it is pronounced, even if one person pronounces the same word twice. Therefore, when recognizing pronunciation, it is necessary to cope with such nonlinear extension or contraction of length. For example, a DP (Dynamic Programming) matching method is known as a method in which matching to a standard pattern is carried out while DTW (Dynamic Time Warping), that is, nonlinear extension or contraction of the time axis, is performed.
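The DP matching described above can be illustrated with a minimal sketch. It assumes one-dimensional feature sequences and an absolute-difference local cost; the function name `dtw_distance` and the sample sequences are hypothetical and are not taken from the present specification.

```python
def dtw_distance(a, b):
    """Cumulative alignment cost between two 1-D sequences under time warping."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] holds the best cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local distance
            # Allow a step in a, a step in b, or a step in both.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


template = [1.0, 2.0, 3.0, 2.0, 1.0]
stretched = [1.0, 1.0, 2.0, 3.0, 3.0, 2.0, 1.0]  # same shape, longer duration
print(dtw_distance(template, stretched))  # prints 0.0
```

Because the second sequence only repeats frames of the first, the warping absorbs the difference in length and the cost is zero. A low cost does not by itself show that the aligned frames correspond to the same phonemes.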
However, even if the time-axis extension or contraction is carried out by the DP matching method, there is no guarantee that phonemes of an inputted speech properly correspond to phonemes of a standard pattern. If the phonemes do not correspond properly, a recognition error occurs.
Meanwhile, if matching can be performed without considering nonlinear time components of speech, recognition errors due to time-axis extension or contraction as described above can be prevented.
Also, as an algorithm for recognizing speech, an HMM (Hidden Markov Model) method has been conventionally known. In a discrete HMM method, learning is carried out in advance so that models corresponding to recognition targets are obtained. From each model, the probability (observation probability) that an input series corresponding to an inputted speech is observed is calculated on the basis of a state transition probability given to the model (the probability that a state transits to another state, which normally includes transition to itself) and an output probability (the probability that a certain code (label or symbol) is outputted when a state transition occurs). Further, the inputted speech is recognized based on the observation probability.
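As a rough illustration of how an observation probability may be computed for a discrete HMM, the following sketch uses the standard forward algorithm. It assumes, for simplicity, that each code is outputted by the state entered rather than by the transition itself; the function name and the two-state left-to-right model values are hypothetical.

```python
def observation_probability(symbols, initial, transition, output):
    """Forward algorithm: probability that the model produces the symbol series."""
    n_states = len(initial)
    # alpha[s] = probability of the symbols seen so far, ending in state s.
    alpha = [initial[s] * output[s][symbols[0]] for s in range(n_states)]
    for sym in symbols[1:]:
        alpha = [
            sum(alpha[p] * transition[p][s] for p in range(n_states)) * output[s][sym]
            for s in range(n_states)
        ]
    return sum(alpha)


# Hypothetical left-to-right model: each state moves to itself or to its right neighbor.
initial = [1.0, 0.0]
transition = [[0.5, 0.5], [0.0, 1.0]]
output = [[0.9, 0.1], [0.2, 0.8]]  # output[state][code]

print(observation_probability([0, 1], initial, transition, output))
```

In recognition, this value would be computed for each model and the model giving the largest value would be selected.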
Meanwhile, with respect to learning in the HMM method, a manufacturer of a system determines the number of states and forms of state transitions (e.g., a limitation to state transition by which the transition from a state to another state is limited to either itself or a right adjacent state), and models thereof are used to carry out the learning.
However, the models determined, as it were, by the system manufacturer do not always comply with the number of states or forms of state transition which recognition targets originally have. Further, if the models do not comply with the number of states or forms of state transition which observation targets originally have, several models cannot correctly express steady states or transiting states, and as a result, the recognition rate deteriorates.
Further, for example, recognition of a speech is achieved by extracting a characteristic parameter from the speech and comparing the characteristic parameter with a standard parameter (standard pattern) as a reference.
Meanwhile, if recognition of a speech is carried out based only on the speech, improvement of the recognition rate is limited to some extent. Hence, it is possible to consider a method in which the recognition rate is improved by using an image obtained by capturing the lips of a speaker who is speaking, in addition to the speech itself.
In this case, a characteristic parameter extracted from the speech and a characteristic parameter extracted from the image of lips are integrated (combined) with each other, into an integrated parameter. It is considered that this integrated parameter can be used to carry out recognition of the speech.
However, if a characteristic parameter of a speech and a characteristic parameter of an image are simply integrated in parallel (or simply joined with each other) to achieve recognition, the recognition is influenced strongly by either the speech or the image (i.e., one of the speech and the image may be weighted more than the other), thereby hindering improvement of the recognition rate.
An advantage of the present invention is, therefore, to achieve improvements of the recognition rate by enabling recognition without considering a time component of a signal.
Another advantage of the present invention is to achieve improvements of the recognition rate of speech and the like by providing a model which can sufficiently express the number of states and the like which a recognition target originally has.
A further advantage of the present invention is to achieve improvements of the recognition performance by making it possible to deal with characteristic parameters of different inputs such as a speech and an image, with equal weights.
To this end, a learning apparatus according to an embodiment of the present invention is provided. The learning apparatus includes calculation means for calculating an expectation degree of each identifier, from a series of identifiers indicating code vectors, obtained from a time series of learning data.
A learning method according to an embodiment of the present invention calculates an expectation degree of each identifier, from a series of identifiers indicating code vectors, obtained from a time series of learning data.
A recording medium according to an embodiment of the present invention records a program having a calculation step of calculating an expectation degree of each identifier, from a series of identifiers indicating code vectors, obtained from a time series of learning data.
A recognition apparatus according to the present invention includes vector quantization means for vector-quantizing input data and for outputting a series of identifiers indicating code vectors. Properness detection means are provided for obtaining properness as to whether or not the input data corresponds to the recognition target, with use of the series of identifiers obtained from the input data and expectation degrees of identifiers. Recognition means are provided for recognizing whether or not the input data corresponds to the recognition target, based on the properness.
A recognition method according to the present invention is characterized in that: input data is vector-quantized, thereby to output a series of identifiers indicating code vectors; properness as to whether or not the input data corresponds to a recognition target is obtained with use of the series of identifiers obtained from the input data and expectation degrees of the identifiers at which the identifiers are expected to be observed; and whether or not the input data corresponds to the recognition target is recognized, based on the properness.
A recording medium according to the present invention is characterized by recording a program including: a vector-quantization step of vector-quantizing the time series of input data pieces, thereby to output a series of identifiers indicating code vectors; a properness detection step of obtaining properness as to whether or not the time series of input data pieces corresponds to the recognition target, with use of the series of identifiers obtained from the input data and expectation degrees of the identifiers at which the identifiers are expected to be observed; and a recognition step of recognizing whether or not the time series of input data pieces corresponds to the recognition target, based on the properness.
It should be appreciated that the term "properness" as used throughout the text means the same as and/or is interchangeable with the term "measure of correctness" or other like term or terms.
In a learning apparatus, a learning method, and a recording medium according to the present invention, an expectation degree is calculated from a series of identifiers obtained from a time series of learning data pieces.
In a recognition apparatus, a recognition method, and a recording medium according to the present invention, input data is vector-quantized thereby to output a series of identifiers indicating code vectors, and properness as to whether or not the input data corresponds to a recognition target is obtained with use of the series of identifiers obtained from the input data and expectation degrees of the identifiers at which the identifiers are expected to be observed. Further, whether or not the input data corresponds to the recognition target is recognized, based on the properness.
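The flow summarized above (vector quantization, expectation degrees, properness) can be sketched as follows. The summary does not define how the expectation degree or the properness is computed, so this sketch assumes, purely for illustration, that the expectation degree of an identifier is its relative frequency in the learning series and that the properness is the mean log expectation degree of the observed identifiers. All function names, the floor value, and the sample data are hypothetical.

```python
import math
from collections import Counter


def quantize(vectors, codebook):
    """Map each input vector to the identifier (index) of its nearest code vector."""
    def sq_dist(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))
    return [min(range(len(codebook)), key=lambda k: sq_dist(v, codebook[k])) for v in vectors]


def expectation_degrees(label_series_list, n_codes, floor=1e-3):
    """Assumed definition: relative frequency of each identifier, floored to avoid log(0)."""
    counts = Counter(label for series in label_series_list for label in series)
    total = sum(counts.values())
    return [max(counts[k] / total, floor) for k in range(n_codes)]


def properness(label_series, degrees):
    """Assumed score: mean log expectation degree; it ignores the order of identifiers."""
    return sum(math.log(degrees[k]) for k in label_series) / len(label_series)


codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
learning = [[0, 0, 1], [0, 1, 1]]  # identifier series from learning data
degrees = expectation_degrees(learning, n_codes=len(codebook))

labels = quantize([[0.1, 0.0], [0.9, 1.0]], codebook)
print(labels, properness(labels, degrees))
```

Under these assumptions the score depends only on which identifiers appear, not on when they appear, which is one way to avoid dependence on the time component.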
A learning apparatus according to the present invention includes distance calculation means for calculating a distance between a standard series and a code vector and for outputting transition of the distance.
A learning method according to the present invention includes calculating a distance between a standard series and a code vector and outputting transition of the distance.
A recording medium according to the present invention records a program including a distance calculation step of calculating a distance between a standard series and a code vector and of outputting transition of the distance.
A recognition apparatus according to the present invention includes: storage means which store a distance transition model corresponding to at least one recognition target and expressing transition of a distance between a standard series and each code vector of a code book; vector quantization means for vector-quantizing a time series of input data, with use of the code book and for outputting a series of identifiers corresponding to the code vectors; and recognition means for recognizing whether or not the input data corresponds to at least one recognition target, based on the distance transition model and the series of identifiers with respect to the input data.
A recognition method according to the present invention is characterized in that a time series of input data is vector-quantized with use of a code book thereby to output a series of identifiers corresponding to code vectors, and whether or not the input data corresponds to at least one recognition target is recognized, based on a distance transition model expressing transition of a distance between a standard series and a code vector and corresponding to at least one recognition target and a series of identifiers with respect to the input data.
A recording medium according to the present invention records a program including: a vector quantization step of vector-quantizing a time series of input data with use of a code book and of outputting a series of identifiers corresponding to code vectors; and a recognition step of recognizing whether or not the input data corresponds to at least one recognition target, based on a distance transition model expressing transition of a distance between a standard series and a code vector and corresponding to at least one recognition target and a series of identifiers with respect to the input data.
A recognition apparatus according to the present invention includes: integration means for integrating a time series of first input data and a time series of second input data, thereby to output a time series of integrated data; and recognition means for recognizing whether or not the time series of first or second input data corresponds to at least one recognition target, based on transition of a distance obtained from a vector based on the time series of integrated data.
A recognition method according to the present invention is characterized in that a time series of first input data and a time series of second input data are integrated thereby to output a time series of integrated data, and whether or not the time series of first or second input data corresponds to at least one recognition target is recognized, based on transition of a distance obtained from a vector based on the time series of integrated data.
A recording medium according to the present invention records a program including: an integration step of integrating a time series of first input data and a time series of second input data, thereby to output a time series of integrated data; and a recognition step of recognizing whether or not the time series of first or second input data corresponds to at least one recognition target, based on transition of a distance obtained from a vector based on the time series of integrated data.
In a learning apparatus, a learning method, and a recording medium according to the present invention, a distance between a standard parameter and a code vector is calculated and transition of the distance is outputted.
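A minimal sketch of the distance calculation described above is given below. It assumes a Euclidean distance and a standard series given as one vector per time step; the function name `distance_transitions` and the sample values are hypothetical.

```python
import math


def distance_transitions(standard_series, codebook):
    """For each code vector, list its distance to every frame of the standard series."""
    return [[math.dist(code, frame) for frame in standard_series] for code in codebook]


standard_series = [(0.0, 0.0), (3.0, 4.0)]  # one vector per time step
codebook = [(0.0, 0.0), (3.0, 4.0)]

# Row k is the transition of the distance for code vector k over time.
print(distance_transitions(standard_series, codebook))  # prints [[0.0, 5.0], [5.0, 0.0]]
```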
In a recognition apparatus, a recognition method, and a recording medium according to the present invention, a time series of input data is vector-quantized with use of a code book, and a series of identifiers corresponding to code vectors is outputted. Further, whether or not the input data corresponds to at least one recognition target is recognized, based on a distance transition model expressing transition of a distance between a standard series and a code vector and corresponding to at least one recognition target and a series of identifiers with respect to the input data.
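One way such recognition could work is sketched below. It assumes that each distance transition model is a table `model[k][t]` giving the distance between code vector `k` and the standard series at time `t`, that the input identifier series has already been time-normalized to the model length, and that the target with the smallest accumulated distance is chosen. These assumptions, the function names, and the model values are hypothetical.

```python
def accumulated_distance(label_series, model):
    """Sum the modeled distance of the identifier observed at each time step."""
    return sum(model[k][t] for t, k in enumerate(label_series))


def recognize(label_series, models):
    """Return the name of the recognition target with the smallest accumulated distance."""
    return min(models, key=lambda name: accumulated_distance(label_series, models[name]))


# Hypothetical models for two recognition targets: model[k][t].
models = {
    "A": [[0.0, 5.0], [5.0, 0.0]],
    "B": [[5.0, 0.0], [0.0, 5.0]],
}

print(recognize([0, 1], models))  # prints A
```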
In a recognition apparatus, a recognition method, and a recording medium according to the present invention, a time series of first data and a time series of second data are integrated and a time series of integrated data is outputted. Further, whether or not the first or second data corresponds to at least one recognition target is recognized, based on transition of a distance obtained from a vector based on the time series of integrated data.
A learning apparatus according to the present invention includes: characteristic parameter normalization means for normalizing each of a plurality of characteristic parameters, based on a normalization coefficient; distance calculation means for calculating a distance to a standard parameter, with respect to each of the plurality of characteristic parameters normalized; and change means for changing the normalization coefficient such that a distance with respect to an arbitrary one of the plurality of characteristic parameters and a distance with respect to another arbitrary one of the plurality of characteristic parameters are equal to each other.
A learning method according to the present invention is characterized in that: each of a plurality of characteristic parameters is normalized, based on a normalization coefficient; a distance to a standard parameter is calculated with respect to each of the plurality of characteristic parameters normalized; and the normalization coefficient is changed such that a distance with respect to an arbitrary one of the plurality of characteristic parameters and a distance with respect to another arbitrary one of the plurality of characteristic parameters are equal to each other.
A recording medium according to the present invention records a program including: a characteristic parameter normalization step of normalizing each of a plurality of characteristic parameters, based on a normalization coefficient; a distance calculation step of calculating a distance to a standard parameter, with respect to each of the plurality of characteristic parameters normalized; and a change step of changing the normalization coefficient such that a distance with respect to an arbitrary one of the plurality of characteristic parameters and a distance with respect to another arbitrary one of the plurality of characteristic parameters are equal to each other.
A recognition apparatus according to the present invention includes: normalization means for normalizing a characteristic parameter of each of a plurality of input data pieces; integration means for integrating a plurality of normalized characteristic parameters into an integrated parameter; and recognition means for recognizing whether or not one or more of the plurality of input data pieces correspond to a recognition target, based on the integrated parameter.
A recognition method according to the present invention is characterized in that: a characteristic parameter of each of a plurality of input data pieces is normalized; a plurality of normalized characteristic parameters are integrated into an integrated parameter; and whether or not one or more of the plurality of input data pieces correspond to a recognition target is recognized, based on the integrated parameter.
A recording medium according to the present invention records a program including: a detection step of detecting a characteristic parameter with respect to each of a plurality of input data pieces; a normalization step of normalizing a characteristic parameter of each of a plurality of input data pieces; an integration step of integrating a plurality of normalized characteristic parameters into an integrated parameter; and a recognition step of recognizing whether or not one or more of the plurality of input data pieces correspond to a recognition target, based on the integrated parameter.
In a learning apparatus, a learning method, and a recording medium according to the present invention, each of a plurality of characteristic parameters is normalized, based on a normalization coefficient, and a distance to a standard parameter is calculated with respect to each of the plurality of characteristic parameters normalized. Further, the normalization coefficient is changed such that a distance with respect to an arbitrary one of the plurality of characteristic parameters and a distance with respect to another arbitrary one of the plurality of characteristic parameters are equal to each other.
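The idea of changing a normalization coefficient until two distances become equal can be sketched as follows. The sketch assumes two kinds of characteristic parameters (speech and image), a Euclidean distance, and a single scalar coefficient applied to the image parameter and to its standard parameter. Under those assumptions the distance scales linearly with the coefficient, so the coefficient can be obtained in one step rather than iteratively. The function names and sample values are hypothetical.

```python
import math


def mean_distance(samples, standard, coeff=1.0):
    """Mean Euclidean distance to the standard parameter after scaling by coeff."""
    scaled_std = [coeff * x for x in standard]
    return sum(
        math.dist([coeff * x for x in sample], scaled_std) for sample in samples
    ) / len(samples)


def balance_coefficient(speech_samples, speech_std, image_samples, image_std):
    """Coefficient for the image parameter that equalizes the two mean distances."""
    d_speech = mean_distance(speech_samples, speech_std)
    d_image = mean_distance(image_samples, image_std)
    return d_speech / d_image


speech_samples, speech_std = [[1.0], [3.0]], [2.0]
image_samples, image_std = [[10.0], [30.0]], [20.0]

coeff = balance_coefficient(speech_samples, speech_std, image_samples, image_std)
print(coeff)  # prints 0.1
```

With this coefficient applied, neither parameter dominates the distance merely because of its numeric scale.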
In a recognition apparatus, a recognition method, and a recording medium according to the present invention, a characteristic parameter of each of a plurality of input data pieces is normalized, and a plurality of normalized characteristic parameters are integrated into an integrated parameter. Further, whether or not one or more of the plurality of input data pieces correspond to a recognition target is recognized, based on the integrated parameter.
Additional features and advantages of the present invention are described in, and will be apparent from, the following Detailed Description of the Invention and the Figures.