In general, when editing recorded voice data, an editor specifies and cuts editing points while listening to a voice that is played back.
In Patent Document 5, when generating a voice card (which is generated by recording voice on a card and attaching photos to the card), an editor displays the voice in an editing window on a computer screen using an advanced voice editing program, and uses a tool such as a mouse to delete, cut, or combine parts of the voice.
In addition, a voice recognition device uses a voice standard pattern (hereinafter referred to as ‘standard pattern’) as a voice recognition dictionary to recognize the voice. However, the standard patterns need to be extended to increase the number of words that can be recognized. In this case, part of an existing standard pattern may be deleted or cut to generate a new standard pattern.
The editing of a standard pattern serving as a voice recognition dictionary in a voice recognition device is described below.
The voice recognition device divides the target voice into predetermined time intervals (frames) and extracts a multi-dimensional feature parameter (cepstrum) indicating the features of the voice waveform in each frame. It then compares the time series pattern of the feature parameters with the standard patterns accumulated in the voice recognition device (each a time series pattern of the feature parameters of a word, the basic unit of voice recognition), determines the similarity between them, and outputs the words with the highest similarity as the recognition results.
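The comparison step described above can be illustrated with a minimal sketch. It assumes that each frame has already been reduced to a feature vector and that similarity is measured by a dynamic time warping (DTW) distance, where a smaller distance means a higher similarity. DTW is an assumption chosen for illustration; the documents cited here do not specify the similarity measure. The names `dtw_distance` and `recognize` are hypothetical.

```python
import math


def dtw_distance(seq_a, seq_b):
    """Dynamic time warping distance between two sequences of feature vectors.

    Assumption: DTW stands in for the unspecified similarity measure;
    a smaller distance corresponds to a higher similarity.
    """
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    # cost[i][j] = best accumulated distance aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(seq_a[i - 1], seq_b[j - 1])  # per-frame distance
            cost[i][j] = local + min(cost[i - 1][j],      # stretch seq_b
                                     cost[i][j - 1],      # stretch seq_a
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def recognize(features, dictionary):
    """Return the word whose standard pattern is closest to the input features.

    `dictionary` maps each word to its standard pattern (a list of vectors).
    """
    return min(dictionary, key=lambda word: dtw_distance(features, dictionary[word]))
```

For example, with a toy dictionary `{"yes": [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)], "no": [(0.0, 1.0), (0.0, 2.0), (0.0, 3.0)]}`, the input `[(1.0, 0.1), (1.1, 0.0), (2.0, 0.1), (3.0, 0.0)]` is matched to the pattern for "yes" even though it has one more frame than that pattern.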
The cepstrum (feature parameter) is obtained by dividing a voice signal into time frames of about 20 to 40 msec, applying the fast Fourier transform (FFT) to the voice signal in each time frame, taking the logarithm of the resulting amplitude spectrum, and applying the inverse discrete Fourier transform (IDFT) to that log spectrum.
A frequency spectrum of the voice obtained by the FFT includes coarse shape information (envelope information indicating a phonological property) and information on a minute oscillation component (minute structure information indicating the pitch of the sound). In voice recognition, it is important to extract the phonemes of the voice (that is, to estimate which sounds were uttered), whereas the minute structure information is less important. Accordingly, the envelope information and the minute structure information are separated from each other by applying the IDFT to the log spectrum.
After the IDFT, the envelope information is concentrated on the left (low-quefrency) side of the quefrency axis (horizontal axis), while the minute structure information is concentrated on the right (high-quefrency) side. Accordingly, the envelope information and the minute structure information can be separated efficiently. The result of this transform is the cepstrum. For voice analysis, LPC (Linear Predictive Coding) may be used instead of the FFT.
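The cepstrum computation for a single frame can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any cited document: a naive discrete Fourier transform is used in place of an FFT (it gives the same values, only more slowly), the small constant `floor` is an assumption added to avoid taking the logarithm of zero, and the function names and the `cutoff` parameter are hypothetical.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
                  for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def real_cepstrum(frame, floor=1e-10):
    """Cepstrum of one frame: DFT -> log amplitude spectrum -> inverse DFT."""
    spectrum = dft(frame)
    # `floor` (an assumption) keeps the logarithm finite for empty bins.
    log_amplitude = [math.log(max(abs(c), floor)) for c in spectrum]
    # The log amplitude spectrum of a real frame is real and even,
    # so its inverse DFT is real up to rounding error.
    return [c.real for c in dft(log_amplitude, inverse=True)]


def split_cepstrum(cepstrum, cutoff):
    """Split into low-quefrency (envelope) and high-quefrency (fine) parts.

    The cepstrum of a real frame is symmetric, so the mirrored bins at the
    end of the list belong to the low-quefrency part as well.
    """
    n = len(cepstrum)
    envelope = [c if (q < cutoff or q > n - cutoff) else 0.0
                for q, c in enumerate(cepstrum)]
    fine = [c - e for c, e in zip(cepstrum, envelope)]
    return envelope, fine
```

Keeping only the first few low-quefrency coefficients of each frame yields a compact feature vector that describes the spectral envelope while discarding most of the pitch-related fine structure.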
‘Mel’ in ‘Mel-cepstrum’ indicates that the frequency axis is warped onto an approximately logarithmic scale that matches human auditory characteristics before the cepstrum is computed.
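As an illustration of this auditory scaling, the Mel scale as commonly defined warps the frequency axis so that equal steps in Hz at high frequencies correspond to smaller perceptual steps. The sketch below uses one widely used conversion formula, mel = 2595 log10(1 + f / 700). That formula and the function names are assumptions for illustration; the cited documents do not state which conversion is used.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels (one common formula, assumed here)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
```

With this formula, 1000 Hz maps to approximately 1000 mel, and the interval from 7000 Hz to 8000 Hz spans fewer mels than the interval from 1000 Hz to 2000 Hz, which reflects the coarser resolution of human hearing at high frequencies.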
In the invention, ‘cepstrum’ includes ‘Mel-cepstrum’ and is mainly referred to as a ‘feature parameter’. ‘Cepstrum’ or ‘feature parameter’ may also be referred to as ‘voice data.’ ‘Voice data’, as a superordinate concept, includes ‘voice converted into text’ and ‘voice data (waveform data)’ in addition to the feature parameter (cepstrum) of the voice.
The voice recognition device has a plurality of standard patterns as a recognition dictionary (that is, a cepstrum for each word serving as a recognition unit: a feature parameter indicating the features of the sound of the word). The voice recognition device needs to have a large number of standard patterns to increase the number of words that can be recognized.
Patent Document 1 discloses a method of generating new standard patterns used for voice recognition by inputting text of words and automatically generating standard patterns of the words.
Patent Document 2 proposes that a phoneme dictionary be used instead of the standard pattern. It discloses a voice recognition technique in which, in order to generate a recognition word dictionary for unspecified individuals, a feature parameter of a word pronounced by a small number of people is compared with an ordinary standard pattern generated from the voices of a large number of people, and a phoneme dictionary is generated from the comparison results and used for voice recognition.
Patent Document 3 discloses a technique of recognizing voice to control the operation of a mobile terminal equipped with a voice recognition device.
Patent Document 4 discloses a technique of automatically converting input voice to text data in a mobile terminal (PDA, etc.) equipped with a voice recognition device and a text conversion device.
Since the mobile terminal is required to be compact and inexpensive, it is practical to equip it with a relatively inexpensive voice recognition device having a simple recognition dictionary (standard patterns). In this case, a user updates the recognition dictionary of the mobile terminal according to his/her situation (that is, the user customizes the recognition dictionary).
If the procedure or manipulation for customizing the recognition dictionary mounted in the mobile terminal is complicated, the user of the mobile terminal is inconvenienced. Therefore, a technique is required that is easy to use and allows the user to easily extend the recognition dictionary (standard patterns). Further, when, for example, only part of a large amount of voice data is to be edited, inputting the entire voice data again from the beginning is very inefficient, so a technique for conveniently editing the voice data is also required.
Patent Document 1: JP-A-11-190997
Patent Document 2: JP-A-5-188988
Patent Document 3: JP-A-2004-153306
Patent Document 4: JP-A-2003-188948
Patent Document 5: JP-A-2000-276184