This invention relates to a method for the segmentation of speech, in which an acoustic speech signal is converted into N signals, each signal pertaining to a time interval i of N successive time intervals, in which i runs from 1 to N inclusive. The invention also relates to an arrangement for carrying out the method.
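By way of illustration only, the conversion of a sampled speech signal into N signals, one per successive time interval i, might be sketched as follows. The function names, the fixed interval length, the dropping of a trailing partial interval and the use of mean squared amplitude as the per-interval signal are all assumptions of this sketch and are not taken from the description.

```python
def split_into_intervals(samples, interval_length):
    """Split a sampled signal into N successive, non-overlapping intervals.

    Returns a list of N frames; frames[i - 1] pertains to time interval i.
    A trailing partial interval is dropped (an assumption of this sketch).
    """
    n_intervals = len(samples) // interval_length
    return [
        samples[k * interval_length:(k + 1) * interval_length]
        for k in range(n_intervals)
    ]


def frame_energy(frame):
    """One possible per-interval signal (assumed): the mean squared amplitude."""
    return sum(x * x for x in frame) / len(frame)


# Hypothetical sample values, chosen only to exercise the functions.
samples = [0.0, 0.1, 0.4, 0.3, -0.2, -0.5, 0.1, 0.0, 0.2]
frames = split_into_intervals(samples, interval_length=4)
energies = [frame_energy(f) for f in frames]  # one value per interval i = 1..N
```

Here N equals the number of complete intervals, so the nine samples above yield N = 2.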
A method and/or an arrangement of this type is used for determining (the limits of) segments from which speech is built up. These segments are sounds, for example in the form of demisyllables, phonemes or diphones.
The aim of such an operation can be, for example, to apply a recognition procedure to the results obtained. In this case one speaks of word or speech recognition. This recognition procedure can mean that the segments obtained are compared with sound reference patterns. If there is sufficient agreement between a segment and a sound reference pattern, the segment is recognized.
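Purely as an illustrative sketch, the comparison of a segment with sound reference patterns could take the following form. The representation of a segment as a fixed-length feature vector, the Euclidean distance as the measure of agreement, the threshold `max_distance` and the example labels are assumptions made for this sketch and do not come from the description.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recognize(segment, reference_patterns, max_distance):
    """Return the label of the closest reference pattern, or None.

    The segment is recognized only if the agreement is sufficient,
    i.e. the smallest distance does not exceed max_distance.
    """
    best_label, best_dist = None, float("inf")
    for label, pattern in reference_patterns.items():
        d = distance(segment, pattern)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label if best_dist <= max_distance else None


# Hypothetical reference patterns and segments.
references = {"a": [1.0, 0.0, 0.0], "o": [0.0, 1.0, 0.0]}
result_match = recognize([0.9, 0.1, 0.0], references, max_distance=0.5)
result_none = recognize([5.0, 5.0, 5.0], references, max_distance=0.5)
```

In this sketch the first segment lies close to the pattern labelled "a" and is recognized, whereas the second agrees with no pattern sufficiently and is left unrecognized.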
Another possibility is that the segments obtained are used to build up a collection of sounds which are all different (for example diphones), after which artificial speech can later be synthesized with the aid of this collection; see for example "Exploring the possibilities of speech synthesis with Dutch diphones" by B. A. G. Elsendoorn et al (1). Up to now the building-up of a collection (or library) of diphones for one language has been done manually by a trained phonetician, and this takes about one year.
A system for obtaining collections of diphones is known from "Fabrication semi-automatique de dictionnaires de diphones" by M. Stella (2). This semi-automatic method segments only 72% of the diphones effectively, so that an operator has to correct the results interactively afterwards.
A segmentation method based on reference patterns of the sounds to be found, namely demisyllables, is described in "A bootstrapping training technique for obtaining demisyllable reference patterns" by L. R. Rabiner et al (3).
A disadvantage of such a method is that, if accurate reference patterns are to be derived, building up a library of such reference patterns takes a great deal of time, often just as much as is needed at present to build up a library of diphones in the known way. This is mainly because the number of reference patterns for such a library is very large: for the Dutch language, approximately 10,000 demisyllables and 1800 diphones.
In his publication "Efficient coding of LPC parameters by temporal decomposition," B. S. Atal (4) also describes a method of segmentation. This method has the disadvantage that the number of segments to be found is not fixed and that it does not determine to which sound each segment obtained belongs.