1. Field
Speech segmentation is the process of identifying the boundaries between words, syllables, or phonemes in spoken natural language. In all natural languages, the meaning of a complex spoken sentence (which often has never been heard or uttered before) can be understood only by decomposing it into smaller lexical segments (roughly, the words of the language), associating a meaning with each segment, and then combining those meanings according to the grammatical rules of the language. The recognition of each lexical segment in turn requires decomposing it into a sequence of discrete phonetic segments and mapping each segment to one element of a finite set of elementary sounds (roughly, the phonemes of the language).
For most spoken languages, the boundaries between lexical units are surprisingly difficult to identify. One might expect that the inter-word spaces used by many written languages, such as English or Spanish, would correspond to pauses in their spoken form; but that is true only in very slow speech, when the speaker deliberately inserts those pauses. In normal speech, many consecutive words are typically spoken with no pauses between them.
2. Description of Related Art
Voice activity detection (VAD), also known as speech activity detection or speech detection, is a technique used in speech processing in which the presence or absence of human speech is detected. [1] The main uses of VAD are in speech coding and speech recognition. VAD can facilitate speech processing, and can also be used to deactivate some processes during the non-speech sections of an audio session: for example, it can avoid unnecessary coding and transmission of silence packets in Voice over Internet Protocol (VoIP) applications, saving computation and network bandwidth.
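The speech/non-speech decision described above can be illustrated with a minimal sketch. It assumes a simple short-term energy threshold, which is only one of many possible detection criteria and is not taken from the cited art; practical detectors typically use more robust features and adaptive thresholds. The function names, the 160-sample frame length (20 ms at an assumed 8 kHz sampling rate), and the threshold value are hypothetical choices made for illustration.

```python
import math


def frame_energies(samples, frame_len):
    """Return the mean squared amplitude of each full frame of the signal."""
    energies = []
    # Step through the signal in non-overlapping frames; a trailing
    # partial frame is ignored.
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def detect_speech(samples, frame_len=160, threshold=0.01):
    """Flag each frame True (active) or False (inactive).

    Assumption: activity is decided by a fixed energy threshold.
    Both default values are illustrative, not recommended settings.
    """
    return [e > threshold for e in frame_energies(samples, frame_len)]


def frames_to_send(samples, frame_len=160, threshold=0.01):
    """Return the indices of frames worth coding or transmitting."""
    flags = detect_speech(samples, frame_len, threshold)
    return [i for i, active in enumerate(flags) if active]


if __name__ == "__main__":
    # Synthetic test signal: silence, a 440 Hz tone standing in for
    # speech, then silence again (8 kHz sampling rate assumed).
    silence = [0.0] * 160
    tone = [0.5 * math.sin(2 * math.pi * 440 * n / 8000) for n in range(160)]
    print(frames_to_send(silence + tone + silence))
```

In a VoIP-style pipeline, only the frames whose indices are returned by `frames_to_send` would be passed to the coder, which is how a detector of this kind avoids spending computation and bandwidth on silence.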