The present invention relates to word syllabification, typically for use in a text-to-speech system for converting input text into an output acoustic signal imitating natural speech.
Text-To-Speech (TTS) systems (also called speech synthesis systems), permitting automatic synthesis of speech from a text are well known in the art; a TTS receives an input of generic text (e.g. from a memory or typed in at a keyboard), composed of words and other symbols such as digits and abbreviations, along with punctuation marks, and generates a speech waveform based on such text. A fundamental component of a TTS system, essential to natural-sounding intonation, is the module specifying prosodic information related to the speech synthesis, such as intensity, duration and fundamental frequency or pitch (i.e. the acoustic aspects of intonation).
A conventional TTS system can be broken down into two main units: a linguistic processor and a synthesis unit. The linguistic processor takes the input text and derives from it a sequence of segments, based generally on dictionary entries for the words and a set of appropriate rules. The synthesis unit then converts the sequence of segments into acoustic parameters, and eventually audio output, again on the basis of stored information. Information about many aspects of TTS systems can be found in "Talking Machines: Theories, Models and Designs", ed G Bailly and C Benoit, North Holland (Elsevier), 1992.
The transcription of orthographic words into phonetic symbols is one of the principal steps carried out by text-to-speech systems. Conventionally, a TTS system would look up words to be syllabified in a dictionary to determine the syllabification thereof. However, as language is constantly evolving, new words often do not have a corresponding entry in the dictionary. Therefore syllabification using a dictionary look-up technique cannot be used for such new words.
A further problem with many conventional text-to-speech systems is that, although the pronunciation of similar combinations of letters or syllables varies according to their context, conventional systems do not take account of such variations. For example, in ascertaining the pronunciation of the word "loophole" only in light of knowledge of the pronunciation of the word "telephone", the consonant cluster "ph" might be pronounced "F". However, if the pronunciation of the word "loophole" were determined only in light of the known pronunciation of "tophat", the consonant cluster might be pronounced as "P" "H". How a cluster of letters is pronounced depends on where the syllable boundaries fall within the word. Possible syllable structures for the word "loophole" might be "loop"+"hole", or alternatively "loo"+"pho"+"le", or maybe "looph"+"o"+"le".
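The ambiguity described above can be illustrated with a minimal Python sketch that enumerates candidate orthographic segmentations of a word. It is only an illustration and not the method of the invention: the function names `candidate_segmentations` and `keeps_cluster` and the `max_parts` limit are assumptions introduced here, and no attempt is made to rank the candidates.

```python
from itertools import combinations


def candidate_segmentations(word, max_parts=3):
    """Yield every split of `word` into 1..max_parts contiguous, non-empty chunks."""
    n = len(word)
    for parts in range(1, max_parts + 1):
        # Choose parts-1 cut positions strictly between the first and last letter.
        for cuts in combinations(range(1, n), parts - 1):
            bounds = (0, *cuts, n)
            yield tuple(word[a:b] for a, b in zip(bounds, bounds[1:]))


def keeps_cluster(segmentation, cluster):
    """True if `cluster` lies wholly inside one chunk, i.e. no boundary splits it."""
    return any(cluster in chunk for chunk in segmentation)


candidates = list(candidate_segmentations("loophole"))
print(len(candidates))                           # 29 candidates with at most 3 parts
print(keeps_cluster(("loop", "hole"), "ph"))     # False: boundary falls between p and h
print(keeps_cluster(("loo", "pho", "le"), "ph")) # True: "ph" stays within one chunk
```

Even with at most three parts, an eight-letter word yields 29 candidates, and only some of them keep "ph" together, which is why the placement of syllable boundaries matters to the later transcription stage.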
The syllable boundaries in a given observed word often, but not always, coincide with the morphological boundaries of the constituent parts of the word. However, so as not to confuse the question of the derivation of a word from its roots, prefixes and suffixes with the question of the pronunciation of the word in small discrete sections of vowels and consonants, the term morphology is not used here. Strictly speaking, the term syllable might be more accurately applied only after transcription to phonemes. However, it is used here to apply to pronunciation units described orthographically. Having identified the most probable sequence of syllables constituting the word "telephone", the information so identified is passed to the phonetic transcription stage to enable better judgements to be made in relation to the pronunciation thereof, and in particular to the pronunciation of consonant and vowel clusters.
Hand-written rule sets can be determined, defining the transcription of a letter in context to a corresponding sound. These essentially view the transcription process as one of parsing with a context-sensitive grammar.
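A hand-written, context-sensitive rule set of this kind can be sketched in Python as follows. The rule format (left context, letters, right context, sound), the three example rules, and the function name `transcribe` are all hypothetical and chosen only for illustration; real rule sets are far larger and use a proper phoneme inventory.

```python
import re

# Hypothetical rules: (left-context regex, letters, right-context regex, sound).
# Rules are tried in order and the first match wins.
RULES = [
    ("", "ph", "", "F"),      # "ph" as in "telephone"
    ("", "c", "[eiy]", "S"),  # "c" before e, i or y, as in "cell"
    ("", "c", "", "K"),       # any other "c", as in "cat"
]


def transcribe(text, rules=RULES):
    """Rewrite letters to sounds left to right; unmatched letters pass through."""
    out = []
    i = 0
    while i < len(text):
        for left, letters, right, sound in rules:
            if (text.startswith(letters, i)
                    and re.search(left + "$", text[:i])
                    and re.match(right, text[i + len(letters):])):
                out.append(sound)
                i += len(letters)
                break
        else:
            # No rule applies at this position: copy the letter unchanged.
            out.append(text[i])
            i += 1
    return out


print(transcribe("cell"))   # ['S', 'e', 'l', 'l']
print(transcribe("cat"))    # ['K', 'a', 't']
```

Because the "ph" rule here has no context restriction, it would also fire on "loophole", which is the kind of error that knowledge of syllable boundaries is meant to prevent.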
Further, some approaches have used additional information such as prefixes and suffixes and parts-of-speech to assist in resolving cases of ambiguous pronunciation. When the phonetic transcription problem is bounded, as is the case for the transcription of proper names, prior art techniques can be employed to improve accuracy of the transcription. The prior art techniques may include, for example, detecting the language of origin of the name and using different spelling-to-sound rules.
Each of the above methods has respective advantages and disadvantages in terms of computational speed, complexity and cost. However, the above prior art methods do not always accurately transcribe new words, neologisms, jargon or other words not previously encountered.