There are a number of text-to-speech synthesis systems in the prior art, which fall into two main categories: unit-selection systems and parametric systems. Unit-selection methods select prerecorded audio segments and join them together to produce the synthesized speech. These systems typically produce high-quality output but have several drawbacks: they require large amounts of memory, they require high computational power, and they offer limited flexibility for customization to new speakers or emotions. Parametric methods use statistical models to learn the conversion from text to speech. Some parametric systems employ hidden Markov models (HMMs) to perform the conversion. Unfortunately, HMMs cannot directly encode standard spectrograms of phonemes because the temporal length of these phonemes is highly variable. Instead, HMMs assume that successive speech segments depend on each other only over very short spans (under 100 milliseconds), which is a sub-optimal model for human speech, since speech contains transitions with much longer context (sometimes spanning several words). The spectrograms must also be altered in order to make the data suitable for the HMMs. Unfortunately, these alterations and modeling assumptions limit the accuracy with which the spectrograms can be encoded by an HMM. There is therefore a need for a system that can encode spectrograms without loss of precision or robustness.
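The short-context limitation of HMMs described above can be illustrated with a minimal sketch. The example below is not taken from any particular prior-art system; the phoneme states and transition probabilities are hypothetical, chosen only to show that a first-order Markov model predicts the next state from the current state alone, discarding any longer history:

```python
# Illustrative sketch (hypothetical states and probabilities): a first-order
# Markov chain over three phoneme-like states. This encodes the assumption
# criticized above: the next state depends only on the current state, so
# context older than one step cannot influence the prediction.
STATES = ["ah", "s", "t"]
TRANSITIONS = {
    "ah": {"ah": 0.6, "s": 0.2, "t": 0.2},
    "s":  {"ah": 0.3, "s": 0.4, "t": 0.3},
    "t":  {"ah": 0.5, "s": 0.3, "t": 0.2},
}

def next_state_dist(history):
    """Distribution over the next state given the full state history.

    Under the first-order Markov assumption only history[-1] matters:
    any two histories that end in the same state yield identical
    predictive distributions, however their earlier context differs.
    """
    return TRANSITIONS[history[-1]]

# Two histories with different long-range context but the same final
# state produce the same prediction; the model is blind to anything
# beyond the most recent segment.
assert next_state_dist(["ah", "s", "t", "ah"]) == next_state_dist(["t", "ah"])
```

In a real HMM-based synthesizer the states would emit acoustic feature vectors rather than being observed directly, but the same one-step dependence applies, which is why transitions whose context spans several words are modeled poorly.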