Text-to-speech systems typically synthesize speech from text input, generating sounds that a listener may perceive as inaccurate or imperfect (i.e., flawed). Such imperfection arises because hearing is not a purely mechanical phenomenon of wave propagation but also a sensory and perceptual event for the listener. In other words, when the listener hears a sound, that sound arrives at the ear as a mechanical wave traveling through the air; the ear transforms the wave into neural action potentials that travel to the brain, where they are perceived. Hence, for acoustic technology such as audio processing, it is advantageous to consider not only the mechanics of the environment but also the fact that both the ear and the brain are involved in a listener's experience.
The inner ear, for example, performs significant signal processing in converting sound waveforms into neural stimuli, and not all differences between sound waveforms are perceived. Specifically, there are limits to the ear's sensitivity to properties of individual sound waveforms, such as volume and frequency. Many of these effects are non-linear: perceived loudness depends on both sound intensity level and frequency, and it depends on intensity level non-linearly. The human ability to identify absolute frequency levels is also limited. Furthermore, it is especially difficult for humans to differentiate audio signals that differ only in phase information. Human hearing is likewise affected when two signals must be processed at nearly the same time.
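Two of these effects can be illustrated concretely. The frequency dependence of perceived loudness is conventionally approximated by the A-weighting curve (the form standardized in IEC 61672), which assigns a gain in dB to each frequency relative to 1 kHz; and phase insensitivity can be demonstrated by showing that two sinusoids differing only in phase have identical magnitude spectra. The sketch below, using only the Python standard library, is illustrative rather than part of any particular system; the function names are chosen here for clarity.

```python
import math

def a_weight_db(f):
    """Approximate A-weighting gain in dB (IEC 61672 form) for frequency f in Hz.
    Models the ear's reduced sensitivity at low and very high frequencies."""
    f2 = f * f
    ra = (12194.0 ** 2 * f2 ** 2) / (
        (f2 + 20.6 ** 2)
        * math.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
        * (f2 + 12194.0 ** 2)
    )
    # The +2.0 dB offset normalizes the gain at 1 kHz to approximately 0 dB.
    return 20.0 * math.log10(ra) + 2.0

def dft_magnitude(x, k):
    """Magnitude of DFT bin k of sequence x, computed directly."""
    n_samples = len(x)
    re = sum(x[n] * math.cos(2 * math.pi * k * n / n_samples) for n in range(n_samples))
    im = -sum(x[n] * math.sin(2 * math.pi * k * n / n_samples) for n in range(n_samples))
    return math.hypot(re, im)

# Equal-amplitude tones at 1 kHz and 100 Hz are not equally loud:
print(round(a_weight_db(1000), 1))  # ~0.0 dB (reference frequency)
print(round(a_weight_db(100), 1))   # roughly -19 dB: far less audible

# Two sinusoids differing only in phase share the same magnitude spectrum:
N, k = 256, 8
tone = [math.sin(2 * math.pi * k * n / N) for n in range(N)]
shifted = [math.sin(2 * math.pi * k * n / N + 1.0) for n in range(N)]
print(abs(dft_magnitude(tone, k) - dft_magnitude(shifted, k)) < 1e-6)  # True
```

The roughly 19 dB attenuation at 100 Hz reflects how much less sensitive the ear is to low frequencies at moderate levels, and the equal DFT magnitudes show why a phase shift alone is largely imperceptible.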