The Carnegie Mellon University (CMU) notation for English language phonemes includes:
AA as in odd
AE as in at
AH as in hut
AO as in ought
AW as in cow
AY as in hide
B as in be
CH as in cheese
D as in dee
DH as in thee
EH as in Ed
ER as in hurt
EY as in ate
F as in fee
G as in green
HH as in he
IH as in it
IY as in eat
JH as in gee
K as in key
L as in lee
M as in me
N as in knee
NG as in ping
OW as in oat
OY as in toy
P as in pee
R as in read
S as in sea
SH as in she
T as in tea
TH as in theta
UH as in hood
UW as in two
V as in vee
W as in we
Y as in yield
Z as in zee
ZH as in seizure
FIG. 16 shows a table of CMU notations of American English phonemes and example words.
Modern automatic speech recognition (ASR) technology is improving in its ability to recognize speakers' words, even when speakers have different accents and use different pronunciations of words. Some ASR systems are able to recognize both S AE N JH OW Z and S AA N HH OW S EY as the word “San Jose”. Note that some words, such as “San Jose”, contain multiple parts separated by a space. Some words include hyphens, such as “give-and-take”. Some words are acronyms (pronounced as a word) or initialisms (pronounced letter by letter), and some may be pronounced either way, such as “MPEP”, pronounced as EH M P IY IY P IY or EH M P EH P.
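For illustration only, the letter-by-letter pronunciation of an initialism can be sketched as a concatenation of letter-name pronunciations in CMU notation. The table LETTER_NAMES and the function spell_out are hypothetical names introduced for this sketch, and the table lists only the letters needed for the “MPEP” example.

```python
# Hypothetical, partial table of letter-name pronunciations in CMU notation.
LETTER_NAMES = {
    "E": ["IY"],
    "M": ["EH", "M"],
    "P": ["P", "IY"],
}


def spell_out(initialism):
    """Return the phonemes for pronouncing a word one letter name at a time."""
    phonemes = []
    for letter in initialism.upper():
        # A KeyError here means the partial table lacks that letter.
        phonemes.extend(LETTER_NAMES[letter])
    return phonemes


print(" ".join(spell_out("MPEP")))  # EH M P IY IY P IY
```

The acronym-style pronunciation EH M P EH P cannot be derived this way and would need its own dictionary entry.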
Many words have one strongly preferred pronunciation, such as “San Jose”. Some words have multiple generally acceptable pronunciations, such as “tomato”, for which pronunciations T AH M EY T OW and T AH M AA T OW are both generally acceptable. That fact was popularized in the song “Let's Call the Whole Thing Off” by George and Ira Gershwin. ASR systems that recognize multiple pronunciations use a phonetic dictionary to map the sequence of graphemes that spells each word to one or more sequences of phonemes. Many systems use proprietary phonetic dictionaries, but CMUdict from researchers at Carnegie Mellon University is a widely used and freely available one.
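As a minimal sketch, a phonetic dictionary that allows several pronunciations per word can be represented as a mapping from each word to a list of phoneme sequences. The names PHONETIC_DICT, PRONUNCIATION_TO_WORD, and recognize are hypothetical, the entries are only the examples given above, and the exact-match lookup stands in for the far more elaborate decoding that a real ASR system performs.

```python
# Hypothetical excerpt of a phonetic dictionary: each word maps to one or
# more pronunciations written in CMU notation.
PHONETIC_DICT = {
    "san jose": ["S AE N JH OW Z", "S AA N HH OW S EY"],
    "tomato": ["T AH M EY T OW", "T AH M AA T OW"],
}

# Invert the dictionary so that a phoneme sequence leads back to its word.
PRONUNCIATION_TO_WORD = {
    pronunciation: word
    for word, pronunciations in PHONETIC_DICT.items()
    for pronunciation in pronunciations
}


def recognize(phonemes):
    """Return the word for a list of phonemes, or None if it is unknown."""
    return PRONUNCIATION_TO_WORD.get(" ".join(phonemes))


print(recognize("T AH M AA T OW".split()))  # tomato
print(recognize("S AA N HH OW S EY".split()))  # san jose
```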
Some systems have speech synthesis functions that produce audio samples that, when sent through a digital-to-analog converter and an amplifier and played through a speaker, produce speech back to users. Such systems also use phonetic dictionaries, but with one sequence of phonemes for the pronunciation of each word. When they produce speech with a pronunciation that is unfamiliar to a user, the speech is either disconcerting to the user or completely misunderstood by the user. Either users need to figure out the system's pronunciation or designers need to design systems to use pronunciations that users expect. Designing such systems is impossible, particularly for words with multiple generally acceptable pronunciations. Therefore, what is needed are systems that can teach users common pronunciations and systems that can learn users' preferred pronunciations.
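The mismatch described above can be illustrated with a minimal sketch of a synthesis dictionary that holds exactly one pronunciation per word. The names SYNTHESIS_DICT, phonemes_for_speech, and matches_expectation are hypothetical, and the choice of T AH M EY T OW as the stored pronunciation is an assumption made only for the example.

```python
# Hypothetical synthesis dictionary: exactly one pronunciation per word.
SYNTHESIS_DICT = {
    "tomato": "T AH M EY T OW",
}


def phonemes_for_speech(word):
    """Return the single phoneme sequence that would be synthesized."""
    return SYNTHESIS_DICT[word.lower()].split()


def matches_expectation(word, expected_pronunciation):
    """Report whether the synthesized pronunciation is the one a user expects."""
    return phonemes_for_speech(word) == expected_pronunciation.split()


# A user who says T AH M AA T OW hears a different pronunciation back.
print(matches_expectation("tomato", "T AH M AA T OW"))  # False
```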