The present disclosure relates to visual representations of audio data.
Different visual representations of audio data are commonly used to display different features of the audio data. For example, an amplitude waveform display shows a representation of audio intensity in the time-domain (e.g., a graphical display with time on the x-axis and intensity on the y-axis). Similarly, a frequency spectrogram shows a representation of frequencies of the audio data in the time-domain (e.g., a graphical display with time on the x-axis and frequency on the y-axis).
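By way of illustration only, the two representations described above can be sketched as follows. This is a minimal sketch rather than any particular application's implementation: the function names `amplitude_waveform` and `spectrogram`, the frame and hop sizes, and the synthetic test tone are all assumptions introduced for the example, and a naive discrete Fourier transform is used in place of the windowed fast Fourier transform that a practical display would likely employ.

```python
import cmath
import math
from typing import List, Tuple


def amplitude_waveform(samples: List[float], sample_rate: int) -> List[Tuple[float, float]]:
    """Pair each sample with its time in seconds (x-axis: time, y-axis: amplitude)."""
    return [(n / sample_rate, s) for n, s in enumerate(samples)]


def spectrogram(samples: List[float], frame_size: int, hop: int) -> List[List[float]]:
    """Return one list of frequency-bin magnitudes per time frame.

    Each frame is transformed with a naive DFT (hypothetical choice, kept
    simple for clarity). Bin k corresponds to k * sample_rate / frame_size Hz.
    """
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        bins = []
        for k in range(frame_size // 2 + 1):
            acc = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n in range(frame_size)
            )
            bins.append(abs(acc))
        frames.append(bins)
    return frames


# Synthetic audio data (assumed for the example): an 8 Hz tone sampled at 64 Hz.
sample_rate = 64
tone_hz = 8
samples = [math.sin(2 * math.pi * tone_hz * n / sample_rate) for n in range(128)]

wave = amplitude_waveform(samples, sample_rate)
spec = spectrogram(samples, frame_size=32, hop=16)

# Index of the strongest frequency bin in each frame.
peak_bins = [max(range(len(f)), key=f.__getitem__) for f in spec]
```

With a frame size of 32 samples at 64 Hz, each bin spans 2 Hz, so the 8 Hz tone falls in bin 4 of every frame. Plotting `wave` gives the time-versus-amplitude display, while plotting `spec` as an image with frames along the x-axis and bins along the y-axis gives the time-versus-frequency display.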
Speech transcription is a process that identifies a script (e.g., English text) from corresponding audio speech. Typically, speech transcription includes performing speech recognition on the audio data. Speech recognition uses one or more techniques to identify audio as corresponding to particular text. Conventional speech recognition applications often use techniques based on hidden Markov models, which are statistical models trained to identify text segments (e.g., words or phonemes) likely to correspond to particular audio data.
Additionally, the speech transcription can include a mapping from audio to the corresponding identified text. The mapping can identify, for example, particular points in time corresponding to a beginning or ending of a word (e.g., the mapping can identify, for a particular transcribed word, a beginning time and an ending time in the audio data corresponding to that word).
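As one possible illustration of such a mapping, the sketch below stores a beginning time and an ending time for each transcribed word and looks up the word spoken at a given point in the audio. The names `WordTiming` and `word_at`, the use of seconds as the time unit, and the sample timings are hypothetical and are introduced only for this example; the disclosure does not prescribe a particular data structure.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class WordTiming:
    """One transcribed word and its span in the audio data (times in seconds)."""
    word: str
    start: float
    end: float


def word_at(timings: List[WordTiming], t: float) -> Optional[WordTiming]:
    """Return the word whose [start, end) interval contains time t, if any.

    Assumes `timings` is sorted by start time and the intervals do not overlap.
    """
    starts = [w.start for w in timings]
    i = bisect_right(starts, t) - 1  # last word starting at or before t
    if i >= 0 and timings[i].start <= t < timings[i].end:
        return timings[i]
    return None


# Made-up example mapping for the transcribed text "hello world".
timings = [
    WordTiming("hello", 0.25, 0.5),
    WordTiming("world", 0.75, 1.25),
]
```

Here `word_at(timings, 0.3)` returns the entry for "hello", while `word_at(timings, 0.6)` returns `None` because that time falls in the gap between the two words. Keeping the entries sorted by beginning time allows a binary search rather than a scan of the whole transcript.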