Certain personal and environmental conditions make it hard to understand the content of media being watched because the sound is not audible. When sound is absent or unreliable, media providers, including media sources and the devices that display the media those sources provide, may utilize subtitles, captions, or both. Subtitles are textual representations of what is spoken by a character or entity viewable in the media on a device. Captions are textual representations of objects, and of how the objects interact within an environment in the media, rendered by the device. For example, a subtitle may provide text on a graphical user interface that includes the words spoken by a character, contemporaneously with the character's speech, whereas a caption reflects noises, such as a crashing noise when a character visible in the interface knocks over an object, such as a lamp, on the screen. Another caption may describe the path of the lamp from the table to the floor. Textual representations of audio content, including both subtitles and captions, are useful to individuals who are viewing media on personal computing devices (e.g., in loud environments, where the sound can be difficult to decipher), watching content in a public setting where the volume settings are not under the individual's control (e.g., watching a movie in a cinema), viewing content with audio in a language that they do not speak, or participating in an online course. Current approaches to providing audio content textually involve a generic audio-to-text translation of words and actions, as represented by subtitles and captions. However, present solutions for providing textual content in place of, or to supplement, audio content largely take a one-size-fits-all approach, meaning that the same content is provided to all users experiencing the media.