1. Technical Field
One or more embodiments of the present disclosure relate generally to editing audio content. More specifically, one or more embodiments of the present disclosure relate to systems and methods for displaying audio waveforms inline with text within an editing user interface.
2. Background and Relevant Art
Computing devices are useful in interacting with multimedia content, such as audio content, in many ways. For example, using a computing device, a user can capture, store, play back, and/or share audio content. In addition, computing devices allow users to edit audio by, for example, trimming unwanted noise, changing the audio characteristics of an audio file, and mixing audio together. Further, computing devices are often used to convert audio data to other types of data. For example, using a computing device, a user can transcribe audio data into text using speech-to-text (“STT”) technologies and/or convert audio data to a graphical representation of the audio data (e.g., a waveform). Accordingly, conventional audio processing systems provide a number of advantages and conveniences. However, conventional audio processing systems suffer from a number of drawbacks and shortcomings as well.
For example, using conventional systems, audio editing can be difficult and confusing, especially for novice users. To illustrate, many conventional audio editing systems display audio as a continuous waveform (e.g., representing the amplitude of the audio content over time). Interacting with the waveform to perform edits can be confusing and unintuitive for users. For example, unlike video and other multimedia, audio waveforms do not contain frames or other visual cues that provide context to the user (e.g., who is speaking or when speakers change during an audio sample, where a particular phrase is in the audio sample, when the waveform is representing spoken word versus music, etc.). Oftentimes, even expert users cannot readily decipher the audio to which a waveform corresponds. As a result, even with the proper training and experience, editing audio waveforms can be a complex and cumbersome process.
Some conventional audio editing systems and methods use STT technologies to provide text derived from audio content. This may provide a user with text corresponding to words recognized in an audio sample. However, providing the text derived from an audio sample does not give any indication of timing, or any other context beyond the words themselves. For example, the text does not indicate when a pause occurs between words, or when the speaker changes. Further, if there is audio content that is not recognizable as speech (such as applause, music, sound effects, or other noise), this content is not properly represented in the text transcription. Accordingly, even in systems that provide text transcriptions of the audio content, it is still difficult for users to accurately correlate the text to the audio content or to use the text to aid in editing the audio content.
These and other problems exist with regard to displaying multimedia and, in particular, displaying audio in a manner that is convenient and understandable to all users.