Many media content items, such as video streams or audio streams, include both speech and non-speech sounds. For the speech sounds (e.g., spoken words, sung words), captions may be added to the content item so that the content may be consumed without needing to hear its audio stream. A very large number (e.g., millions) of such content items may be uploaded to an online content system every day. However, not all of these content items are uploaded along with captions. While captions may later be added by an automated speech recognition system, the accuracy of such captions is often very poor. Captions could also be added by other users (e.g., volunteers); however, these volunteers may have to manually set the beginning and ending timestamps of each caption so that they match the beginning and ending timestamps of the corresponding speech sounds in the content. This manual timing may be inconvenient and may discourage volunteers from providing captions.