In modern content systems, a very large number of videos (e.g., millions) may be uploaded by, and distributed to, users of the content system every day. Many of these videos may be associated with captions, which may be created by the uploading user or by other individuals. These captions may be used to present the speech in a video to a user who is unable to hear the audio accompanying the video or is unable to play it. However, many more videos uploaded to the content system may not be accompanied by captions. For these videos, the content system may attempt to generate captions automatically using various automated captioning systems. A video, however, may often have more than one speaker, with the multiple speakers speaking in different orders and for different durations. This type of scenario may be especially challenging for a traditional captioning system. For example, when two speakers in the video speak in close succession, the captioning system may treat that portion of the video as a single segment of speech and combine the recognized speech from both speakers in a single caption, implying a single speaker. Such a caption does not accurately reflect the speech in the video and thus reduces the accuracy of the generated captions, resulting in a poorer user experience. Hence, a more robust and accurate captioning system is desirable.