Transcription services convert human speech into corresponding text so that a person can review what was said, for example, during a broadcast or a presentation. However, conventional transcription services typically produce deficient results when multiple people are engaged in a conversation, because the produced transcript typically includes a single flow text that is ordered solely by the time at which each word is spoken. A conversation typically captures a collaboration between multiple people; thus, the person currently speaking often changes over time, one person may interrupt another person, and/or two or more people may speak during a same or an overlapping period of time. The single flow text makes it difficult for a person reviewing the transcript to understand the context of the conversation. For instance, a person reviewing the transcript is often unable to effectively identify the person who spoke a particular group of words. Moreover, the single flow text often mixes the words spoken by different people in a disjointed manner, thereby making it difficult to follow the conversation between the multiple people.