A video conference system includes an endpoint device that, for example, captures audio-visual information from participants in a room during a conference and then transmits that information over a network for presentation at remote endpoint devices at other locations joined in the conference. Identifying all of the participants at all of the locations joined in the conference, and which participant or participants are actively talking at any given time, can help achieve a satisfying user experience, but presents a technical challenge.