Currently, various video conferencing systems can combine multiple video streams into a single conference. Users can call into the video conference and see and hear the other attendees. The endpoints that support this type of conferencing include, at a minimum, a microphone, a speaker, a video camera, and a video display.

Although some systems permit conference participants to view all attendees simultaneously, a typical multi-point video conferencing system broadcasts to all participants the image of the individual who is presumed to be the current person-of-interest. Current systems that identify the person-of-interest automatically generally do so by analyzing the audio signal. The underlying assumption is that the video image transmitted to the conference participants should be that of the person who is speaking. The simpler systems that behave in this manner switch the video signal based on which endpoint is contributing the strongest audio signal. More advanced systems can distinguish between spoken words and non-verbal sounds such as coughs or background noise.

The problem with current systems is that they do not take video events into account when determining which video feeds to display to the attendees of the video conference. For example, while someone is speaking, another participant may raise a hand or shake their head in response to what is being said, but the system will continue focusing on the person who is currently speaking. There is no mechanism for bringing such participants into the displayed conference by detecting and focusing on their non-verbal cues. For this reason, current systems fail to provide the “full duplex” person-to-person communication experience that can make a face-to-face meeting so much more satisfying than a teleconference.