A video conference system includes a “near-end” endpoint that captures audio and video of participants, for example in a room during a conference, and then transmits the audio and video to a conference server or to a “far-end” endpoint. The near-end video conference endpoint may detect the participants in the captured video, their locations with respect to one another and to the near-end video conference endpoint, and which one of the participants is an active speaker. The near-end video conference endpoint may also record a speaker history. Participants at various far-end endpoints, however, may have different requirements for an optimal framing of the captured video, because each far-end endpoint has its own setup (e.g., screen sizes, number of screens, and screen locations). Sending the far-end endpoints multiple streams, each containing a different framing of the captured video, is not optimal because it requires infrastructure support and more resources at the video conference endpoints, and may present compatibility issues with some or all of the endpoints.