As video conferencing has developed, conference sites have evolved from a single camera, a single active video, and a single active image to multiple cameras, multiple active videos, and multiple active images. Within a site, these cameras, active videos, and active images are associated with one another according to a physical or logical relationship. For example, site A is a three-screen site, site B is a two-screen site, and site C is a single-screen site. Camera-1 of site A captures an image of the participant at location-1 in site A, and that image is displayed on screen-1 of site A, site B, or site C.
To implement media negotiation and selection in a conference, an existing standard specification defines a set of roles that identify the different media data objects in the conference. These roles are: slides (slides), speaker (speaker), sign language (sl), main device media stream (main), and auxiliary device media stream (alt). During conference establishment, media streams are negotiated and selected according to these role definitions.
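As a minimal sketch of the role-based selection described above, the following Python fragment tags each offered media stream with one of the five roles and lets a receiver pick streams by role. The names `Role`, `MediaStream`, and `select_by_role` are hypothetical and are not taken from the standard specification; the stream identifiers are illustrative only.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    """The five roles named in the text (member names are illustrative)."""
    SLIDES = "slides"
    SPEAKER = "speaker"
    SIGN_LANGUAGE = "sl"
    MAIN = "main"
    ALT = "alt"


@dataclass(frozen=True)
class MediaStream:
    stream_id: str  # hypothetical identifier for the stream
    role: Role


def select_by_role(offered, wanted):
    """Return the offered streams whose role is in the wanted set."""
    return [s for s in offered if s.role in wanted]


# Illustrative offer from a site with one stream per role.
offer = [
    MediaStream("video-1", Role.MAIN),
    MediaStream("video-2", Role.ALT),
    MediaStream("video-3", Role.SLIDES),
]
selected = select_by_role(offer, {Role.MAIN, Role.SLIDES})
print([s.stream_id for s in selected])
```

In this sketch the role is the only key available for selection, which is sufficient as long as each role is carried by at most one stream.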
However, when this role-based definition is applied to a telepresence conference environment with multiple devices and multiple active videos, the number of media data streams it can describe is limited, and it is difficult to represent the multiple media data streams of a multi-stream conference. For example, the definition fits only a scenario with one media stream for main, one media stream for alt, and one media stream for slides; as the number of media streams increases, the streams become difficult to distinguish from one another.
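The limitation can be illustrated with a small sketch. Assume, hypothetically, that the three cameras of site A each offer a stream and that each of these streams can only be labeled main; a receiver that indexes streams by role then cannot tell them apart. The functions `index_by_role` and `ambiguous_roles` and all stream names are assumptions made for illustration.

```python
from collections import defaultdict


def index_by_role(streams):
    """Group (stream_id, role) pairs into a role -> [stream_id, ...] mapping."""
    by_role = defaultdict(list)
    for stream_id, role in streams:
        by_role[role].append(stream_id)
    return dict(by_role)


def ambiguous_roles(streams):
    """Return only the roles that map to more than one stream."""
    return {
        role: ids
        for role, ids in index_by_role(streams).items()
        if len(ids) > 1
    }


# Single-camera site: each role identifies exactly one stream.
single = [("cam-1", "main"), ("aux-1", "alt"), ("pc-1", "slides")]

# Hypothetical three-camera site A: every camera stream is labeled "main".
site_a = [
    ("camera-1", "main"),
    ("camera-2", "main"),
    ("camera-3", "main"),
    ("pc-1", "slides"),
]

print(ambiguous_roles(single))
print(ambiguous_roles(site_a))
```

For the single-camera site no role is ambiguous, whereas for the three-camera site the role main refers to three different streams, so the role alone no longer identifies which camera a stream comes from.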