A traditional multimedia conference mostly refers to an audio-video conference. Although some multimedia conferences have implemented the function of a data conference, the data conference and the audio-video conference remain independent from the point of view of implementation and signaling, and their functions are fused only from the point of view of the user. One important reason for this implementation approach is that the audio-video media server and the data media server are independent and are hard to fuse together; consequently, there is almost no mature media server (MS) that can provide both the audio-video media service and the data media service. FIG. 1 is a schematic diagram of the networking structure of a multimedia conference system of the prior art, comprising a SIP (Session Initiation Protocol) soft terminal, a SIP hard terminal, a PSTN (Public Switched Telephone Network) terminal, a core network, an application server (AS), an audio-video MS and a data MS. The AS is the control center of the whole conference system: it is directly connected to the two MSs, is connected to the terminals through the core network, controls the interaction between the terminals and the MSs at the signaling level, and completes a multimedia conference call. SIP is employed both between the AS and the core network and between the AS and the MSs.
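The separation described above can be illustrated with a minimal Python sketch. It is only a toy model of the prior-art topology, not an implementation of SIP: the class names `MediaServer` and `ApplicationServer`, the `join` method, the `wants_data` flag and the session-identifier format are all hypothetical and introduced here for illustration. The sketch shows that the AS has to open an independent signaling leg toward each MS, so that the audio-video conference and the data conference appear as one conference only to the user.

```python
from dataclasses import dataclass, field


@dataclass
class MediaServer:
    """Hypothetical model of an MS that offers exactly one media service."""
    name: str
    service: str  # "audio-video" or "data"
    sessions: list = field(default_factory=list)

    def open_session(self, terminal: str) -> str:
        # Stand-in for a SIP dialog between the AS and this MS.
        session_id = f"{self.name}:{terminal}"
        self.sessions.append(session_id)
        return session_id


@dataclass
class ApplicationServer:
    """Hypothetical model of the AS, directly connected to both MSs."""
    av_ms: MediaServer
    data_ms: MediaServer

    def join(self, terminal: str, wants_data: bool) -> list:
        # The audio-video leg and the data leg are set up independently;
        # nothing at the signaling level ties them together.
        legs = [self.av_ms.open_session(terminal)]
        if wants_data:
            legs.append(self.data_ms.open_session(terminal))
        return legs


av_ms = MediaServer("av-ms", "audio-video")
data_ms = MediaServer("data-ms", "data")
app_server = ApplicationServer(av_ms, data_ms)

# Assumption for illustration: the SIP soft terminal uses both services,
# while the PSTN terminal joins the audio-video conference only.
soft_legs = app_server.join("sip-soft-terminal", wants_data=True)
pstn_legs = app_server.join("pstn-terminal", wants_data=False)
```

In this sketch a terminal that uses both services ends up with two unrelated sessions, one on each MS, which mirrors the independence of the two conferences at the implementation and signaling level.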