Techniques for holding a remote conference over a network have been proposed. For example, in a known multi-point conference system, a plurality of conference apparatuses and a conference server apparatus are connected through a network. The conference server apparatus combines the images and voices received from the plurality of conference apparatuses and distributes the combined result to those conference apparatuses. An image combining unit of the conference server apparatus stores an XML file that defines the screen layouts available for the combined image to be distributed to each conference apparatus. A layout defined in the XML file can be selected for each conference apparatus at an arbitrary time, such as before a conference starts or during a conference. The image combining unit combines the images of the conference participants according to the layout selected for each conference apparatus, as defined in the XML file, and distributes the resulting combined image to that conference apparatus.
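
The per-apparatus layout selection described above can be illustrated with a minimal sketch. It is not the implementation of the known system: the XML schema (the `layouts`, `layout`, and `region` elements and their attributes), the layout names, the apparatus identifiers, and the functions `load_layouts` and `plan_composition` are all assumptions introduced for illustration. The sketch only plans which participant image goes into which screen region for each apparatus; actual image and voice mixing is omitted.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout definition file. The element and attribute names are
# assumptions for this sketch, not the schema used by the known system.
LAYOUT_XML = """
<layouts>
  <layout name="grid2x2">
    <region x="0" y="0" w="320" h="240"/>
    <region x="320" y="0" w="320" h="240"/>
    <region x="0" y="240" w="320" h="240"/>
    <region x="320" y="240" w="320" h="240"/>
  </layout>
  <layout name="speaker">
    <region x="0" y="0" w="640" h="360"/>
    <region x="0" y="360" w="320" h="120"/>
    <region x="320" y="360" w="320" h="120"/>
  </layout>
</layouts>
"""


def load_layouts(xml_text):
    """Parse the definition file into {layout name: [(x, y, w, h), ...]}."""
    root = ET.fromstring(xml_text.strip())
    layouts = {}
    for layout in root.findall("layout"):
        layouts[layout.get("name")] = [
            tuple(int(region.get(key)) for key in ("x", "y", "w", "h"))
            for region in layout.findall("region")
        ]
    return layouts


def plan_composition(layouts, selection, sources):
    """Pair each source image with a region of the layout selected per apparatus.

    selection maps an apparatus id to a layout name and may be changed at any
    time; sources is the ordered list of participant image ids. Sources beyond
    the number of regions in a layout are simply left out of that plan.
    """
    plans = {}
    for apparatus, layout_name in selection.items():
        regions = layouts[layout_name]
        plans[apparatus] = list(zip(sources, regions))
    return plans


layouts = load_layouts(LAYOUT_XML)

# Each apparatus has its own layout selected (hypothetical ids and choices).
selection = {"apparatus_A": "grid2x2", "apparatus_B": "speaker"}
sources = ["apparatus_A", "apparatus_B", "apparatus_C"]

plans = plan_composition(layouts, selection, sources)
for apparatus, plan in plans.items():
    print(apparatus, plan)
```

Changing an entry in `selection` and calling `plan_composition` again models switching the layout for one apparatus during a conference without affecting the others.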