With recent advances in communication technology, remote conference systems (television conference systems) and videophone systems available even to individuals have been put into practical use.
In such systems, images and sound are transmitted over communication channels such as telephone circuits, which limits the coded bit rate that can be transmitted per channel. To keep the amount of picture signal data below the upper limit of the coded bit rate, the picture information is encoded before transmission.
Because the number of coded bits that can be transmitted per unit time is limited, the number of coded bits available for each frame, when moving pictures are transmitted at a frame rate that ensures natural movement, is determined by the transmission rate.
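The relationship between transmission rate and per-frame budget can be illustrated with a short calculation. The following is only a sketch: the function name `bits_per_frame`, the 64 kbit/s channel, the 16 kbit/s reserved for sound, and the frame rates are hypothetical values chosen for illustration and do not come from the text above.

```python
def bits_per_frame(channel_bps: int, audio_bps: int, frame_rate: float) -> float:
    """Average number of coded bits available for one picture frame.

    All parameters are illustrative assumptions:
    channel_bps -- total bits per second the channel can carry
    audio_bps   -- bits per second reserved for sound
    frame_rate  -- frames per second needed for natural movement
    """
    if frame_rate <= 0:
        raise ValueError("frame_rate must be positive")
    video_bps = channel_bps - audio_bps
    if video_bps <= 0:
        raise ValueError("no channel capacity left for pictures")
    return video_bps / frame_rate


# Hypothetical figures: 64 kbit/s channel, 16 kbit/s of it used for sound.
budget_10 = bits_per_frame(64_000, 16_000, 10)  # 4800.0 bits per frame
budget_30 = bits_per_frame(64_000, 16_000, 30)  # 1600.0 bits per frame
```

Under these assumed figures, tripling the frame rate cuts the budget for each frame to one third, which is why the bits within a frame have to be spent selectively.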
Generally, coding is performed so that the entire screen is uniform in resolution. This, however, blurs the picture of the other party's face. Normally, a person does not pay attention to the full screen, but tends to concentrate on a significant portion of the screen. Therefore, if the picture quality of the significant portion is improved, there is almost no problem in understanding the picture even if the remaining portions have a somewhat low resolution.
In view of this, coding methods have been studied which display the face area of a person, a more important source of information, more sharply than the remaining areas in order to improve the subjective picture quality. One such proposed technique uses interframe differential pictures (literature: Kamino et al., "A study of a method of sensing the face area in a color moving-picture TV telephone," the 1989 Electronic Information Communication Society's Spring National Meeting D-92).
With this system, the person talking over the telephone is captured with a television camera. From the picture signal thus obtained, moving portions in the picture are extracted. The face area of the speaker is estimated on the basis of the extracted area. A large coded bit rate is allocated to the estimated face area and a small coded bit rate is given to the remaining areas. By performing such a coding process, the person's face area is displayed more sharply than the remaining areas.
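The processing flow described above (interframe difference, face-area estimation, unequal bit allocation) can be sketched roughly as follows. This is not the method of the cited literature. It is a minimal illustration under stated assumptions: frames are small grayscale arrays given as lists of lists, the face area is approximated by the bounding box of the moving blocks, and the block size, threshold, and weight of 4 are hypothetical values.

```python
def moving_blocks(prev, curr, block=2, threshold=10):
    """Return the set of (block_row, block_col) positions whose mean absolute
    interframe difference exceeds threshold (both values are assumptions)."""
    rows, cols = len(curr), len(curr[0])
    moving = set()
    for br in range(0, rows, block):
        for bc in range(0, cols, block):
            diffs = [abs(curr[r][c] - prev[r][c])
                     for r in range(br, min(br + block, rows))
                     for c in range(bc, min(bc + block, cols))]
            if sum(diffs) / len(diffs) > threshold:
                moving.add((br // block, bc // block))
    return moving


def bounding_box(blocks):
    """Smallest block rectangle (top, left, bottom, right), inclusive, that
    covers all moving blocks; None when nothing moved. Using this rectangle
    as the face area is a simplifying assumption."""
    if not blocks:
        return None
    block_rows = [r for r, _ in blocks]
    block_cols = [c for _, c in blocks]
    return min(block_rows), min(block_cols), max(block_rows), max(block_cols)


def allocate_bits(total_bits, n_block_rows, n_block_cols, face_box, face_weight=4):
    """Split total_bits over all blocks so that each block inside face_box
    receives face_weight times the share of a block outside it."""
    weights = {}
    for r in range(n_block_rows):
        for c in range(n_block_cols):
            in_face = (face_box is not None
                       and face_box[0] <= r <= face_box[2]
                       and face_box[1] <= c <= face_box[3])
            weights[(r, c)] = face_weight if in_face else 1
    total_weight = sum(weights.values())
    return {pos: total_bits * w / total_weight for pos, w in weights.items()}


# Hypothetical 4x4 frames: only the top-left 2x2 block changes between frames.
prev_frame = [[100] * 4 for _ in range(4)]
curr_frame = [row[:] for row in prev_frame]
for r in (0, 1):
    for c in (0, 1):
        curr_frame[r][c] = 150

face_box = bounding_box(moving_blocks(prev_frame, curr_frame))
bits = allocate_bits(700, 2, 2, face_box)
# bits[(0, 0)] is 400.0; each of the other three blocks receives 100.0
```

Because this sketch treats every moving block as part of the face area, any other moving object or a second person would enlarge the bounding box, so the estimate is only meaningful when the speaker is the sole moving subject.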
When such a face-area extraction method for a moving-picture TV telephone is applied to a conference system, it is difficult to estimate the face area of the speaker if moving objects other than the person are captured unintentionally, or if more than one person is captured and each shows changes of expression.
As described above, when more than one person is captured or when moving objects other than a person are captured, there arises the problem that the face area of the speaker alone cannot be extracted, although this is the most important factor in a method of extracting the face area in a moving picture.
Accordingly, the object of the present invention is to provide a moving-picture coding apparatus capable of estimating the position of the speaker in the video signal precisely, extracting the area of the speaker in the screen accurately, and thereby sharply displaying the area in which the speaker appears.