A conference support apparatus has been developed to help attendees at a conference better understand the contents of speeches by generating caption data from the speeches spoken by the attendees. For example, there is a system that automatically generates caption data by performing voice recognition on speeches spoken by a plurality of speakers. A method has been suggested for eliminating the delay of caption data caused by information extraction processing, such as voice recognition, by correcting the display timing of the caption data with respect to the video/audio. Further, a method has been suggested for recognizing the voice of a person who repeats aloud the speech of a speaker and displaying caption data together with a video of the speaker while delaying that video. A method has also been suggested for conducting a conference while checking, on a screen, the amount of delay caused by data communication.
However, even with the above techniques, an attendee at a conference cannot recognize the amount of delay that another attendee experiences due to information extraction processing such as voice recognition.