The desirability of performing speaker identification in multiparty telecommunication conferences is well recognized in the prior art. During multimedia telecommunication conferences, such as video conferences, speaker identification allows the speaker's video image to be highlighted so that the other parties to the conference can see the expression on the speaker's face more clearly. In addition, if a record is to be made of the telecommunication conference, whether as audio or as audio with text, it is desirable to be able to identify the speaker of each segment of the recorded conference. In some prior art systems, the speaker was assumed to be the party producing the voice stream that had the loudest audio signal. However, this technique fails if one of the parties is in a noisy environment, such as in an automobile or in a room with a loud air conditioning system. Other prior art systems have performed signal processing on all of the audio streams coming into the conference to determine who the speaker or speakers were at any instant during the telecommunication conference. The drawback of this approach is that a large amount of signal processing must be performed in order to identify one or more speakers in a telecommunication conference.
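The loudest-stream heuristic described above, and the way background noise defeats it, can be illustrated with a minimal sketch. The sketch is not taken from any particular prior art system: the function names `rms` and `loudest_party`, the use of root-mean-square level as the loudness measure, and the sample values are all assumptions chosen for illustration.

```python
import math


def rms(frame):
    """Root-mean-square level of one frame of audio samples (assumed loudness measure)."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def loudest_party(frames):
    """Return the party whose current frame has the highest RMS level.

    `frames` maps a party label to that party's current frame of samples.
    """
    return max(frames, key=lambda party: rms(frames[party]))


# Hypothetical four-sample frames, for illustration only.
talker_in_quiet_room = [0.30, -0.28, 0.31, -0.29]   # actual speech
silent_listener = [0.01, -0.01, 0.01, -0.01]        # near silence
listener_in_car = [0.45, -0.50, 0.48, -0.47]        # road noise, no speech

# With only quiet backgrounds, the heuristic picks the real talker.
quiet_case = loudest_party({"A": talker_in_quiet_room, "B": silent_listener})

# Once a noisy party joins, the heuristic picks the noise source instead.
noisy_case = loudest_party(
    {"A": talker_in_quiet_room, "B": silent_listener, "C": listener_in_car}
)
```

In this sketch the party in the automobile is selected even though that party is not speaking, which is the failure mode noted above. Distinguishing speech from noise would require analyzing every incoming stream, which corresponds to the signal processing cost attributed to the other prior art systems.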