1. Field of the Invention
The present invention relates to speaker identification in general, and to determining whether a specific speaker belongs to a group of known speakers in particular.
2. Discussion of the Related Art
Speaker identification generally relates to determining whether a specific speaker, whose voice is available, belongs to a known group of speakers referred to as the target set, whose voice prints are also available. Environments in which such determination is required include, but are not limited to, call centers, trade floors, and law enforcement organizations, which capture and/or record voices involved in conversations, whether the conversations were made by people calling a call center or were lawfully intercepted by a law enforcement agency. Such environments often retain a collection of target speaker voices, i.e. voices of speakers for whom it is required to know when they are calling or otherwise interacting with the organization. Thus, it may be required that an alert be generated whenever activity involving a member of the target set, i.e. a target, is detected. Exemplary target sets include impostors known to a credit card company, criminals or terrorists known to a law enforcement organization, or the like. Speaker identification is often confused with speaker verification, in which a system verifies whether a caller claiming to be a person known to the system is indeed that person. To that end, the system ought to have a characteristic, such as a voice print of the known person, to which the voice print of the calling person is compared. Speaker identification, optionally combined with speaker verification, can also be used for applications such as fraud detection. In fraud actions, a speaker attempts identity theft, i.e. pretends to be another, usually innocent, person. Speaker identification can help in verifying or invalidating the identity of the person, and in assessing whether the speaker belongs to a group of known fraudsters.
Speaker identification systems fall into two main branches: closed-set identification and open-set identification. In closed-set identification, it is known that the speaker is indeed one of the target-set speakers, and the problem is to identify him or her. In open-set identification, the speaker may or may not be one of the target-set speakers. In such case, it is required to determine whether the speaker belongs to the target set and, if so, which target within the set matches the speaker's voice. The decision is usually made in two stages: first, determining the speaker in the target set whose voice has the highest score of matching the specific speaker; and second, deciding whether that highest score exceeds a predetermined threshold, so that it can be declared that the speaker at hand indeed belongs to the target set.
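The two-stage decision described above can be sketched as follows. This is a minimal illustration only, not a description of any particular system: the function name `open_set_identify`, the dictionary of per-target scores, and the convention that a higher score means a better match are all assumptions made for the example.

```python
from typing import Dict, Optional, Tuple


def open_set_identify(
    scores: Dict[str, float], threshold: float
) -> Tuple[Optional[str], float]:
    """Illustrative two-stage open-set decision.

    scores maps a (hypothetical) target identifier to the matching score
    between that target's voice print and the tested voice; higher is
    assumed to mean a better match.
    Returns (target_id, best_score) when the tested voice is declared a
    member of the target set, or (None, best_score) otherwise.
    """
    if not scores:
        raise ValueError("scores must contain at least one target")

    # Stage 1: find the target whose voice print best matches the tested voice.
    best_target = max(scores, key=scores.get)
    best_score = scores[best_target]

    # Stage 2: accept only if the best score exceeds the threshold.
    if best_score > threshold:
        return best_target, best_score
    return None, best_score
```

Called with scores for three hypothetical targets and a threshold of 1.0, the function returns the best-scoring target only if its score is above 1.0; otherwise the speaker is treated as not belonging to the target set.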
Open-set identification is naturally a more complex problem, and is therefore addressed less often. Three types of errors exist in open-set identification: 1) False acceptance (FA) error: occurs when the maximum score is above the threshold although the speaker is a non-target speaker, i.e. does not belong to the target set. 2) False reject (FR, or miss) error: occurs for a target speaker when the highest score is below the threshold. 3) Confusion error: occurs for a target speaker when the highest score is indeed above the threshold, but the wrong target model is the one that yielded the highest score. The confusion error is eliminated when it is only required to know whether the tested speaker belongs to the target set or not, and the exact identity of the speaker is not important. As the predetermined threshold for declaring a voice to belong to a member of the target set is raised, the probability of false accept errors drops, while the probability of false reject errors increases, and vice versa. Choosing a threshold depends, among additional factors, on the type of risk the organization is willing to take. For example, a law enforcement organization might be ready to invest the extra work resulting from higher rates of false acceptance, while financial organizations are more likely to weigh the monetary loss caused by a cheating impostor against the extra effort required to test significantly more calls. Thus, under similar conditions, a law enforcement organization is likely to use a lower threshold, allowing the system to declare more speakers as belonging to the target set and to generate alerts for more speakers than a financial organization would tolerate.
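The three error types and their dependence on the threshold can be illustrated with a short counting routine. This is a sketch under stated assumptions: the trial format (a true identity, or `None` for a non-target speaker, together with per-target scores), the function name `error_rates`, and the choice to normalize the FR and confusion counts by the number of target trials are all hypothetical and chosen only for the example.

```python
from typing import Dict, List, Optional, Tuple

# One trial: (true identity, or None for a non-target speaker; per-target scores).
Trial = Tuple[Optional[str], Dict[str, float]]


def error_rates(trials: List[Trial], threshold: float) -> Dict[str, float]:
    """Count false accept, false reject and confusion errors at a threshold."""
    fa = fr = confusion = 0
    n_target = n_nontarget = 0

    for true_id, scores in trials:
        best = max(scores, key=scores.get)
        accepted = scores[best] > threshold

        if true_id is None:
            # Non-target speaker: any acceptance is a false accept.
            n_nontarget += 1
            if accepted:
                fa += 1
        else:
            n_target += 1
            if not accepted:
                # Target speaker rejected: false reject (miss).
                fr += 1
            elif best != true_id:
                # Target speaker accepted, but matched to the wrong target.
                confusion += 1

    return {
        "false_accept": fa / n_nontarget if n_nontarget else 0.0,
        "false_reject": fr / n_target if n_target else 0.0,
        "confusion": confusion / n_target if n_target else 0.0,
    }
```

Evaluating the same trials at several thresholds shows the trade-off described above: raising the threshold can only lower the false accept rate and can only raise the false reject rate.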
Known problems in the art related to open-set speaker identification include: selecting the relevant parameters for creating a voice print, both of the speakers in the target set and of the tested voice; weighting or otherwise combining the matching of the features; choosing a threshold beyond which a person whose voice is tested is determined to belong to the target set; and constructing one or more background models, which are generally used as a normalization factor or reference for determining the matching between a tested voice and a target voice. However, real-life speaker identification poses difficulties in addition to those handled in an academic or other “ideal” environment. Among these, the following problems are known to cause reduced performance:
1. Often the two or more sides participating in a conversation are captured as a single stream, such that the voices to be analyzed are not separated from other voices. The combined conversation reduces the ability to achieve a separate recording of the desired voice for testing. Moreover, even if the separation between the sides participating in the conversation is satisfactory, often the “other” side, such as the agent in a call center conversation, who is generally of no interest, is the one that yields a higher score and is detected as a target, thus causing useless extra work for an evaluator.
2. Since the target set might comprise hundreds, thousands or more voices, checking each single tested voice against all of them might take a long time, and produce results, if at all, only after an unacceptable delay.
3. The quality of the voice might be sub-optimal, which may significantly harm the reliability of the results.
4. The voices in the target set might behave differently, i.e. the comparison between a tested voice and two (or more) voice prints belonging to two (or more) targets may produce unequally significant results. However, the comparison is made using a uniform threshold, which may lead to skewed results.
5. In open-set speaker identification systems, a known phenomenon is that the false accept error rate increases substantially when the size of the target set increases.
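The growth of the false accept rate with target-set size noted in item 5 can be illustrated with a simple calculation. The sketch below rests on a strong simplifying assumption that is not taken from the text above: each of the N target models is assumed to exceed the threshold for a non-target speaker independently and with the same small probability p. Under that assumption, the probability that at least one model exceeds the threshold is 1 - (1 - p)^N. The function name `set_false_accept_rate` is hypothetical.

```python
def set_false_accept_rate(per_model_fa: float, n_targets: int) -> float:
    """False accept probability for a whole target set.

    Assumes (for illustration only) that each target model falsely
    exceeds the threshold independently with probability per_model_fa.
    A false accept occurs if at least one of the n_targets models does.
    """
    if not 0.0 <= per_model_fa <= 1.0:
        raise ValueError("per_model_fa must be a probability")
    if n_targets < 1:
        raise ValueError("n_targets must be at least 1")
    return 1.0 - (1.0 - per_model_fa) ** n_targets


if __name__ == "__main__":
    # Show how the set-level rate grows with the number of targets.
    for n in (1, 10, 100, 1000):
        print(n, round(set_false_accept_rate(0.001, n), 4))
```

Under this assumption, a per-model rate of 0.1% grows to roughly 63% for a target set of 1000 voices, which is one way to see why a threshold tuned for a small set may not carry over to a large one. Real scores are generally correlated across models, so the figure is indicative only.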
There is therefore a need for a method and apparatus that will provide results improved both in quality and in speed for open-set speaker identification. The method and apparatus should provide an improved solution to known problems related to open-set identification, as well as to problems such as combined conversations, agent separation, poor-quality conversations, and variations of behavior among target set members.