This invention relates to the field of speech processing, and more specifically to the automatic analysis of conversations in unconstrained environments.
Social interactions can be captured using electronic sensors and analyzed by applying audio, speech, and language processing, visual processing, multimodal processing, as well as other human-computer interaction and ubiquitous computing techniques. As described by Gatica-Perez (2009): “the computational analysis of group conversations has an enormous value on its own for several social sciences and could open doors to a number of relevant applications that support interaction and communication, including self-assessment, training and educational tools, and systems to support group collaboration through the automatic sensing, analysis, and interpretation of social behavior”. See Gatica-Perez, D. “Automatic nonverbal analysis of social interaction in small groups: A review”. Image and Vision Computing, Vol. 27, No. 12, pp. 1775-1787, November 2009.
Olguin-Olguin (2007) discloses the use of wearable “sociometric badges” capable of automatically measuring the amount of face-to-face interaction, conversational time, physical proximity to other people, and physical activity levels using social signals derived from vocal features, body motion, and relative location to capture individual and collective patterns of behavior. The percentage of time when an individual was engaged in a conversation was measured. However, the author does not contemplate or teach other methods for the automatic analysis of conversation dynamics, such as turn-taking patterns, using data captured by such wearable sensors. See Olguin Olguin, D. “Sociometric badges: wearable technology for measuring human behavior”. Thesis (S. M.)—Massachusetts Institute of Technology, School of Architecture and Planning, Program in Media Arts and Sciences, 2007.
Kim et al. (2008) present a “meeting mediator” system, also based on the sociometric badges, that detects social interactions and provides feedback on mobile phones in order to enhance group collaboration. Variables used by this system include speaking time, average speech segment length, variation in speech energy, and variation in body movement. The phone visualization was limited to four participants and designed for types of collaboration in which balanced participation and high interactivity are desirable. Each of the four participants was represented as a colored square in a corner of the screen. The color of a central circle gradually changed between white and green to encourage interactivity, with green corresponding to a higher interactivity level. Balance in participation was displayed through the location of the circle: the more a participant talks, the more strongly he or she pulls the circle toward his or her corner. Each member's speaking time was also displayed by varying the thickness of the line connecting the central circle with that member's corner. The visualization was updated every 5 seconds. See Kim, T., Chang, A., Holland, L., and Pentland, A. “Meeting mediator: enhancing group collaboration using sociometric feedback”. In Proceedings of the ACM Conference on Computer Supported Cooperative Work, pp. 457-466, 2008.
Jayagopi et al. (2009) used the following features to model dominance in group conversations that were recorded using video and a circular microphone array: total speaking energy, total speaking length, total speaker turns, speaker turn duration histogram, total successful interruptions, and total speaker turns without short utterances. However, participants were asked to wear both a headset and a lapel omnidirectional microphone and were constrained to a meeting room. Three cameras were mounted on the sides and back of the room. See Jayagopi, D. B., et al. “Modelling Dominance in Group Conversations Using Nonverbal Activity Cues”. IEEE Transactions on Audio, Speech and Language Processing, Vol. 17, No. 3, pp. 501-513, March 2009.
Salamin and Vinciarelli (2012) propose an approach for the automatic recognition of roles in conversational broadcast data, in particular news and talk-shows. The approach makes use of behavioral evidence extracted from speaker turns to infer the roles played by different individuals. Their approach consists of (1) extracting the turns using a “speaker diarization” approach that gives a list of triples

S = {(s1, t1, Δt1), . . . , (sN, tN, ΔtN)}

where N is the number of turns extracted by the diarization approach, si ∈ A = {a1, . . . , aG} is a speaker label, G is the total number of speakers detected during the diarization, ti is the starting time of turn i, and Δti is its length.
The turn sequence S provides information about who speaks when and for how long. This makes it possible to extract features accounting for the overall organization of turns as well as for the prosodic behavior of each speaker. The second step in their approach consists of (2) extracting different features that account for the way an individual participant contributes to the turn organization (total number of turns for current speaker, time from the beginning of recording to first turn of current speaker, average time between two turns of current speaker) as well as features that account for how a particular turn contributes to the overall turn organization (turn duration, time after last turn of the current speaker, among others). After the feature extraction step, the sequence S of turns is converted into a sequence X={x1, . . . , xN} of observations, where the components of vectors xi correspond to some of the features described previously. See Salamin, H., and Vinciarelli, A. “Automatic Role Recognition in Multiparty Conversations: an Approach Based on Turn Organization, Prosody and Conditional Random Fields”. IEEE Transactions on Multimedia, Vol. 14, No. 2, pp. 338-345, April 2012.
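The per-speaker features described above can be sketched from a diarization output as follows. This is an illustrative reconstruction based only on the feature descriptions given here, not the cited authors' implementation; the function and dictionary key names are assumptions.

```python
# Sketch: per-speaker turn-organization features from a diarization output
# S = [(speaker, start_time, duration), ...] sorted by start time.
from collections import defaultdict

def turn_organization_features(turns):
    """Compute, for each speaker: total number of turns, time from the
    beginning of the recording to the speaker's first turn, and the
    average time between two consecutive turns of the same speaker."""
    by_speaker = defaultdict(list)
    for spk, start, dur in turns:
        by_speaker[spk].append((start, dur))

    features = {}
    for spk, ts in by_speaker.items():
        starts = [s for s, _ in ts]
        # Gaps between consecutive turn starts of the same speaker.
        gaps = [b - a for a, b in zip(starts, starts[1:])]
        features[spk] = {
            "total_turns": len(ts),
            "time_to_first_turn": starts[0],
            "avg_time_between_turns": sum(gaps) / len(gaps) if gaps else 0.0,
        }
    return features

turns = [("a1", 0.0, 2.5), ("a2", 2.5, 1.0), ("a1", 3.5, 4.0)]
feats = turn_organization_features(turns)
```

Per-turn features (turn duration, time since the speaker's last turn) could be added analogously by iterating over the sequence once more.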
Kim et al. (2012) describe a set of features used to study turn-taking patterns:
1) Turn duration statistics. The mean, median, maximum, variance, and minimum of speaker turn durations in the clip, as well as the average number of turns.
2) Turn-taking count across speakers. This information can be modeled with bigram counts, i.e. the number of times participants from different groups take turns one after another.
3) Amount of overlap relative to the clip duration.
4) Turn keeping/turn stealing ratio in the clip. The ratio between the number of times a speaker change happens after an overlap and the number of times it does not. See Kim, S., Valente, F., and Vinciarelli, A. “Automatic detection of conflicts in spoken conversations: ratings and analysis of broadcasting political debates”. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2012.
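The first three of these statistics can be computed directly from a list of labeled, timed turns. The following is a minimal illustrative sketch (not the cited paper's code); the segment format and variable names are assumptions.

```python
# Sketch: turn-duration statistics, speaker-change bigram counts, and
# overlap relative to clip duration, from turns = [(speaker, start, end), ...].
import statistics
from collections import Counter

def clip_features(turns, clip_duration):
    durations = [end - start for _, start, end in turns]
    stats = {
        "mean": statistics.mean(durations),
        "median": statistics.median(durations),
        "max": max(durations),
        "min": min(durations),
        "variance": statistics.pvariance(durations),
    }
    # Bigram counts: how often one speaker follows another.
    bigrams = Counter(
        (a[0], b[0]) for a, b in zip(turns, turns[1:]) if a[0] != b[0]
    )
    # Overlap: time where a turn begins before the previous one has ended.
    overlap = sum(max(0.0, a[2] - b[1]) for a, b in zip(turns, turns[1:]))
    return stats, bigrams, overlap / clip_duration

turns = [("a", 0.0, 3.0), ("b", 2.5, 5.0), ("a", 5.0, 6.0)]
stats, bigrams, overlap_ratio = clip_features(turns, clip_duration=6.0)
```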
Dong et al. (2012) describe an approach based on Markov jump processes to model group interaction dynamics and group performance. They estimate conversational events such as turn taking, backchannels, and turn transitions, and link this micro-level behavior with macro-level group performance. The authors define a speaking turn as one continuous segment of at least a fixed minimum length (e.g., not less than 1.5 s) in which a participant starts and ends her/his speech. They model the following aspects of the turn-taking structure: (i) taking the turn: nobody is speaking and somebody takes the turn; (ii) backchannel: a subject Y speaks after a subject X for less than 1 s (e.g., “yes” or “uh-huh”); (iii) speaker transition: somebody ends a turn and another person takes the turn; (iv) turn competition: two subjects speak at the same time and one ends before the other. See Dong, W., Lepri, B., Kim, T., Pianesi, F., and Pentland, A. “Modeling Conversational Dynamics and Performance in a Social Dilemma Task”. 5th International Symposium on Communications Control and Signal Processing (ISCCSP), 2012.
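The four event types above can be illustrated with a simple rule-based labeler. This is a minimal sketch under simplified assumptions (pairwise comparison of consecutive segments only), not the cited authors' Markov jump process model; the 1 s backchannel threshold follows the description, and the function name and tie-breaking order are illustrative.

```python
# Sketch: label turn-taking events from consecutive speech segments.
# Each segment is (speaker, start, end); prev_segment is None if nobody
# was speaking before the current segment.

def classify_event(prev_segment, segment, backchannel_max=1.0):
    spk, start, end = segment
    if prev_segment is None:
        return "taking_the_turn"      # nobody is speaking, somebody starts
    p_spk, p_start, p_end = prev_segment
    if spk != p_spk and (end - start) < backchannel_max:
        return "backchannel"          # short interjection, e.g. "uh-huh"
    if spk != p_spk and start < p_end:
        return "turn_competition"     # simultaneous speech, one ends first
    if spk != p_spk:
        return "speaker_transition"   # previous turn ended, new speaker
    return "turn_continuation"        # same speaker keeps the floor
```

For example, `classify_event(("a", 0.0, 2.0), ("b", 1.5, 4.0))` labels a turn competition, since the two segments overlap and belong to different speakers.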
U.S. Pat. No. 6,246,986 discloses an interactive voice response unit (VRU) that controls at least one prompt delivered by such unit, including a recognizer that discards input signals that fail to meet usefulness criteria and a phrase detector (claim 1). It also discloses a VRU where the phrase detector includes a turn-taking module that ascertains the rate at which the phrase detector detects significant signal segments (claim 18), the lengths of silences between significant signal segments (claim 19), and the inflections at which it detects significant signal segments (claim 20).
U.S. Pat. No. 8,126,705 describes a system and methods for automatically adjusting floor controls for a conversation. A method of identifying a conversation includes the steps of extracting streams of feature data from a conversation and analyzing them in various combinations of users to identify a conversation between two or more users. Another method receives one or more audio streams, distinguishes one or more audio sub-streams, mixes the sub-streams, analyzes one or more conversational characteristics of two or more users and automatically adjusts the floor controls.
U.S. Pat. No. 8,131,551 discloses a system and method for controlling the movement of a virtual agent while the agent is speaking with a human. The method receives speech data to be spoken by the virtual agent, performs a prosodic analysis of the speech data, selects matching prosody patterns from a speaking database, and controls the virtual agent movement according to selected prosody patterns.
Prior systems and methods assume that the number of participants in the conversation is fixed and known a priori, that all participants in the conversation are co-located, and that the turn characteristics are pre-defined (e.g., the length of a turn or the length of a pause between turns).