Systems such as computer games and multimedia applications have evolved to the point where they can use user movement and verbal communication as inputs. Such natural systems may be geared toward multiple users, where it is desirable to distinguish individuals from one another. Techniques exist that allow a game or application to identify users within the field of view through a variety of mechanisms, including a three-dimensional depth camera capable of sensing user traits such as size, facial features, and clothing color. Voice recognition techniques also exist to identify perceived user voices through a variety of mechanisms, including a microphone array. These two techniques have not conventionally been used in tandem. It would be desirable to automatically match user voices with bodies without requiring a deliberate setup on the part of the users. For example, a person's identity may be ambiguous when using imaging techniques alone or audio techniques alone, especially in lower cost consumer systems. In addition to helping to disambiguate users, such a correlation of audio and visual identity can be used to bolster the user experience within the game or application.