It is becoming increasingly useful to detect persons who are co-located in the same physical area during the same time frame. For example, it is useful to identify people who take part in an event, so that attendance for the event may be known, and so that actions and communications from the event may be attributed accurately. Accurate attribution involves identifying people who were at the event and, in some cases, what they contributed. An example application is identifying the people attending a meeting, what each person said, and when they said it (i.e., automatically scheduling, archiving, and tracking person-to-person meetings and discussions in physical spaces). Examples of physical spaces include conference rooms, classrooms, and offices. Any space where groups of people regularly hold discussions, or where the area is a shared resource for some activity or event, could benefit from such an identification and tracking system.
Meeting users may record their meetings using multiple input sources. These input devices can include digital recorders, mobile applications, and phone conference bridges. While a single recording device may suffice in an intimate setting such as a cafe table or a small meeting room, in other cases a single device is not sufficient to capture audio of good quality. One such case is a large, noisy space such as a board room or a lecture hall, where the speaker may be far from the recording device and there may be a large amount of background noise. Another is a hybrid conference call/in-person meeting: a conference room with several participants plus a call-in number for remote participants. Existing recording systems may not capture the audio of such meetings with consistent or acceptable quality.
Deep neural networks (DNNs) and convolutional neural networks (CNNs) have previously been used to analyze speech data. Common applications include speech recognition, keyword detection, phoneme recognition, and voice (person) verification. The output of such a network may be a phoneme, a character, or a member of a smaller class set.
Many current techniques perform well for given applications; however, they often blur the search space more than desired or do not fully obfuscate the index (they only add metadata). A classical text index takes the form of a lookup or hash table, which takes a query string as a key and outputs a list of matching points in the string, audio timeline, or any other data object. Because hash-table lookups take O(1) time on average, this allows rapid search through indexed data. One problem with classical indexing is that the full text must be retained, which may be untenable for legal, privacy, or security reasons. Another problem is that classical indices are notoriously rigid and fragile. For example, a query for ‘Steven’ may not return the indexed lookup for ‘Stephen’. Similarly, the words ‘car’ and ‘automobile’ would return completely different sets of results. Many techniques exist to make search ‘fuzzy’, including linear discriminant analysis, principal component analysis, Soundex indexing, and locality-sensitive hashing.
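The rigidity of a classical index, and the way a phonetic key such as Soundex loosens it, can be illustrated with a minimal sketch. The function names `build_index` and `soundex` and the sample sentence are hypothetical and chosen only for illustration; the Soundex routine follows the common American Soundex rules and is not presented as the technique of any particular system.

```python
from collections import defaultdict

def build_index(text):
    """Classical exact-match index: token -> list of word positions."""
    index = defaultdict(list)
    for pos, token in enumerate(text.lower().split()):
        index[token.strip(".,;:!?")].append(pos)
    return index

# Soundex digit assigned to each consonant group (vowels, h, w, y have no code).
_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        _CODES[ch] = digit

def soundex(word):
    """American Soundex key: first letter followed by three digits."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    key = word[0].upper()
    prev = _CODES.get(word[0], "")
    for ch in word[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            key += code
        if ch not in "hw":  # h and w do not separate equal adjacent codes
            prev = code
    return (key + "000")[:4]

# Hypothetical sample text.
index = build_index("Stephen parked the car. Stephen left.")

# The exact-match lookup is fast but rigid: a spelling variant finds nothing.
exact_hits = index.get("stephen", [])
variant_hits = index.get("steven", [])

# Keying on the Soundex code lets the two spellings collide on purpose.
same_name = soundex("Steven") == soundex("Stephen")

# A phonetic key does not help with synonyms, which sound nothing alike.
same_vehicle = soundex("car") == soundex("automobile")
```

In this sketch `exact_hits` holds both positions of ‘Stephen’ while `variant_hits` is empty, `same_name` is true, and `same_vehicle` is false. Note also that the exact-match index stores every token of the original text in the clear, which is the retention problem described above.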