In real-world perception, one is often confronted with the problem of selectively attending to objects or sources that emit signals. Unfortunately, there is an incredible variety of acoustic signals. For example, the human voice can be used for speech and singing, and musical instruments, such as strings, woodwinds, and percussion, form another class of acoustic signal sources. Acoustic signals can come from natural sounds, such as animals and the environment, as well as from man-made sound sources.
Humans typically have no difficulty in separating both well-known sounds and novel sounds. However, for generative models, the great variety of possible sounds presents a modeling problem. It is difficult to construct a large model that can be applied to any type of sound. In addition, sounds can obscure each other in a state-dependent way. Typically, the states of all sounds determine which sound dominates a particular part of the acoustic spectrum.
For example, speech from different people is intermingled in a single mixture signal in what is known as the cocktail party effect. Humans are capable of focusing auditory attention on a particular stimulus while filtering out a range of other stimuli. This is exemplified by the way that a partygoer can focus on a single conversation in a noisy room. In acoustic signal processing, the corresponding task is known as auditory scene analysis, which seeks to identify the components of acoustic signals corresponding to individual sound sources (such as people's voices) in the mixture signal.
Even though the components of the sound correspond to objective entities or events in the world, how one can precisely define an analysis of the signal into components can differ depending on the purpose of the analysis. There may be different criteria for analysis and different levels of categorization that are considered when defining the component structure to be analyzed.
For example, many types of sound naturally admit a hierarchical decomposition into components and their sub-component parts. In speech, one person's voice could be considered a component at one level of analysis, whereas each word in the person's speech could be considered a component at a more detailed level of analysis. Moreover, the sound from a group of speakers could be considered a single component if the task is to separate all speech from non-speech. Alternatively, the division of speech into components could treat male speech and female speech as two different components.
Similarly, in music there are natural hierarchies of components and sub-components. At the highest level of analysis is the whole ensemble of sound, followed by different groups of instruments, then, at a lower level of analysis, different individual instruments, and finally individual note events. The components representing groups of instruments could be defined by different criteria, such as the category of instrument (e.g., flutes versus clarinets), or by the melodic or rhythmic part (e.g., themes versus accompaniment) that the instruments play.
Despite the fact that there may be different and even conflicting definitions of the components of a signal, one can define a particular component structure for a given task. For example, separating speech from non-stationary noise is a clearly defined task. The definition of the component structure can be made concrete by the use of databases of examples of acoustic data containing a mixture of speech and non-stationary noise, together with the corresponding example components, speech and non-stationary noise. By arbitrarily mixing together speech and non-stationary noise signal components, one can define a large problem space, spanned by an arbitrarily large set of examples, that represents the target application well.
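The construction of mixture examples from separate components can be sketched as follows. This is a minimal illustration only: the function name mix_at_snr, the synthetic stand-in signals, and the choice of a signal-to-noise ratio as the mixing parameter are assumptions made for the example and are not taken from any particular database.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` to a target speech-to-noise ratio (in dB) and add it to `speech`.

    Returns the mixture together with the two components that produced it,
    so that each example carries its own reference component structure.
    """
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose gain so that speech_power / (gain**2 * noise_power) = 10**(snr_db / 10).
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    scaled_noise = gain * noise
    return speech + scaled_noise, speech, scaled_noise

# Synthetic stand-ins (assumptions): a tone in place of speech, and white noise
# with a slowly varying envelope in place of non-stationary noise.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
speech = np.sin(2 * np.pi * 220.0 * t)
noise = rng.standard_normal(t.size) * (1.0 + 0.5 * np.sin(2 * np.pi * 3.0 * t))

mixture, speech_part, noise_part = mix_at_snr(speech, noise, snr_db=5.0)
```

Drawing different speech and noise recordings and different mixing levels in this way yields as many labeled examples as desired.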
In general, however, separating speech from non-stationary noise is considered to be a difficult problem. Separating speech from other speech signals is particularly challenging because all sources belong to the same class and share similar characteristics. Separating speech from same-gender speakers is one of the most difficult cases because the pitches of the voices are in the same range.
When the goal is to analyze a complex acoustic scene into its components, several difficulties arise: different sounds may overlap and partially obscure each other in an analysis feature space, the number of sounds and sound types may be unknown, and multiple instances of a particular type may be present.
These problems can be addressed by treating the analysis as a segmentation problem, where a set of analysis feature elements in a signal is formulated as an indexed set of analysis features derived from the signal. Each element includes the analysis feature values, which are typically multi-dimensional representations of a tiny part of the signal.
In order to use the elements to distinguish components, the elements have to be designed such that, in general, each element mainly represents a part of only one of the components. This is approximately true in some cases. For example, in a mixture of speakers analyzed by a short-time Fourier transform, a large percentage of the time-frequency bins are dominated by one speaker or another. In this sense, the elements dominated by a single component correspond to that component, and if they can be identified, they can be used to reconstruct an approximation to that component. Partitioning the signal elements into groups can thus provide a means of segmenting the signal into components.
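The idea that each time-frequency bin is dominated by one component can be illustrated with a small sketch. The helper stft, the frame length, the hop size, and the two synthetic tones standing in for speakers are all assumptions made for the example; real speech would give a less clean partition.

```python
import numpy as np

def stft(x, frame_len=256, hop=128):
    """Minimal short-time Fourier transform: Hann-windowed frames, one-sided FFT."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)  # shape: (n_frames, frame_len // 2 + 1)

sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
source_a = np.sin(2 * np.pi * 300.0 * t)    # stand-in for a first speaker
source_b = np.sin(2 * np.pi * 1500.0 * t)   # stand-in for a second speaker
mixture = source_a + source_b

spec_a, spec_b, spec_mix = stft(source_a), stft(source_b), stft(mixture)

# An element (time-frequency bin) is assigned to source A when A's magnitude
# exceeds B's in that bin. This uses the isolated sources, so it is a reference
# partition, not something available at test time.
mask_a = np.abs(spec_a) > np.abs(spec_b)

# Keeping only the mixture bins assigned to A approximates A's spectrogram.
approx_spec_a = np.where(mask_a, spec_mix, 0.0)
```

Here the partition of the bins into two groups is exactly the kind of element grouping that a segmentation method would have to estimate from the mixture alone.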
Although clustering methods could be used for segmentation, segmentation is fundamentally different. Clustering is typically formulated as a domain-independent problem based on simple objective functions defined on pairwise point relations. In contrast, segmentation usually depends on complex processing of the entire input, and the task objective can be arbitrarily defined using training examples with segment labels.
Segmentation can be broadly categorized as class-based segmentation, where the goal is to label known object classes based on learned object class labels, and partition-based segmentation, where the task is to segment the input based on learned partition labels without requiring object class labels. Solving a partition-based segmentation problem has the advantage that unknown objects can be partitioned.
In single-channel speech separation, the time-frequency elements of the spectrogram are partitioned into regions dominated by a target speaker, either based on classifiers or generative models. Deep neural networks can also be applied to class-based segmentation problems.
However, class-based approaches have limitations. The task of labeling known classes does not address the general problem in real-world signals, where there can be a large number of possible classes and many objects may not have a well-defined class. Also, it is not clear how to directly apply conventional class-based approaches to more general problems. Class-based deep network models for separating sources require explicit representation of output classes and object instances in the output nodes, which leads to difficulties in the general case.
Although generative model-based methods can in theory be flexible with respect to the number of model types and instances at test time, there remain great difficulties in scaling inference computationally to the potentially much larger problems posed by more general segmentation tasks.
In contrast, humans seem to solve the partition-based problem, because they can easily segment novel objects and sounds. This observation is the basis of Gestalt theories of perception, which attempt to explain perceptual grouping in terms of features such as proximity and similarity. The partition-based segmentation task is closely related, and follows from a tradition of work in image segmentation and acoustic separation. Application of the perceptual grouping theory to acoustic segmentation is generally known as computational auditory scene analysis (CASA).
Spectral Clustering
In machine learning, spectral clustering has been used for image and acoustic segmentation. Spectral clustering uses local affinity measures between features of elements of the signal, and optimizes various objective functions using spectral decomposition of a normalized affinity matrix. In contrast to conventional central clustering, such as k-means, spectral clustering has the advantage that it does not require points to be tightly clustered around a central prototype, and can determine clusters of arbitrary topology, provided that the clusters form a connected sub-graph. Because of the local form of the pairwise kernel functions used, in difficult spectral clustering problems, the affinity matrix has a sparse block-diagonal structure that is not directly amenable to central clustering, which works well when the block diagonal affinity structure is dense. The powerful but computationally complex eigenspace transformation step of spectral clustering addresses this, in effect, by “fattening” the block structure, so that connected components become dense blocks, before central clustering.
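The steps described above (local pairwise affinities, a normalized affinity matrix, an eigenspace transformation, then central clustering) can be sketched on a toy problem. This is an illustrative sketch under stated assumptions, not a reference implementation: the Gaussian kernel, the symmetric normalization, the row normalization of the embedding, the small deterministic k-means, and the two-ring data set are all choices made for the example.

```python
import numpy as np

def kmeans(points, k, n_iter=20):
    """Tiny k-means; farthest-point initialization keeps the example deterministic."""
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def spectral_clustering(points, k, sigma):
    # Local pairwise affinities from a Gaussian kernel (no self-affinity).
    sq_dist = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    affinity = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(affinity, 0.0)
    # Symmetrically normalized affinity matrix D^(-1/2) W D^(-1/2).
    inv_sqrt_degree = 1.0 / np.sqrt(affinity.sum(axis=1))
    normalized = affinity * inv_sqrt_degree[:, None] * inv_sqrt_degree[None, :]
    # Eigenspace transformation: keep the k leading eigenvectors
    # (eigh returns eigenvalues in ascending order).
    _, eigvecs = np.linalg.eigh(normalized)
    embedding = eigvecs[:, -k:]
    embedding = embedding / np.linalg.norm(embedding, axis=1, keepdims=True)
    # Central clustering in the transformed space.
    return kmeans(embedding, k)

# Two concentric rings: connected clusters with no central prototype.
angles = np.linspace(0.0, 2.0 * np.pi, 60, endpoint=False)
ring = np.stack([np.cos(angles), np.sin(angles)], axis=1)
points = np.vstack([1.0 * ring, 3.0 * ring])

labels = spectral_clustering(points, k=2, sigma=0.3)
```

With the kernel width chosen here, affinities between the rings are negligible while neighbors within each ring remain connected, so the leading eigenvectors map each ring to a tight group that central clustering can separate.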
Although affinity-based methods have been used for unsupervised inference, multiple-kernel learning methods can be used to train weights for combining separate affinity measures. This makes it possible to use multiple-kernel learning methods for partition-based segmentation tasks in which partition labels are available, without requiring specific class labels. Those methods have been applied to speech separation, using a variety of complex features developed to implement various auditory scene analysis grouping principles, such as similarity of onset, offset, pitch, and spectral envelope, as affinities between time-frequency regions of the spectrogram. The input features can include a dual pitch-tracking model to improve upon the relative simplicity of kernel-based features, at the expense of generality.
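A much simplified stand-in for learning weights over separate affinity measures is sketched below. It is not a multiple-kernel learning algorithm from the literature: it merely fits the weights by least squares so that the weighted sum of affinities approximates the ideal affinity implied by the partition labels, and then clips negative weights to zero. The feature values, kernel widths, and names are hypothetical.

```python
import numpy as np

def gaussian_affinity(feature, sigma):
    """Pairwise affinity between elements based on one scalar feature."""
    diff = feature[:, None] - feature[None, :]
    return np.exp(-diff ** 2 / (2.0 * sigma ** 2))

# Hypothetical features for six time-frequency regions (values are made up).
onset = np.array([0.10, 0.12, 0.11, 0.50, 0.52, 0.51])         # seconds
pitch = np.array([100.0, 200.0, 150.0, 120.0, 180.0, 160.0])   # Hz
partition = np.array([0, 0, 0, 1, 1, 1])  # partition labels, no class labels

kernels = [gaussian_affinity(onset, 0.05), gaussian_affinity(pitch, 20.0)]

# Ideal affinity: 1 when two elements share a partition label, 0 otherwise.
ideal = (partition[:, None] == partition[None, :]).astype(float)

# Least-squares fit of ideal ~ sum_m weights[m] * kernels[m], then clip at zero
# (a crude simplification of a non-negativity constraint).
design = np.stack([kernel.ravel() for kernel in kernels], axis=1)
weights, *_ = np.linalg.lstsq(design, ideal.ravel(), rcond=None)
weights = np.clip(weights, 0.0, None)

combined = sum(w * kernel for w, kernel in zip(weights, kernels))
```

In this toy data the onset feature agrees with the partition labels and the pitch feature does not, so the fit places most of the weight on the onset affinity. Only partition labels are used; no element is ever assigned to a named class.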
Learned feature transformations known as embeddings are used in a number of applications. Unsupervised embeddings obtained by auto-associative deep networks, used with relatively simple clustering procedures, can outperform spectral clustering methods in some cases. Embeddings trained using pairwise metric learning, using neighborhood-based partition labels, have also been shown to have interesting invariance properties; see Mikolov et al., “Distributed representations of words and phrases and their compositionality,” Proc. NIPS, 2013, pp. 3111-3119.