Typically, home appliances such as AV systems and media servers do not operate in a true multiuser environment. A server or disk holding folders of music, movies, photos and other digital files can usually be accessed unconditionally by all users, who all have the same priority and access rights. Every user may access and process all information, that is, edit, read, write/create, delete and execute files.
Prior art for a system applicable for multiuser access to, and distribution of, multimedia information is disclosed in U.S. Pat. No. 8,924,468.
Prior art for a method for i-vector detection and classification is disclosed in DK PA 201400147.
So far the challenge has been to identify a specific user who may have individually allocated resources. It is very inconvenient to require users to "log in" in the ordinary manner known from IT systems in order to identify themselves.
Identifying users via spoken commands and speech recognition is a feasible approach, but existing systems require substantial online processing resources to identify commands and convert them into the related control commands.
The i-vector feature extraction approach has been the state of the art in speaker recognition in recent years. The i-vectors capture the total variability, which may include speaker, channel and source variability. Variable-length speech utterances are mapped into fixed-length, low-dimensional vectors that reside in the so-called total variability space. While it is possible to work directly with the raw i-vector distribution, the fixed length of i-vectors has resulted in a number of powerful and well-known channel compensation techniques that deal with unwanted channel variability and hence improve speaker recognition performance.
As a good starting point, linear discriminant analysis (LDA) is a non-probabilistic method used to further reduce the dimensionality of i-vectors, which simultaneously maximizes the inter-speaker variability and minimizes the intra-speaker variability. After centering and whitening, the i-vectors are more or less evenly distributed around a hypersphere.
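The centering, whitening and LDA steps described above can be illustrated with a minimal NumPy sketch. It is not taken from the cited prior art; the function names whiten and lda_projection, the array layout (one i-vector per row) and the use of a Cholesky factor to solve the generalized eigenproblem are assumptions made for illustration only.

```python
import numpy as np

def whiten(X):
    """Center X (n_samples, dim) and whiten it so its sample covariance is the identity."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    # Scale each eigenvector by 1/sqrt(eigenvalue); assumes cov is full rank.
    return Xc @ (eigvec / np.sqrt(eigval))

def lda_projection(X, labels, n_components):
    """Return a (dim, n_components) matrix maximizing inter-speaker scatter
    relative to intra-speaker scatter."""
    dim = X.shape[1]
    overall_mean = X.mean(axis=0)
    Sw = np.zeros((dim, dim))  # intra-speaker (within-class) scatter
    Sb = np.zeros((dim, dim))  # inter-speaker (between-class) scatter
    for spk in np.unique(labels):
        Xs = X[labels == spk]
        mu = Xs.mean(axis=0)
        Sw += (Xs - mu).T @ (Xs - mu)
        d = (mu - overall_mean)[:, None]
        Sb += len(Xs) * (d @ d.T)
    # Solve Sb v = lambda Sw v via the Cholesky factor of Sw (Sw assumed positive definite).
    L_inv = np.linalg.inv(np.linalg.cholesky(Sw))
    eigval, eigvec = np.linalg.eigh(L_inv @ Sb @ L_inv.T)
    order = np.argsort(eigval)[::-1][:n_components]
    return L_inv.T @ eigvec[:, order]
```

At most (number of speakers minus one) directions carry inter-speaker information, so n_components would normally be chosen no larger than that.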
An important further refinement commonly carried out is length normalization, which transforms the i-vector distribution to an (almost) Gaussian distribution that is more straightforward to model. Probabilistic LDA is a generative model that uses a factor-analysis approach to model separately factors that account for the inter-speaker and intra-speaker variation. Many variants of PLDA, in the context of the i-vector approach, have been proposed in prior art.
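Length normalization as described above amounts to scaling each i-vector to unit Euclidean norm. A minimal sketch follows; the function name length_normalize and the eps guard against zero-length vectors are illustrative assumptions.

```python
import numpy as np

def length_normalize(X, eps=1e-12):
    """Scale every row of X (n_samples, dim) to unit Euclidean length."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    # eps avoids division by zero for an all-zero vector (an assumption of this sketch).
    return X / np.maximum(norms, eps)
```

In a cascade, this step would typically be applied to i-vectors that have already been centered and whitened.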
Another well-known method is within-class covariance normalization (WCCN), which uses the inverse of the within-class covariance matrix to normalize the linear kernel in an SVM classifier. It is typical in i-vector modeling to use multiple techniques in cascade: for example to ensure the Gaussian assumption for PLDA, it is not uncommon to carry out whitening followed by length normalization before the PLDA stage.
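A minimal sketch of WCCN is given below, assuming the within-class covariance W is estimated as the average of the per-speaker covariances and that the projection B is obtained from the Cholesky factorization of the inverse of W, so that B B^T equals the inverse of W. The function names are hypothetical and the estimator is one common choice, not necessarily the one used in any particular prior-art system.

```python
import numpy as np

def within_class_covariance(X, labels):
    """Average of per-speaker covariance matrices for X (n_samples, dim)."""
    speakers = np.unique(labels)
    dim = X.shape[1]
    W = np.zeros((dim, dim))
    for spk in speakers:
        Xs = X[labels == spk]
        D = Xs - Xs.mean(axis=0)
        W += D.T @ D / len(Xs)
    return W / len(speakers)

def wccn_transform(X, labels):
    """Return B with B @ B.T == inv(W); assumes W is positive definite."""
    W = within_class_covariance(X, labels)
    return np.linalg.cholesky(np.linalg.inv(W))

def wccn_kernel(x, y, B):
    """Linear kernel normalized by the inverse within-class covariance: x^T inv(W) y."""
    return (B.T @ x) @ (B.T @ y)
```

Projecting the training vectors with X @ B yields data whose within-class covariance, estimated the same way, is the identity matrix.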
Because channel variation and source variation both reduce the ability to discriminate between speakers, it is not surprising that the methods proposed to combat channel variation and source variation resemble one another.
When i-vectors are extracted from a heterogeneous dataset, not only will they capture both speaker and channel variability, but also source variation. If this source variation is not dealt with, it will adversely affect speaker recognition performance. The notion of source variation is related to the speech acquisition method (e.g., telephone versus microphone channel types) and recording scenario (e.g., telephone conversation versus interview styles). The various combinations of styles and channel types (e.g., interview speech recorded over microphone channel) form a heterogeneous dataset consisting of relatively homogeneous subsets. In this work, the dataset consists of telephone, microphone (telephone conversation recorded over microphone channel), and interview subsets, or sources.
There have been several proposals to address the issue of source variation within the context of total variability modeling. A phenomenon commonly seen in heterogeneous datasets is the fact that not all sources are equally abundant and most speakers appear in only one of the sources. In the context of LDA, the source variation will be strongly represented and seen as part of the inter-speaker variability and will therefore be optimized in the resulting LDA transform. One proposal to address this issue is to determine a suitable inter-speaker scatter matrix.
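The text above does not state how a suitable inter-speaker scatter matrix is determined. Purely as an illustration, the sketch below assumes one possible option: measuring each speaker mean against the mean of its own source rather than against the global mean, so that offsets between sources do not count as inter-speaker variability. The function name between_speaker_scatter and the sources argument are hypothetical.

```python
import numpy as np

def between_speaker_scatter(X, speakers, sources=None):
    """Inter-speaker scatter of X (n_samples, dim).

    With sources=None, speaker means are measured against the global mean.
    With per-sample source labels, each speaker mean is measured against the
    mean of its own source (an assumed form of source normalization).
    """
    dim = X.shape[1]
    if sources is None:
        sources = np.zeros(len(X), dtype=int)  # treat all data as one source
    Sb = np.zeros((dim, dim))
    for src in np.unique(sources):
        in_src = sources == src
        ref_mean = X[in_src].mean(axis=0)
        for spk in np.unique(speakers[in_src]):
            Xs = X[in_src & (speakers == spk)]
            d = (Xs.mean(axis=0) - ref_mean)[:, None]
            Sb += len(Xs) * (d @ d.T)
    return Sb
```

When most speakers appear in only one source, the globally referenced scatter additionally contains the scatter of the source means, which is the component that would otherwise be optimized by the LDA transform.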
For training of the total variability matrix itself, one of the simplest approaches, albeit a rather crude one, is to pool all the training data into a heterogeneous set without distinguishing between microphone and telephone data. A more structured proposal suggests training a supplementary matrix for the microphone subset on top of an already trained total variability matrix on telephone data.
I-vectors are then extracted using a total variability matrix that is formed by concatenating these two matrices. An interesting observation seen with this approach is that the microphone data resides in the combined space defined by the matrix concatenation, whereas the telephone data only resides in the telephone space.
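The concatenation can be sketched as follows, assuming the standard posterior-mean formula for the i-vector, w = (I + T^T Sigma^-1 N T)^-1 T^T Sigma^-1 f, with diagonal covariances. The function names, the argument layout (statistics already expanded to supervector size) and the toy dimensions are assumptions for illustration, not details from the cited proposal.

```python
import numpy as np

def concatenate_subspaces(T_tel, T_mic):
    """Stack telephone and microphone matrices column-wise: (CF, R_tel + R_mic)."""
    return np.hstack([T_tel, T_mic])

def extract_ivector(T, occupancy, first_order, sigma_diag):
    """Posterior mean of the latent vector w for one utterance.

    T           : (CF, R) total variability matrix
    occupancy   : (CF,) zeroth-order statistics expanded to supervector size
    first_order : (CF,) centered first-order statistics
    sigma_diag  : (CF,) diagonal of the UBM covariance supervector
    """
    R = T.shape[1]
    T_scaled = T / sigma_diag[:, None]                      # Sigma^-1 T
    precision = np.eye(R) + T_scaled.T @ (occupancy[:, None] * T)
    return np.linalg.solve(precision, T_scaled.T @ first_order)
```

With this layout, the first R_tel coordinates of the extracted i-vector correspond to the telephone subspace and the remaining R_mic coordinates to the supplementary microphone subspace.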
In total variability modeling, a non-informative prior is assumed for the speaker, channel and total variability latent variables, since there is no gain in generality in using an informative prior. This assertion holds at least when a homogeneous dataset is concerned. The notion of informative priors to encode domain knowledge is not a new concept and has been used in machine learning applications before. In the context of continuous speech recognition, informative priors have also been used in the case of sparse data to improve generalization of an infinite structured SVM model.