A methodology for discovering the relationships among different types of object has been proposed, known as Probabilistic Latent Semantic Analysis (PLSA). PLSA is a flexible latent class statistical mixture model that is useful in various types of applications, such as information retrieval, web mining, collaborative filtering, and co-citation analysis.
The PLSA model is typically trained using conventional expectation-maximization algorithms. However, both the storage and computational costs of such training becomes expensive as the training data set becomes large. The training time and storage space requirements are each proportional to the training data set size and the number of latent classes. Accordingly, this makes PLSA unsuitable for use with large amounts of training data.
For instance, web mining and personalized web searches typically involve an environment having an exceedingly large training data set. Where there are, for example, one hundred million collected records and ten thousand latent classes (which is not unusual in this context), the computation cost for each iteration of an expectation-maximization algorithm can be as high as 10e12. Indeed, when the number of users in such an environment is one million (a small number of users in the context of the Internet), the storage requirements for storing the PLSA model parameters will be 10e10, which is simply not practical, especially since the parameters would typically need to be stored in main memory to be useful. It can be seen, therefore, how the standard PLSA training algorithm does not work well for larger data sets.