The healthcare industry maintains a wide variety of records spanning a person's life, from birth certificate to death certificate. Such records may include, but are not limited to, medical diagnostic records, medical insurance records, and hospital data. This record data may be used to generate a mathematical model capable of identifying or predicting information such as, but not limited to, a health condition of a patient or health insurance fraud. In order to generate the mathematical model, one or more patterns need to be identified in the record data.
Data mining techniques enable determination of one or more patterns in the record data. Such patterns may be used to determine clusters in the record data. Clustering is a process of grouping a set of records in the record data based on predefined characteristics associated with the set of records. Commonly known clustering approaches include, but are not limited to, centroid-based clustering (e.g., k-means clustering), density-based clustering, and distribution-based clustering (e.g., Gaussian mixture models).
A Gaussian mixture model is a clustering technique that assumes that the record data includes one or more components, or clusters, and that the data in each cluster is normally distributed (i.e., follows a Gaussian distribution). In order to train the Gaussian mixture model, an input specifying the number of clusters present in the record data is received from a user. Parameters of the distribution for each cluster, such as the mean and the covariance, can be estimated using the expectation-maximization algorithm. In an embodiment, the expectation-maximization algorithm includes determining, for each data point or record, a likelihood that the data point or record corresponds to each cluster, and then updating the parameters of each distribution based on the determined likelihoods. These two steps are repeated so that the likelihood of the record data is maximized, and the parameters of the distributions that lead to the maximized likelihood are selected. The selected parameters are utilized to generate the Gaussian mixture model.
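The expectation-maximization procedure described above can be sketched for one-dimensional data as follows. This is a minimal illustration under stated assumptions, not the implementation of any particular embodiment: the function names `gaussian_pdf` and `fit_gmm_1d`, the quantile-based initialization, the fixed iteration count, and the variance floor are all hypothetical choices made for the example. The number of clusters `k` corresponds to the input received from the user.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(data, k, iters=50):
    """Estimate a k-component 1-D Gaussian mixture with EM (illustrative sketch)."""
    n = len(data)
    ordered = sorted(data)
    # Initialization (an assumption of this sketch): means at evenly spaced
    # quantiles, a shared variance equal to the overall variance, equal weights.
    means = [ordered[int((i + 0.5) * n / k)] for i in range(k)]
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    variances = [overall_var] * k
    weights = [1.0 / k] * k
    log_likelihoods = []

    for _ in range(iters):
        # E-step: likelihood that each record belongs to each cluster.
        resp = []
        log_likelihood = 0.0
        for x in data:
            joint = [w * gaussian_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            log_likelihood += math.log(total)
            resp.append([j / total for j in joint])
        log_likelihoods.append(log_likelihood)

        # M-step: re-estimate weight, mean, and variance of each cluster.
        for c in range(k):
            n_c = max(sum(r[c] for r in resp), 1e-12)  # guard against empty cluster
            weights[c] = n_c / n
            means[c] = sum(r[c] * x for r, x in zip(resp, data)) / n_c
            var_c = sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, data)) / n_c
            variances[c] = max(var_c, 1e-6)  # floor keeps the density well defined

    return weights, means, variances, log_likelihoods
```

Calling `fit_gmm_1d(values, 2)` on a list of numeric record values returns the estimated mixing weights, means, and variances, together with the log-likelihood recorded at each iteration, which should not decrease from one iteration to the next. Multivariate record data would require full covariance matrices in place of the scalar variances used here.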
Because the data in each cluster is assumed to be normally distributed, Gaussian mixture models are ill-suited to scenarios in which the data within a cluster is not normally distributed.
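The effect of the normality assumption can be illustrated with a small sketch. The example below is hypothetical: it draws strictly non-negative, skewed values from an exponential distribution (standing in for, say, a cost-like field in the record data), fits a single Gaussian to them, and compares the share of negative values the fitted Gaussian implies with the share actually observed. The function name `negative_mass_mismatch` and the sample size are assumptions made for the illustration.

```python
import random
from statistics import NormalDist


def negative_mass_mismatch(n=5000, seed=0):
    """Fit one Gaussian to skewed, non-negative data and compare tail mass."""
    rng = random.Random(seed)
    # Hypothetical skewed records: exponential values, never below zero.
    samples = [rng.expovariate(1.0) for _ in range(n)]

    # Single Gaussian fitted from the sample mean and standard deviation.
    fitted = NormalDist.from_samples(samples)

    # Probability the fitted Gaussian assigns to values below zero,
    # versus the fraction of records that are actually below zero.
    predicted_negative = fitted.cdf(0.0)
    observed_negative = sum(1 for x in samples if x < 0) / n
    return predicted_negative, observed_negative
```

The fitted Gaussian assigns a substantial probability to negative values even though no record is negative, which is one way the mismatch between the assumed and the actual distribution shows up.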