The present invention relates broadly to classifying or partitioning data, and more particularly relates to robust L1-based distributional clustering of multinomial distributions, and a workforce analysis method and workforce management system that implements the L1-based distributional clustering to support human resources allocation based on the clustering.
Clustering is a technique for the classification or partitioning of a data set into different subsets or clusters so that the data in each subset shares some common trait, such as proximity with respect to some distance measure. Distance measures, for example, the L1 distance used in L1-based clustering, determine the similarity of two elements, which influences the shape of the clusters. Data clustering is a common technique for statistical data analysis, used, for example, for human resources allocation; data clustering algorithms are known to be either hierarchical or partitional.
To solve the clustering problem, an input and an output must first be assumed. The assumed input is $x^{(i)} \in \mathbb{R}^D$, a D-dimensional real-valued vector representing the i-th fraction data (or multinomial distribution). $x_d^{(i)}$ is the d-th dimension value, and must be equal to or greater than 0. The i-th fraction data must satisfy a fraction constraint
$$\sum_{d} x_d^{(i)} = 1.$$
The output is $\xi^{(i)} \in \mathbb{R}^D$, the i-th cluster center, which is likewise a D-dimensional real-valued vector. $\xi_d^{(i)}$ is the d-th dimension value, and must be equal to or greater than 0. The i-th cluster center must satisfy a fraction constraint
$$\sum_{d} \xi_d^{(i)} = 1.$$
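The two constraints stated above (non-negativity in every dimension and a sum of 1) can be checked mechanically. The following is a minimal sketch, assuming a fraction vector is held as a plain Python list of floats; the function name `is_fraction_vector` and the tolerance `tol` are illustrative assumptions and are not taken from the described method.

```python
import math

def is_fraction_vector(x, tol=1e-9):
    """Return True if x is non-negative in every dimension and sums to 1."""
    # Every dimension value must be equal to or greater than 0.
    if any(x_d < 0.0 for x_d in x):
        return False
    # The dimension values must add up to 1, within a floating-point tolerance.
    return math.isclose(math.fsum(x), 1.0, abs_tol=tol)
```

The same check applies to an input vector and to a cluster center, since both are subject to the same fraction constraint.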
One example of a clustering problem with fraction data for human resources allocation is referred to herein as the “staffing template problem.” A known solution to this problem will now be described in order to provide a background for the novel L1-based distributional clustering of multinomial distributions of the invention. The staffing template problem expresses characteristics of a staffing project from the viewpoint of human resources. From the perspective of the staffing template problem, a forecast of the human resources required in a service contract (or staffing project) can be performed. The staffing template problem expresses the resource allocation type of several typical projects, and represents the fractions of skills and roles required in the entire project. Because it is expressed as a fraction, or multinomial distribution, the staffing template problem is expressed in such a form that all elements (fractions), when added together, equal 1.
Similar clustering problems with fraction data, including the exemplary staffing template problem, are known to be solved by first assuming that N data are available, each of which represents a fraction, or a single trial, as a multinomial distribution. That is, each of the N data is defined by a multinomial distribution, which represents the probability distribution of the number of occurrences of each of D possible outcomes in “n” independent trials that have the same outcome probabilities on each trial (the number of successes in “n” independent Bernoulli trials is the special case of two outcomes). To solve the clustering problem, a set of C multinomial distributions, or C fractions, representing the entire set of multinomial data is required.
Each staffing template problem expresses one project type, for example, the “development of a package” or a “business transformation.” In order to configure a staffing template problem from actual project data, for example, from the allocation of skills and roles input in each project, the following conditions must be met. That is, for example, a value of a first dimension is required to express the fraction of hours for which an architect works on the project, and a value of a second dimension is required to express the fraction of hours for which an application developer works on the project. The project data (which is fraction data) is then clustered, and the centers of the obtained clusters are identified and used as the staffing templates.
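The conversion from hours per role to fraction data can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `hours_to_fractions`, the third role “project manager,” and the hour counts are hypothetical and do not come from actual project data.

```python
def hours_to_fractions(hours_by_role, roles):
    """Convert hours worked per role into a fraction vector ordered by `roles`."""
    # A role with no recorded hours contributes 0 hours.
    hours = [float(hours_by_role.get(role, 0.0)) for role in roles]
    total = sum(hours)
    if total <= 0.0:
        raise ValueError("project has no recorded hours")
    # Dividing by the total makes the dimension values add up to 1.
    return [h / total for h in hours]

# Hypothetical role list: the first two roles are named in the text above,
# the third is invented for illustration.
ROLES = ["architect", "application developer", "project manager"]
# Hypothetical hours for one project.
project = {"architect": 120, "application developer": 360}
fractions = hours_to_fractions(project, ROLES)
```

Here the first dimension is the architect's fraction of hours and the second is the application developer's. A role that does not appear in a project yields a dimension value of exactly 0.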
The above-described staffing template problem, and the known techniques for solving such problems, however, are not without shortcomings. The Dirichlet distribution is the most natural probability distribution for use in generating a multinomial distribution. Accordingly, a model-based clustering method that utilizes a mixture of Dirichlet distributions represents the most natural solution. However, as is found by a review of the estimation approach in Minka, ESTIMATING A DIRICHLET DISTRIBUTION, Technical Report (2003), when performing actual model estimations, if a “0 entry” (a pair of d and i for which $x_d^{(i)} = 0$) exists, an extreme instability in numerical calculation is caused because the calculations include the use of $\log x_d^{(i)}$.
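The effect of a 0 entry can be illustrated with the per-dimension averages of log x_d, which are the quantities through which the data enters a Dirichlet likelihood. The following is a minimal sketch, not an implementation of the cited estimation procedure; the function name `mean_log_fractions` and the sample data are hypothetical.

```python
import math

def mean_log_fractions(data):
    """Per-dimension average of log x_d over all fraction vectors in data."""
    num_dims = len(data[0])
    stats = []
    for d in range(num_dims):
        total = 0.0
        for x in data:
            # log(0) is undefined and math.log(0.0) raises ValueError, so the
            # limiting value -inf is substituted explicitly for a 0 entry.
            total += math.log(x[d]) if x[d] > 0.0 else -math.inf
        stats.append(total / len(data))
    return stats

# Hypothetical fraction data; the first vector has a 0 entry in dimension 3.
data = [[0.25, 0.75, 0.0], [0.5, 0.25, 0.25]]
stats = mean_log_fractions(data)
```

A single 0 entry is enough to drive the average for its dimension to negative infinity, regardless of how many other data are available, which is the source of the numerical instability noted above.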
Clustering based on the KL distance is found in Duda, et al., Pattern Classification, Wiley-Interscience (2000), which also describes clustering that is based on the Dirichlet distribution. The reference describes both hierarchical clustering and model-based clustering using degrees of similarity among probability distributions (such as the KL distance), and provides information pertaining to the KL distance. However, for such distances, the 0 entry problem, and the possible instability resulting from it, still occur.
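The 0 entry problem for the KL distance can be seen in a small example, with the L1 distance shown for contrast. This is a minimal sketch under the usual convention that a term with p_d = 0 contributes 0; the function names `kl_distance` and `l1_distance` and the two fraction vectors are hypothetical, and the sketch is not the clustering method of the invention.

```python
import math

def kl_distance(p, q):
    """KL distance sum_d p_d * log(p_d / q_d), with 0 * log(0 / q_d) taken as 0."""
    total = 0.0
    for p_d, q_d in zip(p, q):
        if p_d == 0.0:
            continue  # a term with p_d = 0 contributes nothing
        if q_d == 0.0:
            return math.inf  # p_d > 0 against a 0 entry: the distance diverges
        total += p_d * math.log(p_d / q_d)
    return total

def l1_distance(p, q):
    """L1 distance sum_d |p_d - q_d|; it never exceeds 2 for fraction vectors."""
    return sum(abs(p_d - q_d) for p_d, q_d in zip(p, q))

# Hypothetical fraction vectors; q has a 0 entry in dimension 3.
p = [0.5, 0.25, 0.25]
q = [0.25, 0.75, 0.0]
kl_pq = kl_distance(p, q)  # infinite, because q has a 0 entry where p does not
kl_qp = kl_distance(q, p)  # finite, which also shows the KL distance is asymmetric
l1_pq = l1_distance(p, q)  # finite and symmetric
```

The L1 distance involves no logarithm or division, so a 0 entry is treated like any other value.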
Further, the resource allocation data of a project, and the other previously mentioned factors, contain a large amount of noise. Hence, the above assumption that all projects are divided into C types is not necessarily 100% correct. Accordingly, to perform clustering under circumstances in which there is a large amount of noise and uncertainty, a robust distributional clustering is required. And in order to perform clustering on a regular basis, or to perform clustering interactively on data stored in a huge database (which may be updated daily), a clustering method that processes large data efficiently is a desirable goal. That is, what would be desirable in the field of solving clustering problems such as the staffing template problem is an L1-based distributional clustering method and system for fraction data (multinomial distributions) that is both “robust” and “effective,” and that can appropriately treat a 0 entry.