The present invention relates to a method of classifying data and in particular to a method of classifying online user-generated data.
Web-based forums provide a space for online users to share information and seek help from each other. Suppliers of products and services often provide forums to allow support staff to assist users. It is conventional to respond to a user query or post by finding the most similar or relevant information among the existing posts in online forums. Typical user submissions to online forums are very complex and often contain multiple concepts. Thus, the information that has the highest overall similarity to a forum submission may not be the best, or the only, solution to a user's query. It also may not address every aspect of the user's interests. This means that conventional approaches may fail to deliver useful information that would interest a user but does not have a high overall similarity value.
According to a first aspect of the present invention there is provided a method of analysing a plurality of online posts, the method comprising the steps of: a) extracting a list of keywords from each of the plurality of posts; b) generating one or more keyword clusters based on the keywords extracted from each of the plurality of posts; and c) classifying new posts in accordance with the one or more keyword clusters. The method may comprise the further step of: d) allocating a new post to a community in accordance with the result of step c). Also, the method may comprise the yet further step of: e) sending a message to one or more service agents, the one or more service agents being associated with the community to which the post is allocated in step d). Alternatively, or in addition, the method may comprise the further step of: f) sending a message to one or more users, the one or more users being associated with the community to which the post is allocated in step d).
Step a) may comprise the determination of a term frequency-inverse document frequency weighting for one or more potential keywords. The method may comprise the further step of i) forming one or more keyword associations based on the list of keywords extracted in step a), step i) being carried out after step a) and before step b). The keyword associations may be formed in accordance with the correlation of co-occurring keyword pairs. A clique percolation model is applied to determine one or more keyword clusters. In step c), a new post may be classified as belonging to a keyword cluster if a similarity function exceeds a predetermined threshold. Preferably, in step c) the similarity function is a cosine similarity function.
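By way of illustration only, the term frequency-inverse document frequency weighting mentioned for step a) might be sketched as follows. The function names, the simple tokeniser, the `top_n` parameter and the particular TF-IDF formula (relative term frequency multiplied by the natural logarithm of the inverse document frequency) are assumptions made for this example; the description does not prescribe any of them.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokeniser: lowercase alphanumeric tokens only.
    # A practical system would probably also remove stop words.
    return re.findall(r"[a-z0-9]+", text.lower())


def extract_keywords(posts, top_n=3):
    """Return, for each post, its top_n terms ranked by TF-IDF weight."""
    docs = [tokenize(p) for p in posts]
    n_docs = len(docs)
    # Document frequency: the number of posts in which each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    keywords = []
    for doc in docs:
        tf = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Highest weight first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords.append(ranked[:top_n])
    return keywords


posts = [
    "router router keeps dropping wifi",
    "email password reset email",
    "wifi password reset",
]
keywords = extract_keywords(posts, top_n=1)
```

In this sketch a term that is frequent within one post but rare across the other posts receives the highest weight, so "router" and "email" are selected for the first two example posts.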
According to a second aspect of the present invention there is provided a tangible computer readable medium comprising computer executable code for performing any of the methods as described above.
According to a third aspect of the present invention there is provided an apparatus comprising a processing unit, volatile and/or non-volatile memory and one or more data storage devices, the apparatus being configured, in use, to perform any of the methods as described above.
In the present invention, users' multiple interests are acknowledged and useful information from different categories is searched in order to satisfy user requirements from every related aspect. The present invention proposes a novel approach to managing and organising online forums based on overlapping communities, a property which is increasingly recognised in several types of natural networks (see MM Luscombe et al “Genomic analysis of regulatory network dynamics reveals large topological changes”, Nature, 431:308-312, 2004 & S. Wuchty and E Almaas, “Peeling the yeast protein network”, Proteomics, 5:444-449, 2005). Inspired by these natural systems, a forum is defined for the purpose of the present application as a complex network, in which all entities (keywords, posts and users) may belong to multiple categories underpinned by the interactive relationships of the entities.
It is known to form communities on the web, for example for e-learning systems (e.g. S Seufert et al “A reference model for online learning communities”, International Journal on E-Learning, January-March, 2002, pp. 43-55) or in distributed peer-to-peer systems (e.g. M Khambatti, et al, “Structuring Peer-to-Peer Networks using Interest-Based Communities”, Databases, Information Systems, and Peer-to-Peer Computing: First International Workshop, DBISP2P, Berlin, Germany, pp. 48-63 2004). These approaches, however, have a common shortcoming: a component belongs only to the single community to which it has the maximum similarity. They do not take the multiple interests of users or resources into consideration, so the resulting clustering is usually inaccurate and noisy.
F. Wang, “Multi-interest communities and community-based recommendations”, 3rd International Conference on Web Information Systems and Technologies, 3-6 Mar., 2007, Barcelona, disclosed the use of multi-interest communities to cluster movie data for movie users. This method worked well on data sets, such as movies, which have clear and well defined genres, but may not be able to deal with unstructured, complex and noisy data such as those in an online forum.
The present invention extracts and processes the essential features of resources (the keywords of posts) by carefully analysing the resources (the posts in forums). The keywords obtained are then used to construct a complex graph according to keyword correlations, and overlapping clusters are identified in that graph by using the Clique Percolation Method. The invention therefore automatically generates communities based on the social properties of resources and needs no explicit or implicit constraints as to the number, size, shape or disjoint characteristics of target clusters, such as those required by many other clustering methods.
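By way of illustration only, the construction of a keyword graph and the identification of overlapping clusters by clique percolation might be sketched as follows. This example makes several assumptions that are not taken from the description: keywords are linked by a raw co-occurrence count against a hypothetical `min_cooccurrence` threshold rather than by a correlation measure, the clique size `k` is set to 3, and all k-cliques are enumerated by brute force, which is only practical for small graphs. Two k-cliques are treated as adjacent when they share k-1 keywords, and each cluster is the union of a chain of adjacent k-cliques.

```python
from itertools import combinations


def build_graph(keyword_lists, min_cooccurrence=1):
    """Link two keywords that co-occur in at least min_cooccurrence posts."""
    counts = {}
    for kws in keyword_lists:
        for a, b in combinations(sorted(set(kws)), 2):
            counts[(a, b)] = counts.get((a, b), 0) + 1
    graph = {}
    for (a, b), c in counts.items():
        if c >= min_cooccurrence:
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    return graph


def k_clique_communities(graph, k=3):
    """Return the unions of k-cliques connected through shared k-1 nodes."""
    nodes = sorted(graph)
    # Brute-force enumeration of every fully connected set of k keywords.
    cliques = [
        frozenset(c)
        for c in combinations(nodes, k)
        if all(v in graph[u] for u, v in combinations(c, 2))
    ]
    # Union-find over cliques: merge cliques that share k-1 keywords.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)
    groups = {}
    for i, clique in enumerate(cliques):
        groups.setdefault(find(i), set()).update(clique)
    return sorted(groups.values(), key=sorted)


keyword_lists = [
    ["wifi", "router", "signal"],
    ["router", "signal", "wifi"],
    ["signal", "password", "email"],
    ["password", "email", "signal"],
]
graph = build_graph(keyword_lists, min_cooccurrence=2)
clusters = k_clique_communities(graph, k=3)
```

In this example the keyword "signal" belongs to both resulting clusters, which shows how a single keyword can sit in more than one cluster without the number or size of clusters being fixed in advance.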
The core keyword clusters formed then absorb relevant posts and users to constitute overlapping communities. Furthermore, the communities are extended to incorporate other pertinent entities (keywords, posts and users) so as to widen the coverage of communities in the forum. The formed overlapping communities provide the foundation to support various services on the forum, such as recommendation, alerting and profiling of customer agents.
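By way of illustration only, the absorption of posts into overlapping communities might be sketched with a cosine similarity function and a threshold, as follows. The binary keyword vectors, the community names and the threshold value of 0.3 are assumptions made for this example and are not taken from the description.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_communities(post_keywords, clusters, threshold=0.3):
    """Return every community whose keyword cluster is similar enough.

    A post may be returned for several communities, so the resulting
    communities can overlap.
    """
    post_vec = Counter(post_keywords)
    matches = []
    for name, cluster in clusters.items():
        cluster_vec = Counter(cluster)
        if cosine_similarity(post_vec, cluster_vec) > threshold:
            matches.append(name)
    return sorted(matches)


# Hypothetical keyword clusters, each labelled with a community name.
clusters = {
    "network": {"wifi", "router", "signal"},
    "account": {"email", "password", "signal"},
}
memberships = assign_communities(["wifi", "signal", "password"], clusters)
```

Here the example post exceeds the threshold for both clusters and is therefore allocated to both communities, whereas a post sharing no keywords with any cluster would be allocated to none.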