This invention relates to the field of tagging documents with folksonomies. In particular, the invention relates to generating a taxonomy for documents from tag data.
As the number of resources of information available in all formats grows at an increasing rate, retrievability of useful information becomes an ever more significant issue. In the vast majority of cases the information that the user requires exists, but the user has difficulty retrieving the required information.
Content-based search is an imprecise method, so owners of information content look for ways to organize their information to facilitate retrieval by subject. A common solution to this problem is the use of classification methods. Classification can be formal, based on a controlled vocabulary, usually a taxonomy; or it can be informal and evolve as the result of social tagging. Although practitioners often refer to the potential for synergy between the two approaches, practical suggestions for combining the two vocabularies are very rare.
A taxonomy provides a consistent and unambiguous structure, whereas social tagging enables users to choose terms that have meaning for them. The collection of tags is commonly known as a folksonomy. Comparing the two approaches, a taxonomy has formal controlled keywords in a hierarchy, whereas a folksonomy is a flat namespace built by the end-users choosing words which have meaning to them. Therein lies the difficulty: to exploit the synergy, an effective translation between the two models must be found.
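The structural difference between the two models can be sketched in a few lines of Python. This is an illustration only: the keywords, tags, document identifiers, and function names below are hypothetical and do not come from any particular system. The taxonomy is represented as controlled keywords with parent links, and the folksonomy as flat, free-form tags attached to documents.

```python
from collections import Counter

# Hypothetical taxonomy: each controlled keyword maps to its parent
# (None marks the root of the hierarchy).
TAXONOMY_PARENT = {
    "Animals": None,
    "Mammals": "Animals",
    "Carnivores": "Mammals",
    "Rodents": "Mammals",
}

# Hypothetical folksonomy: free-form tags that users attached to documents.
# The tags form a flat namespace with no parent/child links.
FOLKSONOMY = {
    "doc-1": ["wolf", "wildlife", "canada"],
    "doc-2": ["beaver", "wildlife"],
}


def path_to_root(term, parent_of):
    """List the term and its ancestors, most specific first."""
    path = []
    while term is not None:
        path.append(term)
        term = parent_of[term]
    return path


def tag_counts(folksonomy):
    """Count how often each tag was applied across all documents."""
    return Counter(tag for tags in folksonomy.values() for tag in tags)


print(path_to_root("Carnivores", TAXONOMY_PARENT))
print(tag_counts(FOLKSONOMY).most_common(1))
```

In the taxonomy every keyword has a defined position, so a path to the root can always be computed. The folksonomy records only which tags were applied and how often; any relationship between tags has to be inferred.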
The formal taxonomy approach has advantages for precise retrieval; however, there are a number of problems in practice. It is a time-consuming manual process that requires subject matter experts to classify documents using a pre-defined, rigid vocabulary. This vocabulary must be agreed in advance and is then inflexible: a change either takes a long time, because the new taxonomy must first be agreed, or requires migration for those affected by the change. A further problem with a formal taxonomy is that the vocabulary is decided by the content owners, and may not match the vocabulary of the content user community. Documents may be classified using terms different from those that users employ, thus hindering retrieval.
As an example, a content owner may classify the content using scientific terminology, such as the Latin species names:
Canis lupus 
Castor canadensis 
Felis rufus 
Microsorex hoyi 
Taxidea taxus 
Ursus arctos 
Vulpes vulpes 
However, the users of the content may not be familiar with the chosen classification, and instead use the following common names:
Grey wolf
Beaver
Bobcat
Pygmy shrew
Badger
Grizzly bear
Red fox
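The effect of this vocabulary mismatch on retrieval can be illustrated with a minimal Python sketch. The pairing of Latin and common names follows the two lists above; the document identifiers, the index, and the function names are hypothetical and serve only to illustrate the problem.

```python
# Pairing of the owner's Latin terms with the users' common names,
# taken from the two lists above (same order).
LATIN_TO_COMMON = {
    "Canis lupus": "Grey wolf",
    "Castor canadensis": "Beaver",
    "Felis rufus": "Bobcat",
    "Microsorex hoyi": "Pygmy shrew",
    "Taxidea taxus": "Badger",
    "Ursus arctos": "Grizzly bear",
    "Vulpes vulpes": "Red fox",
}

# Hypothetical documents classified by the content owner (ids are invented).
OWNER_INDEX = {
    "doc-1": {"Canis lupus", "Vulpes vulpes"},
    "doc-2": {"Ursus arctos"},
    "doc-3": {"Castor canadensis", "Taxidea taxus"},
}


def search(index, term):
    """Return ids of documents classified under exactly this term."""
    wanted = term.casefold()
    return sorted(
        doc for doc, terms in index.items()
        if wanted in {t.casefold() for t in terms}
    )


def search_with_translation(index, term, latin_to_common):
    """Translate a common name to the owner's term, if known, then search."""
    common_to_latin = {c.casefold(): l for l, c in latin_to_common.items()}
    owner_term = common_to_latin.get(term.casefold(), term)
    return search(index, owner_term)


print(search(OWNER_INDEX, "Grey wolf"))
print(search_with_translation(OWNER_INDEX, "Grey wolf", LATIN_TO_COMMON))
```

A search using the users' term finds nothing, although a relevant document exists; the same search succeeds only once a translation between the two vocabularies is available.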
A folksonomy has the advantage of being dynamic and of using the language of the community of content users. Tags are in a sense self-defining in the context of the community of users. The ability to determine the relationships between tags on content provides opportunities both to present the content using a structure that is meaningful for the user community, and to make any formal structure of the information more relevant, based on the feedback that the user community provides through social tags.
Manual assessment of tags may be made to create a basic taxonomy. A subject matter expert may take the list of tags, decide which are more generic and which are more specific, and then organise these headings in a way that seems logical to that expert to create a taxonomy. If different people (or even the same person at different times) created the taxonomy in this manner, they would end up with different results.
Scaling any manual system to the large volumes of content or social tags found in an information centre or on a content hosting website such as Flickr (Flickr is a trade mark of Yahoo Inc.) is not viable. A large website or information set could easily generate over 10,000 tags, and having subject matter experts individually decide whether each tag is generic or specific, and assign the relationships between tags, is infeasible. As the method is ad hoc, there would be no consistency between the criteria that different experts used to make their decisions, so splitting the work would be unreliable.
An expert's intervention also misses the primary value of the tags, which is the information they provide about the users' understanding of the content, because it takes no account of community knowledge or preferences. The subject matter expert decides how the tags, and therefore the information, are related, rather than drawing on the collective knowledge and preferences of the community by analysing how the tags added by users are related.
Manual taxonomy creation also risks breaking the link between the tags and the documents. The emerging vocabulary of user community tags derives from both the information itself and the community that uses it.