This specification relates to organizing, filtering, and accessing content in a content management system.
Knowledge management systems have been implemented using different approaches to content classification and to viewing the data contained in the system. The methods developed for content categorization and visualization have included both symbolic knowledge representation and statistical machine learning techniques.
A symbolic knowledge representation is typically referred to as an ontology. In computer science, an ontology generally refers to a hierarchical knowledge structure that contains a vocabulary of terms and concepts for a specific knowledge domain, together with the relevant interrelationships between those terms and concepts, and that is commonly represented in a tree data structure. A traditional symbolic knowledge ontology is constructed by hand by one or more domain experts, who usually define the top-level categories that form the structure of the ontology and then manually fill in this structure. Human knowledge engineers also maintain and update this structure as new categories are created or discovered.
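By way of illustration only, the tree-structured ontology described above may be sketched as follows. The class name `Concept`, its methods, and the sample "Vehicles" domain are hypothetical and are not drawn from any particular system; the sketch assumes each concept has a single parent and records only the parent-to-child relationship.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Concept:
    """One node of the ontology: a term plus its narrower child concepts."""
    term: str
    children: Dict[str, "Concept"] = field(default_factory=dict)

    def add(self, term: str) -> "Concept":
        """Return the child concept named `term`, creating it if absent."""
        return self.children.setdefault(term, Concept(term))

    def path_to(self, term: str) -> Optional[List[str]]:
        """Depth-first search for `term`; return the path from this node, or None."""
        if self.term == term:
            return [self.term]
        for child in self.children.values():
            sub_path = child.path_to(term)
            if sub_path is not None:
                return [self.term] + sub_path
        return None


# Hypothetical hand-built domain: the expert defines the top-level
# categories first, then fills in narrower concepts beneath them.
root = Concept("Vehicles")
land = root.add("Land")
air = root.add("Air")
land.add("Car")
land.add("Bicycle")
air.add("Helicopter")

print(root.path_to("Bicycle"))  # ['Vehicles', 'Land', 'Bicycle']
```

Every node in such a structure is entered manually, which reflects the maintenance burden placed on the knowledge engineers.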
Another approach used in content management systems involves machine learning techniques. In computer science, machine learning typically refers to a class of algorithms that generally employ statistical and probabilistic analysis methods to learn information from designated sample data. In contrast with symbolic knowledge methods, machine learning methods represent knowledge in fuzzier and less precise ways. In a machine learning system, a set of training documents is identified for each category, and the system “learns” the relevant features (keywords and phrases) for each category. When a new document is presented to the system, the document's features are extracted and statistically compared to training document features previously extracted by the machine learning system. The result of the comparison is a set of categories and scores that best match the likely topics in the new document. This approach is scalable but can be very sensitive to the data in the document training set.
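By way of illustration only, the train-then-compare flow described above may be sketched as follows. The sketch substitutes a deliberately simple technique, word-count profiles compared by cosine similarity, for whatever statistical method a given system actually employs. The function names and the sample training documents are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def extract_features(text: str) -> Counter:
    """Lowercase word counts; a stand-in for keyword and phrase extraction."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def train(training_docs: Dict[str, List[str]]) -> Dict[str, Counter]:
    """Build one aggregate feature profile per category from its training documents."""
    profiles: Dict[str, Counter] = {}
    for category, docs in training_docs.items():
        profile: Counter = Counter()
        for doc in docs:
            profile.update(extract_features(doc))
        profiles[category] = profile
    return profiles


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text: str, profiles: Dict[str, Counter]) -> List[Tuple[str, float]]:
    """Score a new document against every category, best match first."""
    features = extract_features(text)
    scores = [(category, cosine(features, profile))
              for category, profile in profiles.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical training set: a few sample documents per category.
profiles = train({
    "sports": ["the team won the match", "the player scored a goal"],
    "finance": ["the bank raised interest rates", "stock markets fell sharply"],
})

ranked = classify("the striker scored a late goal in the match", profiles)
print(ranked)
```

With this training set, "sports" is ranked first for the sample document. Because each profile is built solely from the words in its sample documents, changing those documents changes the scores, which illustrates the sensitivity to the training set noted above.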