The present disclosure relates to content management systems, and in particular, to creating and using hierarchical knowledge structures.
Automatically classifying and categorizing content is an important problem in content management systems. The need to categorize content occurs in both consumer and enterprise related content workflows. Numerous methods have been developed to address this problem. These methods typically use either symbolic knowledge representation or statistical machine learning techniques.
A symbolic knowledge representation is typically referred to as an ontology. In computer science, an ontology generally refers to a hierarchical knowledge structure that contains a vocabulary of terms and concepts for a specific knowledge domain, such as Bioinformation, and contains relevant interrelationships between those terms and concepts. Symbolic knowledge generally refers to knowledge represented explicitly through precise domain vocabulary words (e.g., “Gene”) and their explicit relationship to other words (e.g., “has subtype” “recessive gene”).
A traditional symbolic knowledge ontology is typically constructed by hand by one or more domain experts (e.g., biologists), and such ontologies are often very detailed and precise, which can present difficulties in search and categorization applications. In a symbolic ontology, a team of human domain experts will typically define the top level categories which form the structure of the ontology, and then manually fill in this structure. Human knowledge engineers also maintain and update this structure as new categories are created or discovered.
For large symbolic ontologies, a tree structure of ontology nodes is frequently created and stored in a database. A database structure called an Adjacency List is normally used. The Adjacency List typically consists of pairs of nodes, each pair representing a parent-child connection between nodes.
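As an illustration only, the parent-child pairs of an Adjacency List can be sketched in Python as follows. The node names, the `ADJACENCY_LIST` constant, and the helper functions `build_children_index` and `descendants` are hypothetical and are not taken from any particular ontology product; a production system would typically hold the pairs in a database table rather than in memory.

```python
from collections import defaultdict

# Hypothetical (parent, child) pairs; each pair is one edge of the ontology tree.
ADJACENCY_LIST = [
    ("Gene", "Recessive gene"),
    ("Gene", "Dominant gene"),
    ("Recessive gene", "Example recessive gene A"),
]


def build_children_index(pairs):
    """Group the pairs by parent so the children of a node can be looked up directly."""
    children = defaultdict(list)
    for parent, child in pairs:
        children[parent].append(child)
    return children


def descendants(children, node):
    """Return every node below `node`, in depth-first order."""
    found = []
    # Reverse so that the first-listed child is popped (visited) first.
    stack = list(reversed(children.get(node, [])))
    while stack:
        current = stack.pop()
        found.append(current)
        stack.extend(reversed(children.get(current, [])))
    return found


if __name__ == "__main__":
    index = build_children_index(ADJACENCY_LIST)
    print(descendants(index, "Gene"))
```

Storing only pairs keeps insertion of a new category simple (one new pair), at the cost of requiring a traversal such as the one above to recover a whole subtree.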
Another approach used in content management systems involves machine learning techniques. In computer science, machine learning typically refers to a class of algorithms that generally employ statistical and probabilistic analysis methods to learn information from designated sample data (typically example documents also known as document “training sets”). In contrast with symbolic knowledge methods, machine learning methods represent knowledge in fuzzier and less precise ways, which can provide benefits in terms of scalability and ease of document classification.
In a machine learning system (which may or may not use an ontology) a set of training documents is identified for each category, and the system “learns” the relevant features (keywords and phrases) for each category. When a new document is presented to the system, the document's features are extracted and statistically compared to training document features previously extracted by the machine learning system. The result of the comparison is a set of categories and scores that best match the likely topics in the new document. This approach is scalable but can be very sensitive to the data in the document training set.
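The train-then-compare flow described above can be sketched in Python as follows. This is a minimal illustration that assumes word counts as features and cosine similarity as the statistical comparison; it is not the method of any specific existing system, and the function names (`extract_features`, `train`, `cosine`, `classify`) and the sample training sets are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical training sets: a few example documents per category.
TRAINING_SETS = {
    "genetics": ["gene dna recessive gene", "dominant gene inheritance"],
    "finance": ["stock market bond price", "market interest rate"],
}


def extract_features(text):
    """Use lowercase word counts as the document's features (an assumed, simple choice)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def train(training_sets):
    """Learn one aggregate feature profile per category from its training documents."""
    profiles = {}
    for category, documents in training_sets.items():
        profile = Counter()
        for document in documents:
            profile.update(extract_features(document))
        profiles[category] = profile
    return profiles


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def classify(profiles, text):
    """Return (category, score) pairs for a new document, best match first."""
    features = extract_features(text)
    scores = {category: cosine(features, profile) for category, profile in profiles.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    profiles = train(TRAINING_SETS)
    print(classify(profiles, "a recessive gene in the dna"))
```

Because each category profile is built entirely from its training documents, changing those documents changes the scores directly, which illustrates the sensitivity to the training set noted above.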
There are numerous ontology standards, building and editing tools, and ontology-based document classification systems. Existing ontology products build and edit symbolic ontologies, and various standards exist that describe the semantics of these ontologies. In particular, ANSI/NISO Z39.19 and W3C OWL-DAML using RDF (Resource Description Framework) are common methods for specifying symbolic ontologies. Existing ontology products include those from Ontology Works, Inc. (of Odenton, Md.), Semio Corporation (of San Mateo, Calif.), International Business Machines (IBM) Corporation (of Armonk, N.Y.), Oracle Corporation (of Redwood City, Calif.), Autonomy Corporation (of San Francisco, Calif.), ClearForest Corporation (of Waltham, Mass.), and Stratify, Inc. (of Mountain View, Calif.). In addition, existing classification systems use machine learning techniques such as Latent Semantic Indexing or Bayesian Networks.