Faceted classification is based on the principle that information has a multi-dimensional quality, and can be classified in many different ways. Subjects of an informational domain are subdivided into facets to represent this dimensionality. The attributes of the domain are related in facet hierarchies. The materials within the domain are then identified and classified based on these attributes.
FIG. 1 illustrates the general approach of faceted classification in the prior art, as it applies (for example) to the classification of wine.
Faceted classification is known as an analytico-synthetic method, as it involves processes of both analysis and synthesis. To devise a scheme for faceted classification, information domains are analyzed to determine their basic facets. The classification may then be synthesized (or built) by applying the attributes of these facets to the domain.
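The analysis and synthesis steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the method of any particular system: the facet names, the attribute values, and the helper names `FACET_HIERARCHIES`, `ancestors`, and `classify` are all assumptions chosen to match the wine example.

```python
# Analysis step (assumed output): each facet is a small hierarchy that maps
# an attribute to its parent attribute. None marks the top of a facet.
# All names and values here are hypothetical.
FACET_HIERARCHIES = {
    "region": {
        "Europe": None,
        "France": "Europe",
        "Bordeaux": "France",
        "Spain": "Europe",
        "Rioja": "Spain",
    },
    "color": {"red": None, "white": None},
}


def ancestors(facet, attribute):
    """Walk up one facet hierarchy and return the broader attributes."""
    path = []
    parent = FACET_HIERARCHIES[facet][attribute]
    while parent is not None:
        path.append(parent)
        parent = FACET_HIERARCHIES[facet][parent]
    return path


def classify(item):
    """Synthesis step: for every facet the scheme knows, return the
    attribute path from most specific to most general."""
    result = {}
    for facet, attribute in item.items():
        hierarchy = FACET_HIERARCHIES.get(facet)
        if hierarchy is not None and attribute in hierarchy:
            result[facet] = [attribute] + ancestors(facet, attribute)
    return result


# "vintage" is not a facet of this scheme, so it is ignored.
wine = {"region": "Rioja", "color": "red", "vintage": "2004"}
classification = classify(wine)
```

In this sketch the same item is described independently along each facet, which is what allows it to be reached through more than one hierarchy.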
Many scholars have identified faceted classification as an ideal method for organizing massive stores of information, such as those on the Internet. Faceted classification adapts well to rapidly changing, dynamic information.
Further, by subdividing subjects into facets, it provides for multiple and varied ways to access the information.
Yet despite the potential of faceted classification for addressing our classification needs, its adoption has been slow. Relative to the massive amount of information on the Internet, very few domains use faceted classification. Rather, its use has been segmented within specific vertical applications (such as e-commerce stores and libraries). It generally remains in the purview of scholars, professional classificationists, and information architects.
The barriers to adoption of faceted classification lie in its complexity. Faceted classification is a very labor-intensive and intellectually challenging endeavor. This complexity increases with the scale of the information. As the scale increases, the number of dimensions (or facets) compounds within the domain, making it increasingly difficult to organize.
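One way to make the compounding concrete, under the simplifying assumption that every facet contributes one attribute to a classification, is to count the distinct attribute combinations a scheme can express. The function name `combination_count` and the facet sizes below are hypothetical.

```python
import math


def combination_count(attributes_per_facet):
    """Number of distinct attribute combinations when one attribute is
    chosen from each facet (a simplifying assumption)."""
    return math.prod(attributes_per_facet)


# Three facets with five attributes each, then ten such facets.
small_scheme = combination_count([5, 5, 5])
large_scheme = combination_count([5] * 10)
```

Here `small_scheme` is 125 and `large_scheme` is 9,765,625: the count grows multiplicatively with each added facet, which is one reading of why manual organization becomes harder at scale.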
To help address this complexity, scholars have devised rules and guidelines for faceted classification. This body of scholarship dates back many decades, long before the advent of modern computing and data analysis.
More recently, technology has been enlisted in the service of faceted classification. By and large, this technology has been applied within historical classification methods and organizing principles. Bounded by the traditional methods, attempts to provide a fully automated method of faceted classification have generally been frustrated.
As an indication of the state of the art, ENDECA™ is an example of an automated categorization and faceted navigation system. ENDECA is recognized as a leader in product excellence in the information categorization and access system industry [http://www.usatoday.com/tech/products/cnet/2007-06-29-endece-google_N.htm].
ENDECA's technology uses guided navigation and a meta-relational index which houses the dimensions of the data and documents as well as the relationships among the dimensions: for example, U.S. Pat. No. 7,062,483, Jun. 13, 2006: “Hierarchical data-driven search and navigation system and method for information retrieval”; U.S. Pat. No. 7,035,864, Apr. 25, 2006: “Hierarchical data-driven search and navigation system and method for information retrieval”.
ENDECA's system includes a categorization approach that is described by the company as taxonomy definition and classification: U.S. Pat. No. 7,062,483, Jun. 13, 2006: “Hierarchical data-driven search and navigation system and method for information retrieval”.
The current state of automated categorization technology is used predominantly for, and is most useful with, what industry experts term “structured data repositories” and “managed content repositories.”
Another limitation of the current state of automated categorization technology is its lack of human-based feedback for the cognitively demanding aspects of categorization. For instance, while ENDECA has feedback loops for faceted navigation—including usage popularity to drive search result presentation and priority—it does not have a usage-based feedback loop to improve the semantic definitions and semantic relationships of the content.
Another major category of hybrid categorization systems may be described as large-scale collaborative categorization. This approach attempts to combine the cognitive advantages of manual categorization with the processing power of automated systems. Collaborative categorization systems in this emerging field are called a variety of names: “Web 2.0”, “collaborative categorization”, “folksonomy”, “social indexing”, “social tagging”, “collective intelligence”, and others. FLICKR™ (a photo-sharing community), DEL.ICIO.US™ (a social bookmarks manager), and WIKIPEDIA™ (the wiki-based collaborative encyclopedia) are examples of this emerging category of collaborative categorization.
In varying proportions, these systems use technology to provide a framework for wide-scale and distributed collaboration, while allowing the collaborators to make decisions about the categories, concepts, and relationships. One challenge to this approach is that it creates clashes between the guidance of topic and classification experts and the input of lay end-users, who often have very different perspectives on the content and different approaches to categorizing it. These systems can help people collaborate by identifying areas of ambiguity and inconsistency, and by highlighting the competing opinions among the collaborators. But ultimately, in a collaborative system, people still need to reconcile their differences and reach broad agreement on even the most slippery of terms. This process is thus difficult to scale and extend across large and varied information domains.
A leading example of the collaborative categorization approach is Metaweb Technologies, Inc., which aims to categorize wide-scale, open information domains collaboratively in order to create a searchable database over the Web and other complex and varied information environments.
Metaweb Technologies has received much attention for its pioneering collaborative approach to creating the Semantic Web. Metaweb Technologies has filed two patent applications with the United States Patent & Trademark Office [United States Patent Application 20050086188, “Knowledge web,” Apr. 21, 2005; United States Patent Application 20030196094, “Method and apparatus for authenticating the content of a distributed database,” Oct. 16, 2003].
Metaweb Technologies' collaborative ontology building relies on the “wisdom of the crowd” for its collaborative categorization. With it, end users define and extend multiple schemas that can be used by everybody. According to noted industry watcher Esther Dyson, “Metaweb's creators have ‘intelligently designed’ the grammar of how the relationships are specified, but they are relying on the wisdom (or the specific knowledge) and the efforts of the crowd to create the actual content—not just specific data, but specific kinds of relationships between specific things.” [Release 0.9: Metaweb—Emergent Structure vs. Intelligent Design, Mar. 11, 2007, http://www.huffingtonpost.com/esther-dyson/release-09-met_b—43167.html] The limitation of this approach is that the scope and quality of the database are constrained by the semantic content that its users contribute. It also relies on the ability of experts and lay people to agree on specific data elements and to specify relationships among content in a way that eliminates redundancy, so that the database contains definitive information.
Thus, there are many disadvantages with the current state of the art in automated faceted classification, automated categorization, and large-scale collaborative classification. Technologies are applied within or based on traditional methods. Enhanced classification methods are needed that effect fundamental changes to the structure of information.
For facet analysis, the input of human cognition is generally required, as there are no universal patterns or heuristics for facet analysis that work across all information domains. Presently, only humans possess the full breadth of pattern recognition skills. Unfortunately, structural patterns (such as semantic or syntactical structures) generally must be identified across the entire domain of information to be classified, and many different patterns may identify facets and attributes. While people can be trained to identify these patterns on small (local) data sets, the task becomes prohibitively difficult as the size of the domain increases.
Limitations are also introduced due to human involvement when the computational demands of the analysis and synthesis processes exceed the powers of human cognition. Humans are adept at assessing the relationships between informational elements at a small scale, but fail to manage the complexity over an entire domain in the aggregate. Systems are needed that are able to aggregate small, localized human inputs across an entire domain of information.
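As a rough illustration of what aggregating small, localized human inputs might look like, the sketch below tallies many individual "broader/narrower" judgments and keeps only the relationships with enough support. The vote threshold, the tie-breaking rule, and the name `aggregate_judgments` are assumptions made for this example, not features of any system discussed here.

```python
from collections import Counter


def aggregate_judgments(judgments, min_votes=2):
    """Combine many local (broader, narrower) judgments into one set of
    accepted relationships.

    A relationship is kept only if it reaches min_votes and has more
    votes than the reversed relationship (both rules are assumptions).
    """
    votes = Counter(judgments)
    accepted = {}
    for (broader, narrower), count in votes.items():
        reverse_count = votes.get((narrower, broader), 0)
        if count >= min_votes and count > reverse_count:
            accepted[(broader, narrower)] = count
    return accepted


# Each tuple is one small judgment made by one contributor (hypothetical data).
judgments = [
    ("wine", "red wine"),
    ("wine", "red wine"),
    ("red wine", "Merlot"),
    ("red wine", "Merlot"),
    ("Merlot", "red wine"),  # a dissenting judgment
    ("wine", "Merlot"),      # only one vote, below the threshold
]
relationships = aggregate_judgments(judgments)
```

No single contributor sees the whole domain in this sketch; each supplies only pairwise judgments, and the aggregate structure emerges from the tally.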
Faceted classification schemes enable multiple perspectives, an oft-cited benefit. Unfortunately, when these perspectives are fragmented across multiple hierarchies, they are not intuitive. This poses serious problems of visualization, integration, and holistic perspective. As the number of facets (or dimensions) in the structure increases, visualization becomes increasingly difficult. Consequently, visualizations of faceted classification schemes are often reduced to “flat”, one-dimensional result sets; structures are navigated across only one facet at a time. This type of reduction obscures the rich complexity of the underlying structure.
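The difference between navigating one facet at a time and combining facets can be sketched as follows. The catalog entries and the function name `filter_by_facets` are hypothetical and serve only to illustrate the point.

```python
# A tiny hypothetical catalog; every item carries one attribute per facet.
WINES = [
    {"name": "A", "color": "red", "region": "Rioja"},
    {"name": "B", "color": "red", "region": "Bordeaux"},
    {"name": "C", "color": "white", "region": "Bordeaux"},
]


def filter_by_facets(items, **selected):
    """Return the items that match every selected facet attribute."""
    return [
        item
        for item in items
        if all(item.get(facet) == value for facet, value in selected.items())
    ]


# Navigating a single facet yields a flat result set.
red_wines = filter_by_facets(WINES, color="red")

# Combining facets narrows the set, but the output is still a flat list
# that shows nothing of the hierarchies the items belong to.
red_bordeaux = filter_by_facets(WINES, color="red", region="Bordeaux")
```

Both calls return plain lists, which illustrates the reduction the passage describes: the multi-dimensional structure is used for filtering but is not visible in the result.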
Methods and technologies are needed that combine the expressiveness and flexibility of faceted schemes within integrated and richly descriptive hierarchies. Moreover, this flexibility optimally extends down to the fundamental level of the classification scheme itself, in a dynamic construction of facets as organizing bases.
Once selected, the facets themselves are static and difficult to revise. This represents a considerable risk in the development of a faceted scheme. Classificationists often lack complete knowledge of the information domain, and thus the selection of these organizing bases is prone to error. Under a dynamic system of classification, these risks would be mitigated by the ability to easily add or alter the underlying facets. Traditional methods of classification and derivative technologies lack flexibility at this fundamental level.
Any classification system should also account for maintenance requirements in dynamic environments. As the materials in the domain change, the classification should adjust accordingly. Maintenance often poses an even more daunting challenge than the initial development of the faceted classification scheme. Terminology must be updated as it emerges and changes; new materials in the domain must be evaluated and notated; and the arrangement of facets and attributes must be adjusted to contain the evolving structure. Many times, existing faceted classifications are simply abandoned in favor of whole new classifications.
Hybrid systems involve humans at key stages of analysis, synthesis and maintenance. Involved early on in the process, humans often bottleneck the classification effort. As such, the process remains slow and costly. Systems are needed that accept classification data from people in a more decentralized, ad hoc manner that does not require centralized control and authority. These systems may support implicit feedback mechanisms, wherein the very activities of information access and information consumption provide positive support for the maintenance and growth of the classification scheme.
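One possible form of implicit feedback, offered purely as an assumption-laden sketch, is to strengthen the association between concepts that users access together. The notion of a "session", the pair-counting rule, and the name `update_weights` are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def update_weights(weights, session_concepts):
    """Add one unit of support to every pair of distinct concepts that
    were accessed in the same session (an assumed feedback rule)."""
    for pair in combinations(sorted(set(session_concepts)), 2):
        weights[pair] += 1
    return weights


# Relationship weights accumulate as a side effect of ordinary access.
weights = Counter()
update_weights(weights, ["Merlot", "red wine"])
update_weights(weights, ["red wine", "Merlot", "Bordeaux"])
```

After these two sessions the pair ("Merlot", "red wine") has a weight of 2 and each pair involving "Bordeaux" has a weight of 1. No explicit classification decision was requested from any user, which is the property the passage calls for.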
To guide the process, hybrid systems are often based on existing universal schemes of faceted classification. However, these universal schemes do not always apply to the massive and rapidly evolving modern world of information. There is a need for customized schemes, specialized to the needs of individual domains.
Since universal schemes of faceted classification cannot be applied universally, there is also a need to connect different domains of information together. However, while providing the opportunity to integrate domains, solutions ought to respect the privacy and security of individual domain owners.
The sheer magnitude of our classification needs requires systems that can be managed in wide decentralized environments involving large groups of collaborators. However, classification deals in complex concepts, with shades of meaning and ambiguity. Resolving these ambiguities and conflicts often involves intense negotiations and personal conflicts that derail collaboration in even small groups.