1. Field of Invention
The present invention relates to a post-Gutenberg reading experience and, in particular, to a method of generating a document representation, to a related computer program product and to a related system.
2. Description of Related Art
a. Introduction
The World Wide Web is having a profound impact on humankind. Because information in many forms from many sources is made instantaneously available around the world to anybody with an internet connection, our perceptions of space, time and knowledge are changing. The internet, as we all concede now, has profoundly changed the functioning of human societies across the world. At the center of this change is the democratization and globalization of information by way of electronic documents. In this regard, the internet of the mid-1990s is likened to the Gutenberg printing press of the Renaissance—a profound catalyst in the development of human thought and knowledge.
And the Web continues to evolve, from a strict separation of information producers and consumers that communicate with each other using simple wire and message protocols (HTTP/HTML) to communities of information prosumers that exchange information using application programming interfaces (APIs) known as Web Services. The current state of affairs, which is sometimes called the “read/write Web” or “Web 2.0”, is characterized by the ever-present Web as a platform, data centrism, the use of XML/XML Schema as a standard for data exchange between machines, a plethora of collaboration models (Wikis, Folksonomies) and rich client technologies (AJAX, Microformats, RSS feeds, etc.). It is said that a “collective intelligence” is embedded in the Web, that “the growth of the Internet over the last several decades more closely resembles biological evolution than engineering”.
b. “Drowning in Information, Starving for Knowledge”
However, the globalization and democratization of information enabled by the Web pose tremendous challenges for both individual and community. It is not just that the amount of information is overwhelming; it is also unstructured and cacophonous. Without a formalism that a machine can understand and without an assignment of semantics, information cannot be processed by a machine in any meaningful way, leaving Web users “drowning in information, starving for knowledge”. The problem of information overload is quite insidious, and forces many people to rely on a remarkably small number of information sources as a coping mechanism.
c. Semantic Web
The goal of the Semantic Web is to enable machine understanding, so that information becomes knowledge, so that this knowledge is understood by both human and machine, and so that machines talking to machines enhance the human experience on the Web.
A first and far-reaching step towards the Semantic Web is the Resource Description Framework (RDF), an open standard recommended by the World Wide Web Consortium (W3C) as early as 1999 and revised in 2004. The purpose of RDF is to facilitate knowledge representation and information modeling via directed graphs. In effect, RDF is an assertional semantic network, a kind of structure that grew out of artificial intelligence work in the 1980s. As such, RDF is a form of knowledge representation: it allows statements about web resources to be made. Each statement takes the form of a subject-predicate-object triple, in which the subject and predicate are named by Uniform Resource Identifiers (URIs) and the object is either a URI or a literal (setting aside anonymous, or blank, nodes).
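By way of illustration only, the subject-predicate-object form of an RDF statement can be sketched in a few lines of Python. The class names, the `objects_of` helper and the `http://example.org/` identifiers are assumptions made for this sketch. They are not part of the RDF recommendation or of any RDF library, and blank nodes, datatypes and language tags are deliberately left out.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class URI:
    """A resource identified by a Uniform Resource Identifier."""
    value: str


@dataclass(frozen=True)
class Literal:
    """A plain data value such as a name or a date."""
    value: str


@dataclass(frozen=True)
class Triple:
    """One RDF-style statement: subject, predicate, object."""
    subject: URI
    predicate: URI
    object: Union[URI, Literal]


# Hypothetical example.org identifiers, used only for illustration.
EX = "http://example.org/"
statements = {
    Triple(URI(EX + "gutenberg"), URI(EX + "invented"), URI(EX + "printing-press")),
    Triple(URI(EX + "gutenberg"), URI(EX + "name"), Literal("Johannes Gutenberg")),
}


def objects_of(subject: URI, predicate: URI) -> set:
    """Return every object asserted for the given subject and predicate."""
    return {
        t.object
        for t in statements
        if t.subject == subject and t.predicate == predicate
    }
```

Under these assumptions, `objects_of(URI(EX + "gutenberg"), URI(EX + "name"))` yields the single literal holding the name, while the `invented` predicate yields a URI that may itself be the subject of further statements. It is this chaining that gives the set of triples its directed-graph character.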
RDF, replete with syntax but not semantics, is a necessary but not a sufficient condition for realizing the Semantic Web. Semantics requires the ability to model the real world, to formulate abstractions about entities and their relations. To assist in reaching this goal, the W3C recommended OWL (Web Ontology Language) in 2004. OWL is built upon RDF Schema, which in turn is expressed in RDF. There are three OWL variants: OWL Lite, OWL DL and OWL Full. Taken in that order, the three variants provide increasing expressiveness and commensurately decreasing computational completeness.
d. Semantic Web Inertia
While the Semantic Web has made great progress in some problem domains (medicine, bio-informatics, life sciences), and remains critical in many ways to data and system integration (grid computing, enterprise application integration), it has yet to obtain wide, popular acceptance. Lack of acceptance stems from the intrinsic difficulties of implementing the Semantic Web. In particular, the formulation and use of knowledge with respect to correctness, coherence and completeness is extremely hard for most people. Some argue that knowledge by nature is neither machine tractable nor fully amenable to deductive logic; that real-world knowledge can never be truly and fully captured; that striving towards correctness and completeness of semantics results in overly complex, poorly understood and ultimately counter-productive systems; that shared views are hard to create and that enforcing shared views does “more to debase semantics than create informative connections”.
e. Limitations of Folksonomies
The folksonomies of Web 2.0 are not an answer to the inaccessible formalisms of the Semantic Web. Collaborative tagging or social classification provides metadata, but this metadata has few or no restrictions. So, while it may be true that “tags are causing context to explode”, enabling more sophisticated searching, the dearth of rules means lack of structure and organization, sparse semantics, and little or no enablement of the machine's ability to understand.
f. Perspective Change
The crux of the problem distills to this: is there something between folksonomies (tag clouds) and ontologies (formal specifications of conceptualizations) that can help the user, with machine assistance, to function effectively against an overwhelming tide of information on the Web?
Without some sort of formalism, machine interaction is not possible; with too much formalism, human interaction dissipates. One way to decrease the demands of formal knowledge representation is to not insist on sharing knowledge while continuing to share information. Personal knowledge, as opposed to shared knowledge, is a more tractable problem on the Web. This change in perspective reduces the criticality of, but does not obviate the need for, formal knowledge representation or the utility of ontologies.
g. Personal Knowledge as Semantic Network
With respect to personal knowledge, an exceedingly simple, elegant and effective way to represent knowledge that is both human and machine understandable is a semantic network. A semantic network is a form of structured data, the data comprising concepts and relations between the concepts, e.g., in the form of concept-relation-concept (CRC) triples. Semantic networks per se (i.e. as a topic of research in the field of linguistics) are well known in the art.
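As a minimal sketch of such a structure, a semantic network can be held in memory as a set of CRC triples indexed by their source concept. The class `SemanticNetwork`, its method names and the sample concepts are hypothetical and chosen only for illustration. They do not describe any particular implementation.

```python
from collections import defaultdict


class SemanticNetwork:
    """A tiny in-memory semantic network built from CRC triples."""

    def __init__(self):
        # Maps each concept to the set of (relation, concept) edges leaving it.
        self._edges = defaultdict(set)

    def add(self, source, relation, target):
        """Record one concept-relation-concept triple."""
        self._edges[source].add((relation, target))
        # Register the target as a concept even if nothing leaves it yet.
        self._edges.setdefault(target, set())

    def concepts(self):
        """Return every concept mentioned in any triple."""
        return set(self._edges)

    def related(self, source, relation=None):
        """Concepts one step from `source`, optionally filtered by relation."""
        return {
            target
            for (rel, target) in self._edges.get(source, set())
            if relation is None or rel == relation
        }

    def triples(self):
        """Return the network as a flat set of CRC triples."""
        return {
            (source, rel, target)
            for source, edges in self._edges.items()
            for (rel, target) in edges
        }


# Sample personal knowledge, invented for this sketch.
net = SemanticNetwork()
net.add("printing press", "invented by", "Gutenberg")
net.add("printing press", "produces", "printed book")
net.add("e-reader", "emulates", "printed book")
```

A person can read the three `add` calls directly as statements, and a machine can answer simple questions from the same data, for example `net.related("printing press", "produces")`.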
“Semantic nets” were first invented for computers by Richard H. Richens of the Cambridge Language Research Unit in 1956. Concept maps, developed by Joseph D. Novak in the 1970s, attest to the appropriateness of semantic networks to human learning and understanding. John F. Sowa is credited with research on conceptual graphs and semantic networks as formal knowledge representation in the 1980s. On a technical level, RDF and Topic Maps (ISO/IEC 13250) have been public standards for the representation and interchange of knowledge as semantic networks since the late 1990s.
h. Patent Literature
U.S. Pat. No. 6,263,335, entitled “Information Extraction System and Method Using Concept-Relation-Concept (CRC) Triples”, discloses an information extraction system that allows users to ask questions about documents in a database, and responds to queries by returning possibly relevant information which is extracted from the documents. In this context, concept-relation-concept triples are built from unstructured text.
US 2008/0133213, entitled “Method and System for Personal Information Extraction and Modeling with Fully Generalized Extraction Contexts”, focuses on concept-based extraction and modeling in a way to concisely control both the terms that are being sought, and the context in which they are sought. The underlying principle hinges on the pairings of resource-context (R:C) and extractor-concept (X:K), such that X [extractor] is an algorithm that identifies the instances of K [concept] within C [context] in each resource r in a resource set R. This has several direct consequences:
(i) The system requires an a priori model to start. A model is an unordered set of concepts, absent the notion of any root or starting concept. The user can manually generate a concept model, but better would be an import of a project with a previously defined model, meaning concepts and their corresponding extractors. At any time, a user can add, remove or modify concepts in a model, thereby creating a model of personal information.
(ii) The relationship between X and K in the underlying principle requires that X have some sort of a priori knowledge of K, in order for X to fulfill its goal of K identification. Managing more than one concept at a time requires that “boolean operations be used on extractors to create extraction patterns”.
(iii) The basic modus operandi of the system requires the user to work with concepts in a serial manner when extracting (X) concept (K) relevant information from a corpus of documents (R:C). In order to identify all instances of many concepts at once in a document corpus, the user would be obliged to use a priori knowledge to combine extractors through “boolean operations”.
(iv) The primacy of context (C) makes the overall application document (R) centric. This is apparent in the statement that one of the key objectives in the system is the identification of all instances of a concept within a document corpus.
Hence, the document relates to both “a method for annotating a set of documents in a model of information” and “a method of determining a trigger phrase related to the set of documents and defining the set of concepts and corresponding extractors based on the trigger phrase.”
In sum, the a priori requirements of this patent application publication US 2008/0133213 and its process of structuring unstructured text into a pre-defined model using a generalized technique, wielded by the user, appear similar in nature to the Message Understanding Conferences (MUC) on information extraction and template completion. This observation remains true for the case when there is no previously defined model and the application produces a suggestion list of noun phrases from the project's corpus of documents using conventional information extraction techniques (lexical parsing, N-gram analysis and the like).
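The extractor-concept pairing attributed above to US 2008/0133213 may be easier to picture with a short sketch. The following is not the implementation disclosed in that publication. It is a simplified illustration, under the assumption that an extractor is a function over text and that a model is a mapping from concept names to extractors. The names `Extractor`, `keyword_extractor` and `annotate`, and the sample documents, are invented here.

```python
import re
from typing import Callable, Dict, List

# An extractor (X) is assumed to be any function that finds instances of one
# concept (K) inside a piece of text (the context C).
Extractor = Callable[[str], List[str]]


def keyword_extractor(*terms: str) -> Extractor:
    """Build an extractor matching any of the given whole words, ignoring case."""
    alternatives = "|".join(re.escape(term) for term in terms)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    return lambda context: pattern.findall(context)


def annotate(
    resources: Dict[str, str], model: Dict[str, Extractor]
) -> Dict[str, Dict[str, List[str]]]:
    """Apply each concept's extractor to each resource r in the set R."""
    return {
        name: {concept: extract(text) for concept, extract in model.items()}
        for name, text in resources.items()
    }


# Hypothetical model: concepts and their extractors must exist up front.
model = {
    "standard": keyword_extractor("RDF", "OWL"),
    "organization": keyword_extractor("W3C"),
}
resources = {
    "doc1": "The W3C recommended RDF in 1999 and OWL in 2004.",
    "doc2": "Folksonomies provide metadata without formal rules.",
}
annotations = annotate(resources, model)
```

The sketch makes the a priori requirement visible: `annotate` can only find what `model` already names, and each extractor must already encode some knowledge of its concept (here, a fixed list of terms).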
i. Electronic Readers
Although we live in what has been called the “post-Gutenberg era”, we have not yet created the tools which allow us to read electronic documents with any more efficiency than their paper counterparts. Indeed, the online reading experience is so inferior to the reading experience of printed material that many people print electronic documents that are more than two or three pages in length.
The problems associated with the online reading experience have spawned an industry of dedicated electronic readers. Most of these readers attempt to emulate the physicality of paper, especially with respect to layout, background lighting, font-size, etc. These readers also have rudimentary searching, annotation and navigation abilities. The latest reading devices exploit multi-media capabilities. Of course, many of these reading features can be implemented by software running on general computing devices. The e-reader, however, is designed to facilitate the human ergonomics of reading a book.
The e-reader emphasis on reading as a material activity is inherently limiting—it aspires to no more than what is achievable by paper. Such an approach remains rooted in a Gutenberg mindset, not at all appropriate in a post-Gutenberg world of information where hundreds if not thousands of relevant electronic documents are accessible with the right search request.
The online reading experience, therefore, is a matter not only of form but also of substance. It is the volume of information, often contradictory and incomplete, which challenges our conventional reading skills and habits. In effect, the more information that we are exposed to, the less we understand. It is the nature of being human, and not a machine, to have limited working memory and to experience diminishing returns with every new bit of information that we are exposed to.
Presently, there are no adequate post-Gutenberg tools, if there are any at all. An object of the present invention is to address the above shortcomings. It is an object of the present invention to use semantic networks to represent knowledge, most especially personal experience and knowledge, so that the availability of said knowledge facilitates human-machine collaboration, thereby changing how people interact with text, in whatever form.