The brief description of the problem area is restricted to the two-sided problem facing the user instructed to investigate documents. In this situational context she knows from experience that some particular parts of the documents are more noteworthy for the interpretative task she is engaged in, and that other parts may be considered as more superfluous. Further she finds it difficult to rapidly locate the important parts in a context characterised by time pressure.
The present invention is founded on the assumption that improved indexing engines, search engines, and other tools developed in response to search-related problems are not adequate for the performance of interpretative tasks involving in-depth investigations of texts. These tools are primarily seen as addressing another type of problem, commonly denoted as ‘information overflow’, where the goal is to detect a possibly useful subset of documents from web-accessible collections comprising millions of documents.
Search systems of the prior art are, by and large, based on the so-called ‘traditional model of information retrieval’. This model is thoroughly characterised and discussed in information retrieval literature. A quote extracted from Blair (1990) indicates the principal features of the problem in focus: “ . . . the traditional model of information retrieval which stipulates that the indexer's (or automatic indexing procedure's) job is to accurately describe the content and context of documents, regardless of how the inquirers might describe that content, and the inquirer's task is to guess how the documents he might find useful have been represented. This is the normative model of information retrieval and it is implicit in most information retrieval models.” (1990:189).
The provision of the right information and saving time puts pressure on better acquisition procedures and as the amount of information that is available is steadily growing, the burden on indexing devices also becomes higher. This is commonly denoted as the ‘scaling problem’.
The present invention looks beyond information dissemination as merely making more information available. The present invention presupposes intermediaries in the organisation (user community) who gather documents from various sources. Acquisition, segmentation, disambiguation and the underlying indexing principles are of critical importance for effective dissemination, searching and use of document collections. The answer to the specific problems addressed is not found in smarter search algorithms or so-called intelligent agents per se, although new functionality and new visualisation techniques might help. The answer embedded in the present invention is to get the user closer to the content by new representational means and a new set of tools interfacing these.
The challenge is to transform the essential documents into a system that differentiates between document types and constructs representations of texts extracted from documents in a manner that attracts the users' attention. The textual content has to be transformed and reduced into a form that makes the content accessible with less effort and time expenditure. Specially designed services will add value to the content representations by applying a particular apparatus for zonation and an apparatus for filtering that delivers results in an interface, preferably denoted as a text sounding board.
The Principle of Text Driven Attention Structures
In order to explain this principle, it is necessary to include a brief reflection on the concepts of ‘meaning’, ‘understanding’, and ‘context’, with reference to characteristics of text. Preferably this reflection relates to genres such as argumentative texts, directive texts and narratives.
First of all, the comprehension of, and therefore the definition of, the concepts ‘meaning’ and ‘content’ is dependent on that of ‘context’. The environment of the present invention comprises two substantially different parts: 1) Authors who are situated in a situational context and, for some reason, produce documents that reflect some of the features of that situational context as perceived by the authors. 2) Users who are situated in another situational context and, for some reason, have to confront documents in order to read and interpret the ‘meanings of the author’ mediated in text.
Authors who are situated in particular situational contexts produce documents, and other actors who are situated in quite different situational contexts use documents. Even if the author and user happen to be the same person, the situation at the time of writing will be different from a later situation of exploring and reading. The user may perceive/understand one ‘meaning’ from the text's content in one situational context, but seek another ‘meaning’ from the same content in another situational context. The situational context is for instance influenced by the work task at hand, time available, background knowledge, etc.
‘Meaning’ and ‘context’ appear in varying situations, but are still mutually related and relative. To situate some words or word constellations within the text's inner context, i.e., within the context of the text itself, will lead the user's attention to a certain place (‘locus’). However, for the user to understand the visualised place as meaningful, she also has to understand it as meaningful in the situational context in which she operates, i.e., why she finds it necessary to explore and read documents.
The concept of ‘meaning in context’ cannot be defined properly since it denotes a kind of circularity of enclosure. The interrelationships between ‘meaning’ and ‘context’ can be expressed if ‘context’ is seen as levels of enclosure. Thus the words' inner context is the words in the surrounding area and preferably with the document from which the text is extracted seen as the edges of the inner context. A particular text has also an outer context, also textual, as defined by other documents in some way related to the situational context in which they were produced. The situational context is the world ‘outside the text’ and each text always reflects more or less, one or several authors' interpretation of this ‘outside world’.
A user situated in a different situational context can thus be made aware of some features related to the author's interpretation mediated in text. A text driven generation of text zones reflects some of these features as related to how the author's focus of attention moves and shifts across the text's collection of sentences. Thus text zones provide for artificial horizontal sub-contexts, i.e., horizontal in that sentences follow each other in sequences, at least within the cultural environment of the present invention.
The text zones reflect particular patterns of repetition which, when taken together with words that are not repeated in particular within a zone, build up structures of attention originating from the author. The patterns of repetition encompass several textual features at different levels, i.e. not only lexical features but also features related to grammatical form, such as tense and modality, and superordinate argumentative functions signalling particular discourse elements.
A text zone is an artificial or derived horizontal sub-context (within the inner context), giving the background information for a particular word occurrence or word constellation. This background information affects the ‘meaning’ of the word occurrence as determined by the author. Likewise, the background information affects the ‘meaning’ of the word occurrence as understood by a user in a totally different situational context. The background information can be as significant as the particular word occurrence when the user decides whether the ‘meaning’ or ‘content’ is useful in that particular situational context.
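The idea of a zone as a run of sentences held together by repetition can be pictured with a minimal sketch. It is not the zonation apparatus of the present invention: it assumes only lexical repetition (no tense, modality or argumentative functions), a hypothetical stop-word list, and a hypothetical function name `zones_by_repetition`. A new zone is opened whenever a sentence repeats no content word of the zone built so far.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for illustration).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "was", "by"}

def content_words(sentence):
    """Return the set of lower-cased words of a sentence, minus stop words."""
    return {w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOPWORDS}

def zones_by_repetition(sentences):
    """Group consecutive sentences into zones.

    A sentence joins the current zone if it repeats at least one content
    word already seen in that zone; otherwise it opens a new zone.
    """
    zones = []
    vocabulary = set()
    for sentence in sentences:
        words = content_words(sentence)
        if zones and words & vocabulary:
            zones[-1].append(sentence)
            vocabulary |= words
        else:
            zones.append([sentence])
            vocabulary = set(words)
    return zones

sentences = [
    "The court heard the appeal.",
    "The appeal was rejected by the court.",
    "Children suffer from many diseases.",
    "These diseases are treated in hospitals.",
]
zones = zones_by_repetition(sentences)
```

With these four sentences the sketch yields two zones of two sentences each, so an occurrence of ‘appeal’ is presented against the background of the first zone only.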
Consequently, the notion of ‘equality’ between words, either the very same word or its synonyms or near-synonyms, is by definition an ambiguous concept. Equality or sameness refers to word occurrences that, according to some schools of thought, purport to refer to the ‘same entity’ in the situational context, i.e. an entity supposed to exist in the world outside the text. The present invention is based on the assumption that even the ‘same entity’ will be perceived differently and that this perception again varies with context. This leads to the assertion that there are no criteria for determining equality between the very same words occurring in varying contexts. It will therefore not be possible to construct a description for a word and its interconnections to other words that is detached from context, and thereafter apply the description to the same word occurring in various contexts. Since the identification of text zones is dependent on the identification of word occurrences and how they repeat in patterns of fluctuation, the users' recognition and understanding of the word occurrences will be dependent on the text zone in which the word occurs, i.e., the word occurrences' situated background information.
This brief reflection explains why the present invention does not rely on, or is cautious about, the application of general thesauri (such as the widely used WordNet) or semantic networks, for instance those conforming to the syntax defined for Topic Maps (ISO/IEC 13250).
The present invention instead relies on a method and system for establishing relations between word occurrences with respect to the words' inner context. This explains the principle of text drivenness, in which the text itself gives the necessary background information for the generation and construction of relations between words, where bundles of relations form text zones reflecting how the author's attention moves across the text. When these text zones and particular word occurrences within zones are visualised in a preferred interface, the users' attention will be directed towards these structures, seen as horizons virtually superimposed on underlying grammatically encoded texts. The phrase ‘virtually superimposed’ refers to the fact that the structures are not encoded in the text; rather they are managed in a system of external files and a device that transmits derived information and displays it in a text sounding board. By operating on this text sounding board, the user can directly influence the device that constructs attention structures reflecting the user's explorative moves.
The key concept is that of text driven attention structures reflecting aspects of the authors' attention in their perceived situational context at the time of writing. The concepts of insight, chance and discovery apply to the user, reflecting the knowledgeable user confronted with the texts made accessible via a text sounding board, where the user operates in a totally different situational context. (The concepts of insight, chance, and discovery are borrowed from the ancient legend about the Three Princes of Serendip, as told in Remer (1965).)
The Users' Problem
The user's main problem is to express her ‘information need’. The problem for the user is related to the indexing devices (in a continuum from controlled to free-text indexing), and not so much related to the system's search functionality. (The concept search functionality refers to the implementation of how the system matches the user request against representations of documents in the system and how the system calculates/presents the items most likely to satisfy the user's need).
The main problem for the user is related to the user's ability to express her ‘information need’ as a request submitted to the search system. The search request is a search expression composed of a set of search terms and search operators. The search expressions are indirect in that the searches are not executed in the text itself, but in index structures that are supposed to represent the text content (text content surrogates). The search system compares the constellation of terms in the search expression with the system's index terms (document representations or document vectors).
The index terms in a search expression may be combined in a seemingly infinite number of ways and the user will experience uncertainty as to whether documents are indexed with the terms included in the search expression. Surely, in all information searching, there is an investment of time. Advanced indexing devices aim at reducing search time by trimming the search space. However, the point made is that the user will meet the same type of problem regardless of whether the index structures contain so-called free-text terms or terms from a controlled vocabulary (index terms using the notation from a classification scheme, which in fact is an extreme form of summarisation). The index structures may be restricted to chains of nominal expressions, and concepts may be related by simple semantic links (synonyms, etc.) or arranged in hierarchical structures (broader terms and narrower terms). However, these relations are always much weaker than the original textual semantic relations that incorporate textual coherence.
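The indirect nature of such searches can be illustrated with a minimal sketch of a free-text index queried with Boolean operators. The function names (`build_index`, `search_and`, `search_or`) and the three sample documents are assumptions made for illustration and do not describe any particular prior-art system.

```python
from collections import defaultdict

def build_index(documents):
    """Map each lower-cased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word.strip(".,;:()")].add(doc_id)
    return index

def search_and(index, terms):
    """Return ids of documents indexed under every term (Boolean AND)."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()

def search_or(index, terms):
    """Return ids of documents indexed under at least one term (Boolean OR)."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.union(*postings) if postings else set()

# Hypothetical sample collection.
documents = {
    1: "The attorney filed the appeal.",
    2: "The lawyer filed an appeal on Monday.",
    3: "Diseases of children are treated here.",
}
index = build_index(documents)
```

In this sketch the request is matched against the index only, never against the text: `search_and(index, ["lawyer", "appeal"])` finds document 2 but misses document 1, and a request for ‘paediatrics’ finds nothing although document 3 concerns that subject. The user has to guess which surrogates the index contains.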
The Search Process is a Linguistic Transformation Process
Empirical investigations reveal several factors explaining the user's inability to express her information need accurately enough for the system to produce a result covering that need (normally the discussions differentiate between goal-oriented searches and interest-oriented searches). The user is in a situation in which she has to balance two quite different goals: First of all, she has to predict how supposedly relevant parts of the text are represented in the index system. Secondly, she must formulate a request that retrieves a number of items (documents or text segments) that is adequate with respect to the amount of resources she has available when judging the items' usefulness.
When performing a goal-oriented search in a domain-specific, rather small-scale document base, the user needs a possibility to explore available index terms in order to deliver an accurate request to the system. A search result of, let us say, 100 to 1000 items (or more) is in some situations of no value to the user. The number of items in the result list exceeds the user's futility point, i.e. the user's capacity to browse/read in order to find information accepted as useful.
Many factors influence the user when she is trying to formulate a ‘best match query’ (background knowledge, database heterogeneity, etc.). This process is in fact a linguistic transformation process where the user has to transform her ideas about an information need into a chain of nominal expressions. On the other side, document content has been transformed in another process resulting in lists of isolated concepts.
An isolated term or concept is a word that cannot, in isolation, refer to the meaning mediated in the text (Ranganathan 1967). (An isolated concept can be a component in a compound subject, which in turn is a part of a complex subject.) This assertion covers indexes resulting from automatic indexing procedures as well as from so-called independent subject analysis. Semantic relations that occur in the text cannot be expressed in the index (as opposed to semantic relations encoded in for instance thesauri).
Why Does the User's Request Fail?
A search request may fail for a number of reasons (the request fails when the system delivers a result that the user finds unsatisfactory). The following list gives a simple overview of some important causes related to the use of terms (words, expressions) in the search requests.
Terms are left out (excluded), perhaps because the user assumes that they are not present in the system's index structure or that she assumes them to be of no relevance in a search request or that she believes that certain terms do not have a sufficient discriminating ability.
Terms are included because the user thinks that certain words are present in documents or represented in the index structures. Automatic procedures can remove such terms and/or replace them by classifying them as members of a semantic class in a thesaurus. Replacements may be in conflict with the user's intention or the idea the user is trying to express through a set of terms (however, systems supporting this option normally ask the user to confirm term replacements).
The user selects terms referring to words that are used at present (new or popular terms) or words related to a specific domain (professional language). Documents of potential relevance may be indexed with terms that are different from those used at present but referring to the same meaning. Thesauri inquiries may establish term accordance (terms in request and terms in index structure). This strategy, however, increases the search scope (it involves the operator OR), and thereby the result list may exceed the user's futility point.
The request includes too many terms or terms combined with operators that exclude potentially relevant documents (text segments). Empirical investigations indicate that users are reluctant to alter or remove the first 2-3 terms in a combined list. Automatic procedures can adjust the sequence of terms and/or give terms weights according to their position in a list. If the user considers the first terms as more important than the others, these automatic procedures may conflict with the user's intention.
The request includes terms at an abstraction level different from the terms in the index structure. In more advanced systems the user is given the option to select broader or narrower terms. Alternatively, the user can choose operators that move downwards or upwards in a term hierarchy. Depending on the thesaurus, the search scope may accordingly be too large or too narrow with respect to the user's search intention.
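The effect of thesaurus-based term accordance on the search scope, mentioned in the list above, can be shown with a small sketch. The index contents, the thesaurus entry, the value of `FUTILITY_POINT` and the function names are all hypothetical and chosen only to make the effect visible.

```python
def expand_with_thesaurus(terms, thesaurus):
    """Turn each term into an OR-group of the term and its thesaurus synonyms."""
    return [{term} | set(thesaurus.get(term, ())) for term in terms]

def search(index, or_groups):
    """Combine postings with OR inside each group and AND across groups."""
    result = None
    for group in or_groups:
        hits = set()
        for term in group:
            hits |= index.get(term, set())
        result = hits if result is None else result & hits
    return result or set()

# Hypothetical index (term -> document ids) and thesaurus.
index = {
    "lawyer": {1, 2},
    "attorney": {3, 4, 5},
    "solicitor": {6},
    "appeal": {1, 3, 4, 5, 6, 7},
}
thesaurus = {"lawyer": ["attorney", "solicitor"]}
FUTILITY_POINT = 3  # assumed number of items the user is able to inspect

plain = search(index, [{"lawyer"}, {"appeal"}])
expanded = search(index, expand_with_thesaurus(["lawyer", "appeal"], thesaurus))
```

Here the plain request returns a single document, whereas the expanded request returns five, which is more than the assumed futility point of three.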
Several Failure Causes may be Present in One Request
The user's linguistic transformation problem is that several of these ‘failure causes’ may be present in one search request. The user has no possibility to evaluate her search request with respect to terms available in the index structures. The index terms are ‘hidden’ in that the user only can perceive fragments (if the system at all offers options for looking into the index system).
The problem bears some resemblance to a situation where two persons are trying to converse while speaking two different languages (the user's natural language transformed into a chain of terms and the system's documents transformed into an index structure with isolated terms without relations). The user is in a situation where she tries to learn the system's language in order to achieve a goal (satisfying an information need). However, the learning of a new language presupposes feedback about why a certain expression does not produce a satisfactory search result. No system (yet) provides feedback explaining why the search request failed, and such feedback would be complicated if several of the mentioned ‘failure causes’ are present in the same request. Since the user cannot inspect the system's language use, she will not be able to correct her own language use when formulating search requests. The only available strategy is to proceed tentatively (trial and error) in every new search situation (new tasks with new information requests).
Systems of the prior art embody various proposals aiming at constructing diagnostic devices analysing the user's requests as compared to the results the user evaluates and marks as relevant. Such diagnostic devices seem to have problems dealing with the fact that language use is a dynamic entity “whose times of greatest dynamism and change may come in the very process of interacting with a retrieval system” (Doyle 1963).
The Present Invention's Solution Proposal
As early as in 1963 Doyle considered the role of relevance in information retrieval testing and concluded: “The gradually increasing awareness of human's incapability of stating his true need in a simple form will tend to pull the rug out from under many information retrieval system evaluation studies which will have been done in the meanwhile.”
Doyle argued that the solution to this problem was not to design systems around the concept of relevance, but to base design on the concept of exploratory capability: “the searcher needs an efficient exploratory system rather than a request implementing system”.
With reference to this quote, the present invention addresses the user's problem of formulating queries by providing feedback about the extent to which the request matches the actual content in the documents/texts. A context-dependent and situated content representation takes into account the actual situation of the user. The assumption for the present invention is a domain-specific document collection evaluated as worth delivering to professionals within a certain user community.
Rather than relying on the user's capability of expressing information needs in an accurate manner, the system should provide the user with mechanisms that reflect the actual content in the document collection. The representation of document content must attend to the economy of time, and more costly techniques are justified in terms of offering the user advanced options for exploring text in order to discover text zones that are useful in a given situation. The percentage scores of current search engines are, in this context, an entirely inadequate measure of a system's value for the user. The present invention seeks to solve this problem by incorporating new text theories and language technology into the construction of the system's selectivity. The apparatuses for segmentation and disambiguation perform essential pre-processing of the texts in order for other apparatuses to construct the preferred selectivity embodied as attention structures supporting individual behaviour during text exploration and navigation.
The interconnected apparatuses as outlined in FIG. 1 provide for a new type of selectivity. The particular apparatus that visualises grammar based contacts to the texts prepared for investigation will be explained in more detail below. The interface that displays these contacts is preferably denoted ‘text sounding board’ and provides a kind of ‘decision support’ in that it exposes the texts' content to the user and she is free to select her own moves by operating the content of the text sounding board. Her moves and actions are immediately mirrored in the interconnected text pane as illustrated in FIG. 5.
The selectivity of the present invention incorporates and supports:
Grammatical information derived from CG-taggers
Semantic information and the transfer of techniques related to thesauri construction
Pragmatic information related to text understanding and features related to the situational context
Statistical information derived from applying a reference corpus and computing keyness, and keyness of keyness
Frequency information combined with grammatical information in relation to interconnected documental logical object types
Zonation and filtering realised as intersecting chains, which embody the various types of information outlined above
Search Engines do not Solve this Particular Problem
Despite all the work on search and indexing engines over the past 50 years, the problem of classifying, indexing and retrieving digital content remains a major problem for unstructured data such as text. Search and indexing engines (such as Lycos, Google, AltaVista, InfoSeek, etc.) propose to solve the problem of finding information by constructing indexes from information sources available on the World Wide Web. Oversimplified, this is done by tracing hyperlinks and parsing the pages these hyperlinks point to. The URLs are maintained as entries in global index tables that these engines create, and the pages referenced by the URLs can be retrieved in reply to a search request. Information filters propose to solve the problem of information overload in that they synthesise previous user requests into categories that are regularly invoked to operate on information streams.
Traditional search systems rely on different indexing devices, and different indexing languages vary in the extent to which they use single or compound terms and hierarchies, and in whether index terms are controlled for synonyms or homographs. Free-text indexing devices are often combined with controlled vocabularies (assigned keywords). The user can normally restrict her search scope to certain fields (catalogue elements or Dublin Core elements such as title, author, date of publication, headers, abstracts, and so on) and/or to certain document types. Typical search options are simple searches and category searches (index terms are arranged in controlled hierarchies). More sophisticated systems support GREP searches (named after the ed editor command g/re/p, ‘globally search for a regular expression and print’), which control the matching process based on ‘special characters’ included in the search string, and various types of proximity operators. The employment of statistical and probabilistic techniques is a broadly accepted quantitative framework. However, limitations of the statistical approach are still claimed, since the retrieval performance of systems employing statistical techniques, as measured by various metrics, is still low in absolute terms.
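The regular-expression and proximity options mentioned above can be sketched with Python's standard `re` module. The helper names `regex_search` and `within_proximity`, the word-distance definition of proximity and the sample texts are assumptions for illustration only.

```python
import re

def regex_search(pattern, texts):
    """Return the texts in which the regular expression matches anywhere."""
    compiled = re.compile(pattern, re.IGNORECASE)
    return [text for text in texts if compiled.search(text)]

def within_proximity(text, first, second, max_distance):
    """True if `first` and `second` occur at most `max_distance` words apart."""
    words = re.findall(r"\w+", text.lower())
    pos_first = [i for i, w in enumerate(words) if w == first.lower()]
    pos_second = [i for i, w in enumerate(words) if w == second.lower()]
    return any(abs(i - j) <= max_distance for i in pos_first for j in pos_second)

# Hypothetical sample texts.
texts = [
    "Diseases of children are the subject of paediatrics.",
    "The pediatric ward was closed.",
    "Children were absent; the report concerns diseases of cattle.",
]
spelling_variants = regex_search(r"pa?ediatric", texts)
```

The special character `?` makes the pattern cover both spellings, so the first two texts are returned. A proximity test such as `within_proximity(texts[0], "diseases", "children", 2)` succeeds for the first text but fails for the third, where the two words are six positions apart.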
The Indexing Problem
As mentioned, index structures constitute a system of representations. The concept of representation by definition means that some information is left out. In order to ensure that the loss is not crucial with respect to information searches, the indexing strategy should focus on which information is expendable and which is not. In the following, some principal issues are briefly described. Indexing and classifying (indexing here: using the notations of a classification scheme) appear as a special profession and are often seen as bound to retrieval necessities. Since indexing is bound to technical use in information retrieval, indexers (persons or programs) must strictly observe a set of representational prescriptions. The myriad of indexing strategies can be positioned according to combinations of a wide range of dimensions. Search engines operating on index structures include, to a varying degree, techniques for integrating (comparing, weighing and merging) the index terms across databases. Representing textual content in compliance with prescriptions may explain the cause of several problems related to retrieval issues.
First of all, prescriptions set the requirement for the index terms, thus it can be the source of the ‘isolated’ descriptors assigned to the document at the cost of the textual formulations which may be the best discriminators in a given search situation.
Secondly, different indexing strategies result in different index terms for the very same textual content (extracted from a document), known as the inter-indexers' consistency problem, and the problem exists whether the indexing is performed by a human or a machine.
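One simple way to quantify the consistency problem is a set-overlap ratio between the term sets assigned by two indexers: the number of terms they share divided by the number of distinct terms either of them assigned. The sketch below assumes this measure; the function name `indexing_consistency` and the two term sets are hypothetical.

```python
def indexing_consistency(terms_a, terms_b):
    """Shared terms divided by all distinct terms assigned by either indexer."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 1.0  # two empty term sets agree trivially
    return len(a & b) / len(a | b)

# Hypothetical index terms assigned to the very same text.
human_terms = {"law", "appeal", "court"}
machine_terms = {"appeal", "court", "attorney", "filing"}
consistency = indexing_consistency(human_terms, machine_terms)
```

In this example the two indexers share two of five distinct terms, giving a consistency of 0.4. A request formulated with the terms of one indexer would thus miss part of what the other assigned.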
The tuning of index terms based on statistical information (word weighting procedures) may further obscure textual nuances that have a discriminating search effect. For instance, it is assumed that highly professional authors use a richer vocabulary than more inexperienced authors. Lexical style (influenced by personal, social, cultural, and other contextual factors) reflects the author's choices among immense variations in word constellations used to express more or less the same meaning. Words like ‘lawyer’, ‘attorney’, or ‘solicitor’ are variations in lexical style; however, the textual context may reveal deeper semantic variations. Such simple linguistic variations may be captured in indexing devices with synonymy relations derived from thesauri. The problem escalates when considering the fact that similar meanings may be expressed through sentences having different syntactic structure or word constellations that paraphrase single-word terms (‘diseases of children’ instead of ‘paediatrics’).
The issue of lexical style is related to another indexing problem. Selecting the ‘right’ words from a classification system or thesaurus can be quite complicated when indexing documents with an ‘unexpected’ or innovative content. New terms not covered in the classification system have to be projected onto existing terms, or the indexer has to extend the classification system so that it reflects the new terms. This latter case requires human intervention (independent subject analysis), and in principle also requires a professional indexer with lexicographic competence.
These and related problems explain the viewpoint taken by Langridge (1989): “At present the potential of computers is largely wasted because they are merely used as a medium for inferior indexing methods.” Blair goes even further and claims: “To see the information problem as a computer problem is to confuse physical access with logical access, or to confuse the tool with the job.” (1990:70). The concept ‘logical access’ in information retrieval refers to issues related to reducing the number of logical decisions the user must make when searching for information.
The focus is on how to identify and represent textual content in a text-driven fashion and provide representations visualised in a text sounding board. These representations constitute the logical access points for the set of texts of potential interest to users. The present invention also includes a rich set of options giving the users the opportunity to conduct text exploration and text navigation based on the constellations of access points visualised in the text sounding board.