1. Field of the Invention
The present invention relates generally to computer-implemented information retrieval systems and methods and, more particularly, to such systems and methods for identifying and displaying information based upon the linguistic content of the information.
2. Related Art
Increasing collection and exchange of computer-readable or computer-accessible documents, such as electronic mail, technical documentation, publications, notes, correspondence, and so on, require improved methods and systems for retrieving particular documents and efficiently displaying them to a user. Various conventional techniques have been developed to search through collections of documents and display the results. However, these conventional techniques have many drawbacks that limit their effectiveness in allowing a user quickly and intuitively to find and examine documents of interest.
One group of such conventional techniques involves searching documents by keywords, topics, titles, or other subject-matter indicators inserted in the documents by their authors. Such techniques generally limit the documents found by a user because the user typically must choose the same, or similar, keywords as those chosen by the authors. There is thus no assurance that a document pertaining to a particular subject matter will, first, be described by the author by the use of a particular keyword, or by any keyword at all, or, second, that the user will choose to search by a keyword chosen by the author. Also, such techniques often require repeated inquiries by the user until a desired document is found. Each such inquiry may require significant effort to devise appropriate keywords or combinations of keywords, and the user generally cannot be assured that the most appropriate keywords or combinations have been tried. Methods for combining keywords may be unintuitive and unique to each search mechanism. Repeated inquiries may consume significant amounts of time, and it may be difficult or impossible to alter previous steps to change the current search results without repeating the entire process. In addition, such keyword-searching techniques typically do not display information about the subject matter of the entire group of documents that is being searched, as contrasted with information about the particular documents in the group that satisfy the search criteria. Similarly, such techniques often do not present the results of searches, or repeated searches, in a manner that enables a user quickly, efficiently, and intuitively to compare the documents retrieved by one search with documents retrieved by another search in order to choose the most promising direction for further searching.
Other conventional systems or methods may attempt to display limited information about the subject matter of the groups of documents that are being searched. Such systems or methods may allow a user to select among a list of keywords, topics, titles, or other subject-matter indicators. Such list may be presented as an index, for example. However, such systems provide no assurance that such list in fact describes the subject matter of the particular group of documents being searched. Rather, the list may consist of a predetermined group of subjects that are presumed to describe the content of representative collections of documents in general, or in particular subject areas. Other lists may include author-supplied descriptors, but, as noted, various authors may not use the same keywords to describe the same subject matter, or may not use keywords that a user would look for, or recognize, as being descriptive of a desired subject matter.
Still other conventional systems or methods may apply limited linguistic analysis to a group of documents in order to attempt automatically to provide information about their subject matter; that is, without relying on author-supplied keywords. For example, such systems or methods may attempt to identify proper nouns that are categorized by comparing them to a dictionary of proper nouns. Such systems or methods typically have significant limitations, including the inability to identify recently coined proper nouns used, for example, in quickly evolving technological fields. Also, certain parts of speech, such as proper nouns, may be systematically underrepresented in certain types of documents, such as is often the case with respect to proper nouns in technical documentation. Further, such systems or methods may not be capable of distinguishing among various uses of the same proper noun. For example, the proper noun "Madonna" may be categorized as pertaining to music or religion, rather than to visual art, because the system or method does not analyze the full morphological and syntactic context in which the proper noun appears.
With respect to all such conventional systems or methods, a user generally may not efficiently and intuitively identify from an initial collection of documents a sub-collection of documents that are likely to pertain to a subject matter of interest. Similarly, a user generally may not efficiently and intuitively further identify a sub-sub-collection of the original document collection, and so on, until a manageably small number of documents remains to be examined. Moreover, information displayed to a user about the subject matter of a collection of documents generally is not presented in an efficient and intuitive manner such that the user may readily determine whether such collection of documents contains a subject matter of interest, or how such desired subject matter relates to other subject matter contained in the collection of documents.
Accordingly, what is needed is a system and method that comprehensively and automatically (i.e., without relying on keywords or other subject-matter indicators inserted by authors) displays to a user the subject matter of a collection of documents, and enables a user intuitively and efficiently to find sub-groups of such collection containing subject matter of interest. In particular, what is needed is a system and method that efficiently displays information about the subject matter of the groups of documents that are being searched. Also, such system and method should enable a user quickly, efficiently, and intuitively to examine and alter the display in order to compare the documents retrieved by one search with documents retrieved by another search, or to successively narrow a search, in order to choose the most promising direction for further searching or to display desired documents.
The present invention is a computer-implemented information analysis and display system and method that dynamically generates and displays topics representing a linguistic content of documents in a file system. In accordance with one aspect of the invention, referred to as a linguistic filter, the documents are user-selected. In accordance with one aspect, the user operates a user computer to select one or more of such dynamically generated and displayed topics, preferably using a graphical user interface. In some embodiments, the linguistic filter displays document identifiers corresponding to those documents that are described by one or more of the topics selected by the user. In such, and other, embodiments, the linguistic filter displays the place or places within a document, or group of documents, at which are located linguistic content giving rise to one or more selected topics.
In one embodiment, the file system is local to the user computer; that is, it is located within the user computer or directly connected to it. In an alternative embodiment, the file system may include one or more file systems that are remote to the user computer; that is, the remote file systems are connected to the user computer through a network, or networks of networks.
In one embodiment, the linguistic filter of the present invention includes an interface manager, a linguistic topic analyzer, and a display manager. The interface manager retrieves selected files from the file system and generates graphical user interfaces to display document identifiers and topics generated by the linguistic topic analyzer, and to receive user selections of files or topics. The linguistic topic analyzer generates the topics representing the linguistic content of the documents based on morphological and syntactic evaluation of the documents. The display manager displays the document identifiers of all, or of a user-selected portion, of the documents so analyzed by the linguistic topic analyzer. Also, the display manager displays those documents having a linguistic content represented by one or more user-selected topics. In one implementation, such user-selected topics may be combined using boolean operators. In one embodiment, the display manager displays the place or places within a document, or group of documents, at which are located linguistic content giving rise to one or more selected topics.
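By way of non-limiting illustration, the combination of user-selected topics using boolean operators may be sketched as set operations over a mapping from topics to document identifiers. The mapping layout and the function names `docs_all`, `docs_any`, and `docs_not` are assumptions made for this sketch only and are not taken from the described embodiments.

```python
from typing import Dict, Iterable, Set

# Hypothetical index: topic -> identifiers of documents described by that topic.
TopicIndex = Dict[str, Set[str]]


def docs_all(index: TopicIndex, topics: Iterable[str]) -> Set[str]:
    """AND: documents described by every selected topic."""
    sets = [index.get(topic, set()) for topic in topics]
    return set.intersection(*sets) if sets else set()


def docs_any(index: TopicIndex, topics: Iterable[str]) -> Set[str]:
    """OR: documents described by at least one selected topic."""
    return set().union(*(index.get(topic, set()) for topic in topics))


def docs_not(index: TopicIndex, all_docs: Set[str], topic: str) -> Set[str]:
    """NOT: documents in the collection that are not described by the topic."""
    return all_docs - index.get(topic, set())
```

In such a sketch, a selection such as "network AND printer" reduces to an intersection of two sets of document identifiers, so an earlier selection may be altered without repeating the linguistic analysis.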
In one embodiment, the linguistic filter also includes a language identifier. The language identifier identifies the natural languages of the documents. In some implementations, a user advantageously may select for display only those topics representing the linguistic content of documents that are written in one or more user-selected natural languages. In some implementations, topics of documents written in one natural language may be displayed in relation to such natural language, topics of documents written in another natural language may be displayed in relation to such other natural language, and so on.
In one embodiment, the interface manager includes a graphical user interface (GUI) interpreter, a GUI generator, and a file folder retriever. The GUI interpreter receives information regarding a user's selection from a graphical user interface, and directs such information to other modules of the linguistic filter of the present invention, including the file folder retriever. The GUI generator generates graphical user interfaces for displaying information to the user and for enabling the user to make a selection from such displayed information. The file folder retriever retrieves selected files containing documents (thus referred to as selected documents) from the file system, identifies a document identification for each document in such files, and stores the documents in those files into a document buffer. In an alternative implementation, the file folder retriever may store in the document buffer pointers to the documents in the selected files rather than the documents themselves. The selected files preferably are user-selected, and thus the documents therein are also user-selected.
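For illustration only, the two buffering alternatives described for the file folder retriever (storing the documents themselves, or storing pointers to them) may be sketched as follows. The function name `retrieve_documents`, the use of the file name as the document identification, and the assumption of one UTF-8 text document per file are all hypothetical simplifications.

```python
from pathlib import Path
from typing import Dict, Iterable, Union


def retrieve_documents(
    files: Iterable[Path], store_pointers: bool = False
) -> Dict[str, Union[str, Path]]:
    """Build a document buffer keyed by a document identification.

    Assumption: each file holds exactly one document, and the file name
    serves as that document's identification.
    """
    buffer: Dict[str, Union[str, Path]] = {}
    for path in files:
        doc_id = path.name
        if store_pointers:
            # Alternative implementation: keep only a pointer to the document.
            buffer[doc_id] = path
        else:
            # Default: copy the document text itself into the buffer.
            buffer[doc_id] = path.read_text(encoding="utf-8")
    return buffer
```

Storing pointers keeps the buffer small for large collections, at the cost of reading each document again when its text is needed.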
In one embodiment, the linguistic topic analyzer linguistically analyzes the selected documents to dynamically generate a data structure including topics and topic modifiers, such data structure referred to as a topic tree data structure. In one implementation of such embodiment, such topic tree data structure also includes occurrence records related to such topics and topic modifiers. The term "occurrence record" refers to a record that includes a direct or indirect pointer to the location of a document, and, in some implementations, to the location in such document of a grammatical unit, that gave rise to a topic or topic modifier.
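One possible in-memory shape for such a topic tree data structure is sketched below, purely as an example. The class names `OccurrenceRecord` and `TopicNode`, the helper `add_occurrence`, and the use of a character offset to locate a grammatical unit are assumptions of this sketch and are not prescribed by the description.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class OccurrenceRecord:
    """Points to the document, and the place in it, that gave rise to a topic."""
    doc_id: str   # hypothetical document identifier
    offset: int   # hypothetical character offset of the grammatical unit


@dataclass
class TopicNode:
    """A topic (or topic modifier) with its occurrence records and modifiers."""
    name: str
    occurrences: List[OccurrenceRecord] = field(default_factory=list)
    modifiers: Dict[str, "TopicNode"] = field(default_factory=dict)


def add_occurrence(
    tree: Dict[str, TopicNode], path: List[str], record: OccurrenceRecord
) -> None:
    """Record an occurrence for a topic followed by zero or more modifiers.

    The record is attached to the topic and to each modifier along `path`.
    """
    level = tree
    for name in path:
        node = level.setdefault(name, TopicNode(name))
        node.occurrences.append(record)
        level = node.modifiers
```

For example, calling `add_occurrence(tree, ["printer", "laser"], OccurrenceRecord("a.txt", 10))` would create a topic "printer" with a modifier "laser", each pointing back to offset 10 of the document "a.txt".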
In one embodiment, the linguistic topic analyzer also dynamically assigns weights to each of the topics and topic modifiers, such weights generally representing the importance of the topic or topic modifier as measured by the linguistic relevance of the topic in the text, the frequency of its occurrence, or other factors. In one embodiment, the linguistic topic analyzer also represents the linguistic content of some grammatical units by predefined special topics; that is, topics that are not dynamically generated but, rather, represent predefined commonly used categorizations, such as "organizations," or "people."
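The description does not fix a particular weighting formula. As one hypothetical illustration, a weight may be accumulated per topic by adding, for each occurrence, a relevance score that depends on the grammatical role in which the topic appeared, so that both frequency and linguistic relevance contribute. The role names and the numeric scores in `ROLE_RELEVANCE` are invented for this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical relevance scores per grammatical role; not taken from the description.
ROLE_RELEVANCE: Dict[str, float] = {"subject": 3.0, "object": 2.0, "other": 1.0}


def topic_weights(occurrences: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Sum a role-dependent relevance score over every (topic, role) occurrence.

    Roles not listed in ROLE_RELEVANCE are scored as "other".
    """
    weights: Dict[str, float] = defaultdict(float)
    for topic, role in occurrences:
        weights[topic] += ROLE_RELEVANCE.get(role, ROLE_RELEVANCE["other"])
    return dict(weights)
```

Under these assumed scores, a topic appearing once as a subject and once in another role would receive a weight of 4.0, while a topic appearing once as an object would receive 2.0.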
In one embodiment, the display manager includes a topic list generator, a topic list filter, a topic index generator, and a document list generator. In one implementation, the topic list generator links topics and topic modifiers stored by the linguistic topic analyzer in the topic tree data structure so that such topics are linked by weight, preferably in descending order. In alternative implementations, such order may be alphabetical or be based on other criteria. In one implementation, the topic list filter stores in a topic list those of such linked topics that are contained in documents written in a user-selected natural language. The topic index generator indexes the topics stored in the topic list so that they may be displayed, preferably in a hierarchical manner, such as a tree-type graphical user interface. In one implementation, the document list generator stores in a document list those document identifiers and topics corresponding to documents written in a user-selected natural language.
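The ordering and filtering steps attributed to the topic list generator and the topic list filter may be sketched, under simplifying assumptions, as a sort followed by a language test. The function names `order_topics` and `filter_topics_by_language`, and the representation of languages as short codes such as "en", are hypothetical.

```python
from typing import Dict, List, Set


def order_topics(weights: Dict[str, float], alphabetical: bool = False) -> List[str]:
    """Order topics by descending weight (ties broken by name), or alphabetically."""
    if alphabetical:
        return sorted(weights)
    return sorted(weights, key=lambda topic: (-weights[topic], topic))


def filter_topics_by_language(
    ordered: List[str],
    topic_languages: Dict[str, Set[str]],
    selected_languages: Set[str],
) -> List[str]:
    """Keep topics found in at least one document written in a selected language.

    `topic_languages` maps each topic to the languages of the documents
    containing it (an assumed bookkeeping structure for this sketch).
    """
    return [
        topic
        for topic in ordered
        if topic_languages.get(topic, set()) & selected_languages
    ]
```

Because filtering preserves the order of its input, the filtered topic list remains in weight order and can be passed directly to an indexing step for hierarchical display.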
In one embodiment, the GUI generator accesses the document list to generate a display, preferably one that includes a graphical user interface. Such graphical user interface, in one implementation, includes a window, referred to as a document window, that includes document entries including document identifiers. In one aspect, each such document entry also includes an associated list of topics representing the linguistic content of the document represented by such entry's document identifier. In one embodiment, such display also includes a second window, referred to as a topic tree window, that includes a hierarchical representation of such topics. In one implementation, such hierarchical representation includes a collapsible and expandible, tree-like graphical structure of topics, referred to herein as a "topic tree."
In one embodiment, such hierarchical representation is a single merged representation of topics that represents the linguistic content of the user-selected documents taken as a whole. In an alternative embodiment, such hierarchical representation is a single merged representation of topics that represents the linguistic content of the associated lists of topics as a whole, each such associated list of topics, as noted, representing the linguistic content of a document. In some implementations, either of such single merged hierarchical representations includes a collapsible and expandible, tree-like graphical structure of merged topics, referred to herein as a "merged topic tree." In a further implementation of either of such embodiments, such merged topic tree only includes topics that represent the linguistic content of documents written in one or more user-selected natural languages.
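As a minimal sketch of how per-document topic hierarchies might be combined into a single merged topic tree, each hierarchy can be modeled as a nested dictionary and merged recursively. The nested-dictionary representation and the function name `merge_topic_trees` are assumptions of this example, and weights and occurrence records are omitted for brevity.

```python
from typing import Dict, Iterable

# Hypothetical representation: topic name -> subtree of more specific topics.
TopicTree = Dict[str, "TopicTree"]


def merge_topic_trees(trees: Iterable[TopicTree]) -> TopicTree:
    """Merge several topic trees into one, unifying topics with the same name."""
    merged: TopicTree = {}
    for tree in trees:
        _merge_into(merged, tree)
    return merged


def _merge_into(target: TopicTree, source: TopicTree) -> None:
    """Recursively copy `source` into `target` without duplicating shared topics."""
    for topic, children in source.items():
        _merge_into(target.setdefault(topic, {}), children)
```

With this representation, a topic that appears in several documents is shown once in the merged tree, with the more specific topics from every document gathered beneath it.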
In one embodiment, the interface manager and display manager enable a user to display the text of one or more documents by selecting one or more document identifiers in the document window, or one or more topics in the topic tree window. In one implementation, if the user selects one or more topics from the list of topics in a document entry in the document window, or from the topics in the topic tree window, the texts of the document or documents corresponding to such selected topic or topics are displayed, and the grammatical units corresponding to the selected topic or topics are highlighted.
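Highlighting of the grammatical units corresponding to a selected topic may be illustrated, in simplified form, by wrapping known character spans of a document's text in marker strings. The function name `highlight`, the bracket markers, and the assumption that each occurrence is available as a (start, end) character span are hypothetical; an actual display would more likely apply visual styling in the graphical user interface.

```python
from typing import Iterable, Tuple


def highlight(
    text: str,
    spans: Iterable[Tuple[int, int]],
    open_mark: str = "[[",
    close_mark: str = "]]",
) -> str:
    """Wrap each (start, end) span of `text` in markers.

    Spans are processed in order of position; a span that overlaps an
    already highlighted span is skipped.
    """
    pieces = []
    position = 0
    for start, end in sorted(spans):
        if start < position:
            continue  # overlaps the previous highlighted span
        pieces.append(text[position:start])
        pieces.append(open_mark + text[start:end] + close_mark)
        position = end
    pieces.append(text[position:])
    return "".join(pieces)
```

For instance, highlighting the span (4, 17) in the text "The laser printer jammed." marks the grammatical unit "laser printer".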
The linguistic filter of the present invention thus provides a display that advantageously enables a user efficiently and intuitively to select, filter, or browse through a group of selected documents based on the selection of one or more topics representing the linguistic content of one or more of the selected documents. Advantageously, each such topic is displayed in relation to other topics; that is, displayed so as to indicate the relative linguistic importance of such topics or to indicate any hierarchical relationship among them, or both.