The invention pertains to the field of text interpretation, representation and reduction and, more particularly, to a computer system and method for intelligently identifying concept(s) relating to an electronic document and using this knowledge to reduce and/or represent the text content of an electronic document (which may be any type of electronic document including Web pages, electronic messages such as e-mail, converted voice, fax or pager message or other type of electronic document).
The volume of information in the form of text, particularly electronic information, being communicated to users is increasing at a very high rate, and such information can take many forms, from simple voice or electronic messages to full document attachments such as technical papers, letters, etc. Because of this, there is a growing need in the communications, data base management and related electronic information industries for means to intelligently condense electronic text information for purposes of assisting the user in handling such communications and for effective classification, archiving and retrieval of the information.
The known document condensers (sometimes also referred to as key word/phrase "extractors" or as "summarizers"), which typically function to identify a set of key words/phrases by utilizing various statistical algorithms and/or pre-set rules, have had limited success and limited scope for application. One such known method of condensing text is described in Canadian Patent Application No. 2,236,623 by Turney which was laid open on 23 December, 1998; the Turney method disclosed by this reference relies upon the use of a preliminary teaching procedure in which a number of pre-set teaching modules, directed to different document categories or academic fields, are provided and a selected one is run prior to using the text condenser in order to revise and tune a set of rules used by the condenser so as to produce the best results for documents of a selected category or within the selected academic field.
However, such prior condensers do not advance the art appreciably because they are primarily statistically based and do not meaningfully address semantic or global linguistic factors which might affect or govern the document text. As such they generally produce only lengthy sets or strings of key words and phrases per se, and the relationships or concepts between those key words and phrases are often lost in the resulting summary. The prior condensers also ignore the intent of the electronic document and, hence, treat news, articles, discussions, journal papers, etc. generically.
In the applicant's co-pending U.S. application Ser. No. 09/494,312 filed on 21 January, 2000, which is incorporated herein by reference, there is disclosed a computer-readable system for intelligently analyzing and highlighting key words/phrases, key sentences and/or key components of an electronic document by recognizing and utilizing the context of both the electronic document and the user. In accordance with that system a document map is created by removing from the input document the white space (i.e. formatting such as line spacing) and designated first stage "exclude" words, which may be defined as conjunctive words (such as the words "and", "with", "but", "to", "however", etc.), articles (such as the words "the", "a", "an", etc.), forms and tenses of the words "to have" and "to be" and other filler words such as "thanks", "THX", "bye", etc., and then the text is stemmed by removing suffixes from applicable words to produce the root thereof (lower case letters only and without punctuation). For example, the words "computational" and "computer" would both be stemmed to the same root viz. "comput". The document map preserves the sentence and paragraph structure of the document and includes stem maps, and a frequency count designation is assigned to each stem such that the map provides a complete list of all word/phrase stems with a frequency count per stem and sentence demarcation (a phrase being a preselected number of consecutive words containing no punctuation or exclude words).
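By way of illustration only, the document map described above may be sketched as follows. This is a minimal sketch and not the implementation of the co-pending application: the names `EXCLUDE_WORDS`, `SUFFIXES`, `stem` and `build_document_map` are hypothetical, the exclude list is abbreviated, the suffix stripper is a naive stand-in for whatever stemming rules are actually used, and paragraph structure and multi-word phrases are omitted for brevity.

```python
import re
from collections import Counter

# Abbreviated, hypothetical first stage "exclude" list (conjunctive words,
# articles, forms of "to have"/"to be", filler words).
EXCLUDE_WORDS = {
    "and", "with", "but", "to", "however", "the", "a", "an",
    "is", "are", "was", "were", "be", "been", "have", "has", "had",
    "thanks", "thx", "bye",
}

# Naive suffix list, longest first; an assumption for illustration only.
SUFFIXES = ("ational", "ation", "ing", "ers", "er", "ed", "s")


def stem(word):
    """Strip the first matching suffix, keeping a root of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_document_map(text):
    """Return (sentences, counts): stems per sentence and a count per stem."""
    sentences = []
    counts = Counter()
    for raw in re.split(r"[.!?]+", text):
        # Lower case letters only, punctuation dropped.
        words = re.findall(r"[a-z']+", raw.lower())
        stems = [stem(w) for w in words if w not in EXCLUDE_WORDS]
        if stems:
            sentences.append(stems)  # preserves sentence demarcation
            counts.update(stems)
    return sentences, counts


sentences, counts = build_document_map(
    "The computer is fast. Computational methods have improved."
)
```

In this sketch both "computer" and "computational" reduce to the stem "comput", so the frequency count for that stem is 2 across the two sentences.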
The negation key phrases of the document map are identified using a negation words list and by determining whether the word "not" is in any form (e.g. as "n't" in the words "couldn't", "shouldn't", "wouldn't", "won't", etc.) present in a phrase. These negation key phrases are flagged and given a weight for purposes of scoring them. The action key phrases of the document map are identified using a verbs list and they are scored on the basis of assigned context weights and conditions. The remaining words/phrases of the document are scored in the manner described in the aforementioned Canadian patent application No. 2,236,623 to Turney but with the important improvement of making use of context determinations of the system which identify "include/exclude" words/phrases. In addition, sentences are scored whereby sentences in a document having a higher number of highly ranked words/phrases are themselves, as a whole, given a relatively high ranking.
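The negation test and the sentence ranking described above may be illustrated by the following minimal sketch. The names `NEGATION_WORDS`, `is_negation_phrase` and `score_sentence` are hypothetical, the negation list is an assumed example, and summing per-stem scores is only one assumed way of giving a higher ranking to sentences containing more highly ranked words/phrases; the weights actually assigned are not reproduced here.

```python
import re

# Hypothetical negation words list; the actual list is not given in the text.
NEGATION_WORDS = {"not", "no", "never", "cannot"}


def is_negation_phrase(phrase):
    """Flag a phrase containing a negation word or a contracted "n't" form."""
    tokens = re.findall(r"[a-z']+", phrase.lower())
    return any(t in NEGATION_WORDS or t.endswith("n't") for t in tokens)


def score_sentence(stems, stem_scores):
    """Assumed ranking rule: a sentence scores the sum of its stems' scores."""
    return sum(stem_scores.get(s, 0.0) for s in stems)


flagged = is_negation_phrase("We couldn't attend")              # True
total = score_sentence(["a", "b", "c"], {"a": 2.0, "b": 1.0})   # 3.0
```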
The inventor herein has discovered that the interpretation and summarization of the text of an electronic document is improved by determining the concept(s) to which the text relate(s) and, in appropriate cases, utilizing this knowledge of the governing concept to produce a representation of the text content rather than a simple summarization or condensed extract thereof.
In accordance with the invention there is provided a computer-readable concept identification system and method for use in reducing and/or representing text content of an electronic document. A concept knowledge base includes a plurality of concepts wherein each concept comprises one or more subconcepts linked to each other and to such concept on a hierarchical basis and wherein one or more of the subconcepts may be linked to one or more subconcepts of another concept. A concept matching module matches text of the document to subconcepts of the concept knowledge base and assesses any links between the matched subconcepts and other concepts and/or subconcepts of the concept knowledge base. From this a determination is made whether the document relates to a concept of the knowledge base. The subconcepts preferably include synonyms therefor.
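For purposes of illustration, the concept knowledge base and the concept matching module may be sketched as below. This is a minimal sketch under stated assumptions: the classes `Subconcept` and `Concept`, the function `match_concept`, the example knowledge base `KB`, the scoring values (1.0 per matched subconcept, 0.5 for a cross-linked concept) and the threshold are all hypothetical, and cross-links are simplified to point at another concept by name rather than at individual subconcepts.

```python
from dataclasses import dataclass, field


@dataclass
class Subconcept:
    name: str
    synonyms: set = field(default_factory=set)
    children: list = field(default_factory=list)     # hierarchical links
    cross_links: list = field(default_factory=list)  # names of other concepts

    def terms(self):
        return {self.name} | self.synonyms


@dataclass
class Concept:
    name: str
    subconcepts: list = field(default_factory=list)


def walk(subconcepts):
    """Yield every subconcept in the hierarchy, depth first."""
    for sc in subconcepts:
        yield sc
        yield from walk(sc.children)


def match_concept(words, knowledge_base, threshold=2.0):
    """Return the best-matching concept name, or None if below threshold."""
    words = set(words)
    scores = {c.name: 0.0 for c in knowledge_base}
    for concept in knowledge_base:
        for sc in walk(concept.subconcepts):
            if sc.terms() & words:
                scores[concept.name] += 1.0   # assumed weight for a match
                for linked in sc.cross_links:
                    if linked in scores:
                        scores[linked] += 0.5  # assumed weight for a link
    best = max(scores, key=scores.get, default=None)
    if best is None or scores[best] < threshold:
        return None
    return best


# Hypothetical two-concept knowledge base.
KB = [
    Concept("meeting", [
        Subconcept("time", {"schedule", "date"}),
        Subconcept("location", {"room", "venue"},
                   children=[Subconcept("building")]),
    ]),
    Concept("travel", [
        Subconcept("flight", {"airline"}),
        Subconcept("hotel", {"lodging"}, cross_links=["meeting"]),
    ]),
]

result = match_concept(["schedule", "room", "lunch"], KB)
```

Here the words "schedule" and "room" match synonyms of two subconcepts of the hypothetical "meeting" concept, so the document is determined to relate to that concept; a document matching too few subconcepts yields no concept at all.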
A document representation generator may be provided for producing a precis of the document based on a template associated with the determined concept. An output module is provided for communicating an identification of the concept determined by the matching module.
Also in accordance with the invention there is provided a computer-readable system and method for highlighting the content of an electronic document and producing therefrom an electronic output highlight document. A concept identification system is provided according to the foregoing and a highlighter module is provided for determining key content of the input document. The highlighter module includes a comparing module for comparing content of the input document to the subconcepts of the concept knowledge base for the determined concept for purposes of determining the key content. An interface integrates the concept identification system and the highlighter module. An output module produces an output highlight document from the key content.
A document mapping module is preferably provided for producing a static document map of the content of the input document, wherein the highlighter module applies to the static document map weightings derived from determinations made by the comparing module.
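The application of weightings to the static document map may be illustrated by the following minimal sketch. The function names `apply_concept_weights` and `key_content`, the multiplicative `boost` factor and the top-N selection are assumptions made for illustration; the text does not specify how the weightings derived from the comparing module are calculated.

```python
def apply_concept_weights(stem_counts, concept_terms, boost=2.0):
    """Weight each stem by its count, boosted if it matches a subconcept term."""
    return {
        stem: count * (boost if stem in concept_terms else 1.0)
        for stem, count in stem_counts.items()
    }


def key_content(weighted, top_n=3):
    """Pick the top-N stems by weight, breaking ties alphabetically."""
    return sorted(weighted, key=lambda s: (-weighted[s], s))[:top_n]


weighted = apply_concept_weights(
    {"meet": 2, "room": 1, "lunch": 2},   # frequency counts from a document map
    {"meet", "room"},                      # terms of the determined concept
)
top = key_content(weighted, top_n=2)
```

The document map itself is left unchanged in this sketch (a new dictionary of weights is returned), which is consistent with treating the map as static.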