The invention pertains to the field of text reduction by selecting the key content thereof and, more particularly, to an apparatus and method for intelligently analyzing and highlighting key words/phrases, key sentences and/or key components of an electronic document by recognizing and utilizing the context of both the electronic document (which may be any type of electronic message such as e-mail, converted voice, fax or pager message or other type of electronic document) and the user.
The volume of information in the form of text, particularly electronic information, being communicated to users is increasing at a very high rate, and such information can take many forms, from simple voice or electronic messages to full document attachments such as technical papers, letters, etc. Because of this, there is a growing need in the communications, database management and related industries for means to intelligently condense electronic text information for purposes of assisting the user in handling such communications and for effective storage and retrieval of the information.
The known document condensers (sometimes also referred to as key word/phrase "extractors" or as "summarizers"), which typically function to identify a set of key words/phrases by utilizing various statistical algorithms and/or pre-set rules, have had limited success and limited scope for application. One such known method of condensing text is described in Canadian Patent Application No. 2,236,623 by Turney, which was laid open on 23 Dec. 1998. The Turney method disclosed by this reference relies upon a preliminary teaching procedure in which a number of pre-set teaching modules, directed to different document categories or academic fields, are provided and a selected one is run prior to using the text condenser in order to revise and tune a set of rules used by the condenser so as to produce the best results for documents of a selected category or within the selected academic field. However, such prior condensers do not advance the art appreciably because they are primarily statistically based and do not meaningfully address semantic factors. As such, they are directed to producing lengthy indices of key words and phrases per se, with the result that the relationships or concepts between those key words and phrases are often lost. They also ignore the intent of the electronic document and, hence, treat news, papers, discussions, journal papers, etc. generically.
The inventors herein have identified that the difficulty which must be overcome by any means of generating a summary of the key content of a given body of text of an electronic document lies in recognizing and accommodating the specific context of the text. This is because electronic documents of various types are typically not authored in a structured or consistent manner. In addition, in some cases the context of the user may be an important factor to be accommodated because the interpretation of the meaning of a given body of text by one reader is personal to that reader and may not be the same interpretation made by another reader.
For example, by recognizing that a given electronic document is a discussion email, as distinguished from a technical paper or a news item, a particular structure can be assigned to that text for purposes of analysis. This is because email messages are typically informal (colloquial), less structured, shorter, have less redundancy and are often continuations of earlier email messages. By contrast, technical papers typically comprise a formal language format and are themselves structured according to a standard format (such as having a title and section headings, an opening summary, a background section, etc.). Similarly, news items have associated with them a pyramid-type format, usually providing the key content within the first paragraph or two (see Mittal V. et al., "Selecting Text Spans for Document Summaries: Heuristics and Metrics", American Association of Artificial Intelligence 1999 Conference Proceedings).
It has been found that the specific type of the electronic document which is to be processed, referred to herein as the "application context", can be determined from the document text and format and from the environment of the text, which is referred to herein as the envelope of the electronic document. For example, it can be determined whether the text has an ASCII or HTML format and whether it arrived as an email or an attachment or otherwise. Text which is correspondence will typically have an opening salutation such as "Dear John", a main body of text and a signature block with one of the words "regards", "truly", "sincerely", etc. Email discussions of an on-going nature may have been forwarded or may be part of a reply message, and some of their content may be indented by the de facto standard character ">". Once the application context of the electronic document has been determined, the highlighting process can be assisted by differentiating between the envelope and the text components of the document; for example, on the basis of this information any superfluous information such as the salutation and signature block may be identified and removed. The particular application context may also dictate the handling of certain information which is typically relevant to that context.
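The envelope cues described above can be illustrated with a minimal Python sketch. It is not the implementation of the invention: the function names describe_envelope and strip_envelope, the regular expressions and the HTML test are all assumptions introduced here for illustration, and only the cues themselves (a "Dear" salutation, a sign-off containing "regards", "truly" or "sincerely", and ">" quoting) come from the text.

```python
import re

# Assumed patterns: a salutation line starting with "Dear", and a short
# sign-off line ending in one of the words named in the text.
SALUTATION = re.compile(r"^\s*dear\b.*$", re.IGNORECASE)
SIGNOFF = re.compile(
    r"^\s*(?:\w+\s+){0,2}(?:regards|truly|sincerely)\b[\s,]*$", re.IGNORECASE
)


def describe_envelope(text: str) -> dict:
    """Collect simple cues about the application context of a message."""
    lines = text.splitlines()
    non_blank = [ln for ln in lines if ln.strip()]
    return {
        # Crude HTML check (assumption): look for a few common tags.
        "is_html": bool(re.search(r"<html\b|<body\b|<p\b", text, re.IGNORECASE)),
        "has_salutation": bool(non_blank) and bool(SALUTATION.match(non_blank[0])),
        "has_signoff": any(SIGNOFF.match(ln) for ln in non_blank),
        # Lines quoted from an earlier message with the ">" character.
        "quoted_lines": sum(1 for ln in lines if ln.lstrip().startswith(">")),
    }


def strip_envelope(text: str) -> str:
    """Drop the leading salutation and everything from the sign-off onward."""
    kept = []
    seen_content = False
    for ln in text.splitlines():
        if not seen_content and (not ln.strip() or SALUTATION.match(ln)):
            continue  # skip leading blank lines and the salutation
        if SIGNOFF.match(ln):
            break  # the signature block starts here
        seen_content = True
        kept.append(ln)
    return "\n".join(kept).strip()
```

Under these assumptions, a message that opens with "Dear John," and closes with "Best regards," followed by a name is reduced to its main body, while the count of ">" lines hints that the message is part of an on-going email discussion.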
Additional context information relating to an electronic document, referred to herein as the "user context", which can be useful to infer the meaning of the text of that document, may be obtained from knowledge of the user. That is, knowledge of the specific user context might, in some cases, assist in a determination as to which components of a given body of text are relevant. One example of this, which would apply to the optimal automation of a personal text highlighter used, say, for processing one's received electronic messages, is that an electronic document which has been recognized to be a product/service advertisement of the type (as determined from the envelope, for example) which the user normally deletes could simply be truncated without any analysis applied to it; this would occur where it has been learned from the user context that the particular user is not interested in the content of such a document. On the other hand, advertisements which are targeted to the user through pre-selected identifiers could instead be highlighted for the user. Further examples in which the user context may be effectively utilized include the situation where correspondence received from one sender may be more important to the user than correspondence from another sender, where the time of receipt of certain correspondence may determine a particular importance level to the user, and where specific words may be used more frequently by the user and these might be associated with a particular degree of relevance. Thus, the behaviour pattern and the situation of the user provide additional context parameters on which a process for highlighting the key components of the text of an electronic document may be based.
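The user-context decisions described above might be sketched as follows. This is an illustrative assumption rather than the disclosed implementation: the UserContext fields, the triage function, the action labels and the default weight of 1.0 are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class UserContext:
    """Hypothetical record of what has been learned about the user."""
    # Document categories the user normally deletes (e.g. some advertisements).
    ignored_categories: set = field(default_factory=set)
    # Pre-selected identifiers marking content targeted to the user.
    targeted_identifiers: set = field(default_factory=set)
    # Relative importance of correspondence per sender.
    sender_weights: dict = field(default_factory=dict)


def triage(category, sender, identifiers, ctx):
    """Return an (action, weight) pair for an incoming document."""
    if set(identifiers) & ctx.targeted_identifiers:
        return ("highlight", 1.0)  # targeted content is highlighted
    if category in ctx.ignored_categories:
        return ("truncate", 0.0)  # truncated without further analysis
    # Otherwise analyze, scaled by how important the sender is to the user.
    return ("analyze", ctx.sender_weights.get(sender, 1.0))
```

In this sketch a targeted advertisement takes precedence over the rule for ignored categories, which is one possible reading of the two advertisement cases described above.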
Reference herein to "highlighting" means an electronic process of selecting the key components of a given body of electronic text (e.g. in the form of key words/phrases, key sentences or parts thereof and/or key elements thereof, and not simply a string of disjointed keywords), the result appearing analogous to that which would be obtained by the commonly used manual method of highlighting a printed copy of the text using a fluorescent ink marker.
In accordance with the invention there is provided computer-readable apparatus for highlighting the content of a user's electronic input document and producing therefrom an electronic output highlight document. An application context module is provided for determining with respect to the input document the type of document it is. A user context module determines the context of the user with respect to the input document. A highlighter module determines at least a portion of the key content of the input document, up to a predetermined maximum data size, at least in part on the basis of the determinations made by the application and user context modules. Means are provided for producing the output highlight document from the key content.
Preferably a document mapping module is provided for producing a static document map of the content of the input document, wherein the highlighter module applies to the static document map weights and/or conditions derived from the determinations made by the application and user context modules to determine key content therefrom. The key content may comprise key words/phrases, key sentences and/or key components of the input document. The determination of key content by the highlighter module may result from mathematically calculating scores in respect of the content of the document map. A portion of the key content may be determined by one or both of the application and user context modules, and the application context, user context and highlighter modules determine the key content on a graduated basis whereby content is excluded only if necessary in order to satisfy the limitation of the predetermined maximum data size.
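One way to picture the scoring and the graduated exclusion is the following Python sketch. It rests on several assumptions not stated in the text: the document map is modelled as a list of (sentence, features) pairs, the score is a simple weighted sum, the feature names are hypothetical, and the maximum data size is measured in UTF-8 bytes of the joined sentences.

```python
def highlight(doc_map, weights, max_bytes):
    """Select key sentences from a static document map.

    doc_map:   list of (sentence, features) pairs, features being a dict
               of numeric values computed once for the document.
    weights:   per-feature weights derived from the application and user
               context (assumed form).
    max_bytes: the predetermined maximum data size.
    """
    def score(features):
        # Weighted sum of the static features.
        return sum(weights.get(name, 0.0) * value
                   for name, value in features.items())

    def size(indices):
        return len(" ".join(doc_map[i][0] for i in indices).encode("utf-8"))

    kept = list(range(len(doc_map)))
    # Graduated exclusion: remove the lowest-scoring entry, one at a time,
    # only while the size limit is still exceeded.
    while kept and size(kept) > max_bytes:
        kept.remove(min(kept, key=lambda i: score(doc_map[i][1])))
    # Surviving sentences are returned in their original order.
    return [doc_map[i][0] for i in kept]
```

Because the document map is static, only the weights change from one context to another; a document that already fits within the limit is returned whole, and content is dropped only as far as the limit requires.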
Also in accordance with the invention there is provided a method for highlighting the content of a user's electronic input document and producing therefrom an electronic output highlight document, the method comprising the steps of: determining with respect to the input document the type of document it is; determining the context of the user with respect to the input document; determining at least a portion of the key content of the input document, up to a predetermined maximum data size, at least in part on the basis of the determinations of the type of document it is and the context of the user; and producing the output highlight document from the key content.
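The four steps of the method can be strung together in a short Python sketch. Every rule inside the helper functions is a placeholder assumption made for illustration (for instance, dropping quoted lines in an email discussion and halving the size budget for senders not marked as important); only the order and purpose of the steps are taken from the text.

```python
def determine_document_type(text):
    # Placeholder rule: quoted reply lines suggest an email discussion.
    if any(ln.lstrip().startswith(">") for ln in text.splitlines()):
        return "email_discussion"
    return "other"


def determine_user_context(sender, important_senders):
    return {"sender_is_important": sender in important_senders}


def determine_key_content(text, doc_type, user_ctx, max_chars):
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if doc_type == "email_discussion":
        # Assumed handling: text quoted from earlier messages is dropped.
        lines = [ln for ln in lines if not ln.startswith(">")]
    # Assumed handling: less important senders get half the size budget.
    budget = max_chars if user_ctx["sender_is_important"] else max_chars // 2
    key, used = [], 0
    for ln in lines:
        if used + len(ln) > budget:
            break  # stop once the maximum data size would be exceeded
        key.append(ln)
        used += len(ln)
    return key


def produce_highlight_document(key_content):
    return "\n".join(key_content)


def highlight_document(text, sender, important_senders, max_chars):
    doc_type = determine_document_type(text)                      # step 1
    user_ctx = determine_user_context(sender, important_senders)  # step 2
    key = determine_key_content(text, doc_type, user_ctx, max_chars)  # step 3
    return produce_highlight_document(key)                        # step 4
```

The sketch keeps each step as a separate function so that the determinations of document type and user context remain visible inputs to the key-content step, mirroring the way the method is stated.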