The present invention relates to systems and methods for text processing and, in particular, to text structuring and generation of new text having varying degrees of compression.
The desirability of summaries, annotations and abstracts has greatly increased in recent years because of the large quantity of publicly available on-line, machine-readable information. The generation of summary documents serves a valuable function by reducing the time required to review and understand the substance of one or more full-length documents. The generation of document summaries, annotations or abstracts can be performed manually or automatically. Manual summarization relies on an individual summarizing the document, and can be costly, time-consuming and inaccurate. Summaries generated automatically, however, can be produced more efficiently, more cheaply and with greater accuracy.
Conventional text processing techniques for natural language typically treat text as a sequence of codes. The codes used include alphabetic and numeric character codes, as well as punctuation mark codes and carriage-control codes that indicate carriage operations such as spaces, tabs and carriage returns.
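By way of illustration only, the conventional treatment of text as a sequence of codes may be sketched as follows. The function names `classify_code` and `as_code_sequence`, the category labels, and the choice of Unicode code points as the code values are assumptions made for this example and are not taken from any particular prior system.

```python
import string

# Characters treated here as carriage-control codes: space, tab,
# carriage return and line feed (an assumption for this sketch).
CARRIAGE_CONTROL = " \t\r\n"


def classify_code(ch: str) -> str:
    """Label one character with the kind of code it represents."""
    if ch.isalpha():
        return "alphabetic"
    if ch.isdigit():
        return "numeric"
    if ch in CARRIAGE_CONTROL:
        return "carriage-control"
    if ch in string.punctuation:
        return "punctuation"
    return "other"


def as_code_sequence(text: str) -> list[tuple[int, str]]:
    """Return the text as a flat sequence of (code value, code kind) pairs."""
    return [(ord(ch), classify_code(ch)) for ch in text]


if __name__ == "__main__":
    for value, kind in as_code_sequence("Page 2,\tline 7.\n"):
        print(value, kind)
```

Such a representation records only the identity of each character. It carries no information about words, phrases or sentences, so any grouping of codes into meaningful units must be supplied by later processing.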
The processing of natural language text is computationally intensive. Producing semantically correct summaries and abstracts is difficult using natural language processing when document content is not limited. Two of the most difficult processes in automated text structuring of natural language text (particularly when ambiguity of text is considered) are: 1) automatically explicating from text all meaningful groups of words, phrases, and simple and compound sentences; and 2) automatically encapsulating meaningful groups of words within the boundaries of a generalized notion that is considered a text unit of coarser granularity.