The present invention relates to techniques for processing natural language text that take into account its punctuation. More specifically, the invention relates to data structures that include information about the punctuational structure of natural language text.
Conventional data processing techniques for natural language ordinarily treat text as a sequence of codes. The codes used include alphabetic and numeric character codes as well as punctuation mark codes and carriage-control codes that indicate carriage operations such as spaces, tabs and carriage returns. When the text is presented, by printing or display, these codes control the placement of alphanumeric characters and other marks and symbols.
A number of non-printing characters that facilitate text editing, document layout, and page formatting are described on pages 52-56 of Text Editing, VP Series Reference Library, Version 1.0, Xerox Corporation, 1985, pp. 47-56, describing the ViewPoint Document Editor available from Xerox Corporation. A user can display these characters during editing, for example, for use in document formatting and layout. The characters described include special formatting characters, such as spaces, tabs, and new paragraph characters. The characters also include structure characters, including page format characters, field bounding characters, and frame anchors.
The ViewPoint Document Editor also enables the user to select and operate on units of text based on the codes corresponding to these non-printing characters. For example, as described at pages 47-52 of Text Editing, text can be selected in units of characters, words, sentences or paragraphs using multiple mouse button clicks. The editor uses special rules to interpret text as words or sentences. The rules treat each grouping of text characters as a word, and include spaces depending on the presence or absence of a trailing space and a leading space. The rules treat each sequence of words and symbols that is bounded by punctuation marks as a sentence, and include spaces depending on the presence or absence of spaces after the trailing punctuation mark and before the first character of the sentence.
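By way of illustration only, a word-selection rule of the general kind described above can be sketched as follows. The sketch is not the ViewPoint implementation. The function name `select_word` and the specific preference (take a trailing space if one is present, otherwise a leading space) are assumptions made for the example, since the cited reference is summarized here only in general terms.

```python
def select_word(text: str, pos: int) -> tuple[int, int]:
    """Return (start, end) offsets of the word containing pos.

    Hypothetical rule, assumed for illustration: the selection also takes
    one trailing space if present, otherwise one leading space if present.
    """
    if pos < 0 or pos >= len(text) or text[pos].isspace():
        return (pos, pos)  # nothing to select at this position

    # Grow the selection outward to the edges of the run of non-space characters.
    start = pos
    while start > 0 and not text[start - 1].isspace():
        start -= 1
    end = pos + 1
    while end < len(text) and not text[end].isspace():
        end += 1

    # Include an adjacent space so that deleting the word leaves clean spacing.
    if end < len(text) and text[end] == " ":
        end += 1
    elif start > 0 and text[start - 1] == " ":
        start -= 1
    return (start, end)
```

With the text "the quick fox", selecting at an offset inside "quick" yields the span for "quick " with its trailing space, while selecting inside "fox" yields " fox" with its leading space, because no trailing space exists.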
Various other commercial products have features similar to those of ViewPoint, including selection commands from the keyboard or with a mouse or similar pointer control device. Conventionally, a single click with a pointer control device button selects a region that starts at the character boundary nearest the position of the pointer at the time of the click. In one approach, the region selected by a single click contains no characters, but may be extended one character at a time by moving the pointer over the characters to be added. In another approach, the region selected by a single click contains one character, and may be extended arbitrarily by a single click of a different button with the pointer at the desired ending point of the selection.
It is also conventional to provide selection by double-clicking, or clicking twice in succession with the pointer at the same position. Double-clicking usually selects the word most closely surrounding the pointer position, and subsequent adjustments of the selection are usually made a word at a time. For example, the Macintosh personal computer from Apple Computer, Inc. provides a user interface in which multiple clicking selects a word. Word, a commercial text editor from Microsoft Corporation, provides extension of such a selection to additional full words. Microsoft Word and other text editors, including WordPerfect from WordPerfect Corporation and Emacs, available with source code from Free Software Foundation, Cambridge, Mass., allow selection of a sentence and extension of such a selection to additional full sentences. Microsoft Word and Fullwrite Professional from Ashton-Tate Corporation further allow selection by paragraph. Fullwrite Professional also allows the user to provide a quotation mark without indicating whether it opens or closes a quote; the software supplies the appropriate open or close quotation mark based on previous marks.
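For illustration, the choice of an open or close quotation mark based on previous marks can be sketched as below. This is a minimal sketch and not the Fullwrite Professional implementation. The names `next_quote` and `type_text`, and the rule of simply inspecting the most recent quotation mark, are assumptions made for the example.

```python
OPEN_QUOTE = "\u201c"   # left double quotation mark
CLOSE_QUOTE = "\u201d"  # right double quotation mark


def next_quote(preceding: str) -> str:
    """Choose an open or close mark from the marks already in the text.

    Assumed rule: scan backward for the most recent quotation mark.
    If it was an open mark, the new mark closes it; otherwise the new mark opens.
    """
    for ch in reversed(preceding):
        if ch == OPEN_QUOTE:
            return CLOSE_QUOTE
        if ch == CLOSE_QUOTE:
            return OPEN_QUOTE
    return OPEN_QUOTE  # no earlier mark, so this one opens a quote


def type_text(keys: str) -> str:
    """Simulate typing, replacing each undirected '"' with a directed mark."""
    out: list[str] = []
    for ch in keys:
        out.append(next_quote("".join(out)) if ch == '"' else ch)
    return "".join(out)
```

Typing `He said "hi" and "bye"` through `type_text` produces alternating open and close marks around each quoted word. A rule this simple does not handle nested or unbalanced quotations.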
Text Editing and Processing, Symbolics, Inc., #999020, July 1986, pp. 24-31 and 63-111, describes text editing features of a version of Emacs called "Zmacs." Pages 67-70 describe mouse operations, including clicking on a word to copy the whole word; on a parenthesis to copy it, its matching parenthesis, and the text between them; on a quotation mark to copy it, its matching quotation mark, and the text between them; or after or before a line to copy the whole line. Appropriate spaces are placed before inserted objects, so that spaces are automatically inserted around an inserted word or sentence. Pages 71-75 describe motion commands, including motion by word, meaning a string of alphanumeric characters; by sentence, ending with a question mark, period, or exclamation point that is followed by a newline, or by a space and then either a newline or another space, with any number of closing characters between the sentence-ending punctuation and the white space that follows; and by line, delimited by a newline. Page 79 describes motion by paragraph, delimited by a newline followed by blanks, a blank line, or a page character alone on a line; page 80 describes motion by page, delimited by a page character. Chapter 5, pages 83-97, describes deleting and transposing text, with pages 87-89 describing how contents of a history are retrieved. Chapter 6, pages 99-111, describes working with regions, and discusses point and mark.
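The sentence-ending rule summarized above lends itself to a short illustration. The following sketch is not Zmacs code. The set of closing characters (right parenthesis, right bracket, and quotation marks) and the treatment of the end of the text as a sentence end are assumptions made for the example.

```python
import re

# A terminator (. ? !), then any number of assumed closing characters,
# followed by a newline, or by a space and then a newline or another space.
# \Z (end of text) is an added assumption so the final sentence is found.
SENTENCE_END = re.compile(r"""[.?!][)\]"']*(?=\n| [ \n]|\Z)""")


def sentence_ends(text: str) -> list[int]:
    """Return the offset just past each sentence end found in text."""
    return [m.end() for m in SENTENCE_END.finditer(text)]


sample = "He left.  She stayed (gladly!)\nMr. Smith slept."
print(sentence_ends(sample))  # prints [8, 30, 47]
```

In the sample, the period in "Mr." is followed by only one space and so is not treated as a sentence end, which shows how a rule based purely on character codes depends on spacing conventions rather than on the structure of the text.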
Conventional line-breaking and pagination techniques take punctuation into account. These techniques apply rules to the sequence of codes to determine break points between lines or pages.
Kaplan, R. M. and Bresnan, J., "Lexical-Functional Grammar: A Formal System for Grammatical Representation," in Bresnan, J. (Ed.), The Mental Representation of Grammatical Relations, Cambridge, MIT Press, 1982, pp. 173-281, describe in section 4.1 how to assign a structure to a sentence by following the ordinary rewriting procedure for context-free grammars for a set of rules. Section 6 shows that functional structure in lexical-functional grammar (LFG) is an autonomous level of linguistic description, with a mixture of syntactically and semantically motivated information but distinct from both constituent structure and semantic representation.
A variety of patent documents relate to natural language punctuation.
Kucera et al., U.S. Pat. No. 4,773,009, describe a text analyzer that analyzes strings of digitally coded text to determine paragraph and sentence boundaries. As shown and described in relation to FIGS. 3-4, each string is broken down into component words. Possible abbreviations are identified and checked against a table of common abbreviations to identify abbreviations that cannot end a sentence. End punctuation and the following string are analyzed to identify the terminal word of a sentence. When sentence boundaries have been determined, a grammar checker, punctuation analyzer, readability analyzer, or other higher-level text processing can be applied.
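For illustration only, boundary detection of this general kind can be sketched as follows. The sketch is not the method of Kucera et al. The abbreviation table contents, the function name `split_sentences`, and the test that the following word begins with a capital letter are all assumptions made for the example.

```python
# Hypothetical table of abbreviations that cannot end a sentence.
NON_TERMINAL_ABBREVIATIONS = {"mr.", "mrs.", "dr.", "e.g.", "i.e."}


def split_sentences(text: str) -> list[str]:
    """Break text into words, then group the words into sentences."""
    words = text.split()
    sentences: list[str] = []
    current: list[str] = []
    for i, word in enumerate(words):
        current.append(word)
        if not word.endswith((".", "?", "!")):
            continue  # no end punctuation on this word
        if word.lower() in NON_TERMINAL_ABBREVIATIONS:
            continue  # listed abbreviation: cannot end a sentence
        # Assumed analysis of the following string: a boundary requires
        # either the end of the text or a capitalized next word.
        following = words[i + 1] if i + 1 < len(words) else None
        if following is None or following[0].isupper():
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences
```

Applied to "Dr. Lee arrived at 5 p.m. and left. Nobody saw Mr. Ray go.", the sketch returns two sentences: "Dr." and "Mr." are rejected by the table, and "p.m." is rejected because the following word is not capitalized.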
Kumano et al., EP-A 230,339, describe a machine translation system that includes punctuation mark generation, as shown and described in relation to FIGS. 6-12C. The insertion of punctuation is based on the syntactic structure of a translated sentence.
Sakaki et al., U.S. Pat. No. 4,599,691, describe a tree transformation system for machine translation of natural language. A sentence is pre-processed and processed taking into account its end mark and using parsing grammars illustrated at col. 6, lines 22-57. The result is a tree structure, as shown and described in relation to FIGS. 1-5. Tree fragments are used in translation, as described in the remainder of the patent.
Nitta et al., U.S. Pat. No. 4,641,264, describe a natural language translation method in which a sentence is analyzed into phrases, as shown and described in relation to FIG. 2, in the process of translation. As shown in Table 1A, commas and periods are treated as parts of speech. A tree/list analysis pattern storage area includes pattern tables for verbs, conjunctions, clauses, and sentences, examples of which are shown and described in relation to FIGS. 8A-8D. These patterns are used in the process of obtaining a sentence's skeleton pattern, which is then used in translation.
Katayama et al., EP-A 180,888, describe natural language processing techniques that use a dictionary of grammatical rules based on categories such as predicate, noun, etc., as shown and described in relation to Table 4. The grammatical rules are used to process a sentence, as shown and described in relation to FIG. 4, until an end mark is reached. Semantic analysis, as shown and described in relation to FIG. 6, is also used to obtain a syntax tree, as shown in FIG. 7.
Amano et al., U.S. Pat. No. 4,586,160, describe techniques for analyzing a natural language sentence's syntactic structure. The syntactic category of each word in a sentence is obtained by consulting a dictionary, but if the word is not in the dictionary, a category is supplied independently of the dictionary. As discussed at col. 2, lines 59-64, the position of a punctuation mark is also noted. As shown and described in relation to FIGS. 2 and 3, the words are combined to form upper order sentence units.
Yoshida, U.S. Pat. No. 4,594,686, describes an electronic language interpreter. FIG. 3 shows memory contents representing information about inflection of an adjective, specifically relating to addition of an umlaut. As shown in relation to Tables 1-4, the appropriate inflection of a word depends on how it is used. A first memory stores uninflected forms of words in one language. A second memory stores words in a second language equivalent to each word in the first memory. A third memory contains data indicating inflection principles used to properly inflect the forms in the first memory based on inflection selection by the user.
Snow, U.S. Pat. No. 4,597,057, describes a text compression system in which a punctuation sequence, including common punctuation marks, spaces, tabs, end of line sequences, form feeds, capitalization, underlining and so forth, is encoded as a punctuation token, as shown and described in relation to FIGS. 1, 2, and 9.
Lange et al., U.S. Pat. No. 4,674,065, describe a system for automatically proofreading a document for word use validation. As shown and described in relation to Table 3, the rules for inspecting text surrounding a potentially confusable word take punctuation into account, including period, comma, blank and capitals.