The present invention relates to the field of fully automated, linguistic analysis of unrestricted text in different languages. Specifically, the present invention relates to an automatic method, and a corresponding apparatus, for segmentation of a stream of text elements comprising analyzed tokens into one or more initial clauses.
Although current technology for parsing whole sentences in unrestricted text has improved in recent years, the level of parsing accuracy is still not sufficient to support long-intended applications of parsing technology to information systems. For example, existing information systems cannot extract from unrestricted text specific pieces of information that are parallel in lexical, constructional and semantic respects.
Examples of parallel pieces of information are portions of text that have the same agent (=grammatical subject), or the same acted upon (=grammatical object), or involve the same action (=content verb). Such extraction of information is currently only possible from texts in restricted domains. This is due to the fact that commonly used methods for information extraction crucially depend on manually acquired, domain-specific world knowledge. Consequently, there are large and growing bodies of texts that contain valuable pieces of information that cannot be accessed by standard techniques of information retrieval, because the latter are currently restricted to retrieval of whole documents.
One principal reason why current parsing technology fails to achieve the accuracy required for large-scale applications to unrestricted text is the well-known observation in the art that the performance of parsers degrades as the length of input sentences increases. This is due to the fact that parsers target full sentences as the units to parse. As the length of a sentence increases, so does the combinatorial explosion of alternative ways to combine the well-formed substrings of a sentence that the parser has found.
In order to improve the coverage and accuracy of parsers for unrestricted text, a new divide-and-conquer strategy is emerging in parsing. The strategy involves the use of simple, finite state parsing techniques in a phase that is preparatory to 'real' parsing, which uses more complex techniques. The object of the preparatory stage is to partition text exhaustively into a sequence of units referred to as chunks or segments, in order to facilitate and improve later processing.
Clause segmentation is emerging as a recognized problem area. However, there is no agreement among practitioners in the field on the definition of the clauses that should result from clause segmentation, or on terminology. Units that are clauses or 'clause-like' are referred to by many different names.
For the purpose of the discussion in this background section, a simple clause is a unit of information that roughly corresponds to a simple proposition, or fact. Current information retrieval technology is not based on clauses as units of information that can be used in rapid creation of databases of reported facts that involve agents and actions of interest to end-users of information systems. An important motivation for clause segmentation is that it enables automatic recognition of basic grammatical relations within clauses (subject, object, etc.). Because of this, clause segmentation makes it possible for later processes to determine which pieces of text exhibit lexical, constructional and semantic parallelism of information.
Existing methods for identifying clauses and segmenting text into clauses rely on first finding phrases within sentences, such as noun phrases and other phrases, before finding clause units within sentences. When clause units have been found, they make it possible to determine clause boundaries, i.e. where a clause begins and ends.
In Nelson, W. and Kucera, H., "Frequency Analysis of English Usage", 1982, Houghton Mifflin Company, Boston, pp. 549-556, hereafter NelsonandKucera-1982, Kucera used a finite state automaton for finding verb groups in part-of-speech tagged text in the Brown corpus, and for classifying verb groups into finite and non-finite. A verb group is finite if it contains a verb in the present or past tense. A verb group is non-finite if it contains no tensed verb, i.e. if it consists of an infinitive or a present or past participle. It is commonly agreed in traditional and modern grammar that a verb group implies a predication, equivalently a clause, and that finite and non-finite predications are syntactically distinct, though related types of predications.
The disadvantage of Kucera's 1982 finite state automaton is that it does not address the problem of identifying the location of boundaries between predication units, i.e. it is not a method that segments text into predication units. Although a subsequent patent entitled 'Sentence analyzer' to Kucera et al (U.S. Pat. No. 4,864,502) indirectly locates clause boundaries, this technique is based on first finding phrases within sentences, followed by identification of clauses, and thereafter clause boundaries.
Other techniques that analyze sentences internally first, before locating clause boundaries, are known, for example: Grefenstette, G., "Light parsing as finite state filtering", in A. Kornai (Ed), Extended Finite State Models of Language, 1999, Cambridge University Press, Cambridge, U.K., pp. 86-94; and Ramshaw, L. and Marcus, M., "Text chunking using transformation-based learning", in Proceedings of the Third Workshop on Very Large Corpora, D. Yarowsky and K. Church, Eds, June 1995, M.I.T., Cambridge, Mass., pp. 82-94. These techniques use finite state marking transducers on part-of-speech tagged text as input. The marking transducers mark both contiguous groups of nouns and contiguous groups of verbs in the output. A sentence is implicitly equated with a predication, which is assumed to be a combination of one verb group with one or more noun groups.
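The finite/non-finite distinction stated above can be illustrated with a minimal Python sketch. It is not Kucera's automaton; it only encodes the definition given in the text. The tag names (V_PRES, V_PAST, V_INF, V_PASTPART) are hypothetical placeholders and do not correspond to the Brown corpus tagset.

```python
from typing import Sequence

# Hypothetical tags for tensed verb forms (assumption, not a real tagset).
TENSED_TAGS = {"V_PRES", "V_PAST"}


def classify_verb_group(tags: Sequence[str]) -> str:
    """Label a verb group 'finite' if any verb in it is tensed, else 'non-finite'."""
    if any(tag in TENSED_TAGS for tag in tags):
        return "finite"
    return "non-finite"
```

Under these assumptions, a group tagged V_PAST followed by V_PASTPART (e.g. "had gone") is classified as finite, whereas a lone V_INF is classified as non-finite.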
A serious problem with this approach is that it gives poor results for sentences that consist of several clauses. The reason is that group marking transducers typically do not recognize sentence-internal clauses as clausal units.
There are other known techniques for clause segmentation, described in: Ejerhed, E., "Finding clauses in unrestricted text by finitary and stochastic methods", in Second Conference on Applied Natural Language Processing, 1988, ACL, Austin, Tex., pp. 219-227, and in Abney, S. P., "Rapid incremental parsing with repair", in Proceedings of the 6th New OED Conference, 1990, Waterloo, Ontario, University of Waterloo, pp. 1-9. For Ejerhed's and Abney's techniques, the input to the recognition of clause segments consists of part-of-speech tagged text, in which basic noun phrases have also been recognized by probabilistic techniques as described by U.S. Pat. No. 5,146,405 to Church. A problem for both of these techniques is the following. If the recognition of a basic noun phrase is not correct, then this may result in an error in clause segmentation. For example, if a long noun phrase that has been recognized really should be analyzed as two noun phrases, then a possible clause boundary location is inaccessible.
In the framework of Constraint Grammar, there is also a module for detecting sentence internal clause boundaries, described in Karlsson et al, "Constraint Grammar: A language independent system for parsing unrestricted text", 1995, Mouton de Gruyter, Berlin/New York, pp. 1-430. However, the authors report (on pages 213, 238) that the mechanism for identifying sentence internal clause boundaries is problematic and rather unsophisticated, and as a result, the other modules of constraint syntax to a great extent have to do without it.
An objective of the present invention is to provide an improved method for clause boundary detection and segmentation of unrestricted text into clauses that is not subject to the foregoing disadvantages of existing methods for these tasks.
The invention is based on the recognition that an unrestricted text can be segmented into initial clauses using a method whose number of computations only increases linearly with the number of text elements in the text to be segmented. Furthermore, the proposed method according to the invention, in spite of its restricted number of computations, gives rise to a segmentation into initial clauses that is surprisingly useful in applications, such as automatic extraction of information from unrestricted text using a computer.
According to one aspect of the present invention, a method for segmentation of a stream of text elements comprising analyzed tokens into one or more initial clauses is provided. According to the method, a predetermined number of consecutive text elements of the stream of text elements are scanned, starting from a given position. The predetermined number of consecutive text elements are compared with each pattern of a set of patterns for beginnings of initial clauses. Furthermore, if said predetermined number of consecutive text elements match one pattern of said set of patterns for beginnings of initial clauses, a beginning of an initial clause is identified in said predetermined number of consecutive text elements. The scanning, comparison and identification are then repeated, wherein the given position is moved at least one position forward between each repetition.
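The scan, compare, identify and advance steps of this aspect can be sketched in Python as follows. This is an illustration under stated assumptions, not the claimed implementation: the window size of 2, the part-of-speech tag names and the two patterns are hypothetical, since the actual pattern lists are language specific and are not given in this section.

```python
from typing import List, Sequence, Tuple

# Hypothetical analyzed token: (word form, part-of-speech tag).
Token = Tuple[str, str]

# Hypothetical patterns for beginnings of initial clauses, as tag sequences.
CLAUSE_BEGIN_PATTERNS: List[Tuple[str, ...]] = [
    ("COMMA", "SUBCONJ"),  # e.g. ", because"
    ("COMMA", "RELPRON"),  # e.g. ", which"
]

WINDOW = 2  # assumed predetermined number of consecutive text elements


def find_clause_beginnings(tokens: Sequence[Token]) -> List[int]:
    """Return the positions whose window matches a clause-beginning pattern."""
    beginnings = []
    position = 0
    while position + WINDOW <= len(tokens):
        # Scan a fixed number of consecutive text elements from the position.
        window_tags = tuple(tag for _, tag in tokens[position:position + WINDOW])
        # Compare the window with each pattern in the set.
        if any(window_tags == pattern for pattern in CLAUSE_BEGIN_PATTERNS):
            beginnings.append(position)
        # Move the given position forward by one before repeating.
        position += 1
    return beginnings
```

For the hypothetical tagged input He/PRON left/VFIN ,/COMMA because/SUBCONJ it/PRON rained/VFIN, the function returns [2], the position of the window that starts at the comma. Because the window size and the pattern set are fixed, the loop body runs at most once per text element.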
By moving the given position at least one position forward between each repetition, the number of times the scanning, comparison and identification are made increases linearly with the number of text elements in the stream of text elements to be segmented. Furthermore, it has been shown that the segmentation resulting from the method according to the invention facilitates subsequent, automated information extraction from unrestricted text to an extent that previously has been anticipated not to be possible with such methods. This is due to the recognition of the empirical fact that the location of a clause beginning in any language is decidable on the basis of iteratively inspecting a predetermined number of consecutive text elements. Either the predetermined number of text elements contains a sequence that is a clause beginning according to a short language-specific list of such sequences, or it does not. Furthermore, it is also due to the recognition of the distributional fact that the restrictions on co-occurrence of content words within clauses are numerous and strong, whereas the restrictions on co-occurrence of content words across clause boundaries are fewer and weaker. For this reason, an initial clause fulfills the requirement of being a maximally independent processing unit. This can be utilized by an incremental sentence processing model based on the invention.
One distinctive feature of the method according to the invention is that it reverses the order used in all prior art of first doing sentence internal parsing, before recognizing clauses and clause boundaries. Instead, the method of the invention first recognizes clause boundaries in text that has only been part-of-speech tagged and not parsed, before doing clause internal parsing. This reversal of order of processing improves the accuracy and robustness of clause boundary detection in that dependence on prior processing steps is minimized. The reversal is manifested in that the input to the method according to the invention is a stream of text elements comprising analyzed tokens. Thus, the pre-analysis of an unrestricted text that gives rise to the stream of text elements is limited to an analysis at the level of individual word tokens, including punctuation tokens.
Another distinctive feature of the invention is that the clause boundaries that are recognized early in a sequence of text analysis steps are boundaries of linguistic units that are here termed initial clauses. Initial clauses have the property of being non-recursive, i.e. no two initial clauses overlap each other.
In one embodiment of the method according to the invention, a marker for begin initial clause is inserted into said predetermined number of consecutive text elements in response to an identified beginning of an initial clause in the step of identifying. This has the advantage that it simplifies subsequent analysis of the segmented text. However, it is to be noted that any other way of indicating in the segmented text the clause boundaries, such as having a pointer pointing at the locations of initial clause beginnings, is equally applicable.
Furthermore, in the above embodiment, each pattern of said set of patterns is preferably associated with an action and the marker for begin initial clause is inserted into the predetermined number of consecutive text elements in accordance with the action associated with the pattern that the predetermined number of consecutive text elements match. Furthermore, the action preferably determines in which position of the predetermined number of consecutive text elements the marker for begin initial clause is to be inserted.
To further facilitate subsequent analysis of the segmented text resulting from the method according to the invention, one embodiment of the method includes the indication, for each marker for begin initial clause, of which pattern of said patterns for beginnings of initial clauses caused the insertion of the marker.
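The embodiments above, in which each pattern carries an action that fixes the marker position and each marker records the pattern that caused it, can be sketched in Python as follows. All specifics are assumptions made for illustration: the Pattern type, the pattern names P1 and P2, the tag names, the window size of 2 and the marker format <clb:NAME> do not come from the specification.

```python
from typing import Dict, List, NamedTuple, Sequence, Tuple

Token = Tuple[str, str]  # hypothetical (word form, part-of-speech tag)


class Pattern(NamedTuple):
    name: str              # identifier recorded with each inserted marker
    tags: Tuple[str, ...]  # tag sequence compared with the window
    offset: int            # action: window position that receives the marker


# Hypothetical patterns; P2 places the marker after the comma.
PATTERNS = [
    Pattern("P1", ("SUBCONJ", "PRON"), 0),
    Pattern("P2", ("COMMA", "RELPRON"), 1),
]
WINDOW = 2  # assumed predetermined number of consecutive text elements


def mark_clause_beginnings(tokens: Sequence[Token]) -> List[str]:
    """Return the word forms with begin-initial-clause markers inserted."""
    insert_at: Dict[int, str] = {}  # token index -> name of matching pattern
    for position in range(len(tokens) - WINDOW + 1):
        window_tags = tuple(tag for _, tag in tokens[position:position + WINDOW])
        for pattern in PATTERNS:
            if window_tags == pattern.tags:
                # The action decides where in the window the marker goes.
                insert_at.setdefault(position + pattern.offset, pattern.name)
                break
    output: List[str] = []
    for index, (word, _) in enumerate(tokens):
        if index in insert_at:
            output.append(f"<clb:{insert_at[index]}>")
        output.append(word)
    return output
```

For the hypothetical input The/DET book/NOUN ,/COMMA which/RELPRON fell/VFIN, the result is The, book, ",", <clb:P2>, which, fell: the marker lands after the comma, as directed by the offset of P2, and names the pattern that produced it. A list of pointers into the token stream would serve equally well, as the text notes.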
According to another embodiment of the invention, an additional set of steps is performed for each of the initial clauses. According to this embodiment, the text elements of each initial clause are scanned and compared with each pattern of a set of patterns for multiple finite verbs. If the text elements of an initial clause match one pattern for multiple finite verbs, a beginning of an initial clause is identified in the text elements of this initial clause. This embodiment of the method according to the invention enhances the resulting segmented text even further in terms of its applicability to subsequent, automated information extraction from unrestricted text and the like.
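A second pass of this kind might look like the following Python sketch. The specification does not list the patterns for multiple finite verbs in this section, so the single pattern used here is purely an assumption: a finite verb, later followed by a coordinating conjunction that is immediately followed by another finite verb, with the new clause beginning placed at the conjunction. The tag names VFIN and CCONJ are likewise hypothetical.

```python
from typing import List, Sequence, Tuple

Token = Tuple[str, str]  # hypothetical (word form, part-of-speech tag)

FINITE = "VFIN"   # assumed tag for finite verbs
COORD = "CCONJ"   # assumed tag for coordinating conjunctions


def split_on_multiple_finite_verbs(clause: Sequence[Token]) -> List[List[Token]]:
    """Split an initial clause at a coordinator that introduces a second finite verb.

    Only the first matching position is split; a fuller version would
    re-examine the remainder.
    """
    seen_finite = False
    for index, (_, tag) in enumerate(clause):
        if tag == FINITE:
            seen_finite = True
        elif (
            tag == COORD
            and seen_finite
            and index + 1 < len(clause)
            and clause[index + 1][1] == FINITE
        ):
            # A new initial clause is taken to begin at the coordinator.
            return [list(clause[:index]), list(clause[index:])]
    return [list(clause)]
```

Under these assumptions, She/PRON sang/VFIN and/CCONJ danced/VFIN is split into "She sang" and "and danced", while a clause with a single finite verb is returned unchanged.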
The method of clause segmentation according to the invention can be used directly to improve the speed and accuracy of sentence processing in text analysis systems for unrestricted text.