With the advent of the printing press, typesetting, typewriting machines, computer-implemented word processing and mass data storage, the amount of information generated by mankind has risen dramatically and at an ever-quickening pace. As a result, there is a continuing and growing need to collect and store, identify, track, classify and catalogue for retrieval and distribution this growing sea of information. One traditional form of cataloging and classifying information, e.g., books and other writings, is the Dewey Decimal System. Increasingly, the world's economies and supporting infrastructures, including research systems, are becoming global in nature and, as systems allow for cross-lingual searching, the information available to researchers continues to expand. A growing field of research and development is in the area of extracting relationships and other metadata about documents based on terms or patterns or discerned attributes among documents in large databases. By deriving relationship information, systems can draw conclusions and connections between documents, authors, subjects and events that aid users in researching and other efforts.
In many areas and industries, including the financial and legal sectors and areas of technology, for example, there are content and enhanced experience providers, such as The Thomson Reuters Corporation. Such providers identify, collect, analyze and process key data for use in generating content, such as law-related reports, articles, etc., for consumption by professionals and others involved in the respective industries, e.g., lawyers, accountants, researchers. Providers in the various sectors and industries continually look for products and services to provide subscribers, clients and other customers and for ways to distinguish their firms over the competition. Such providers strive to create and provide enhanced tools, including search and ranking tools, to enable clients to more efficiently and effectively process information and make informed decisions.
For example, with advancements in technology and sophisticated approaches to searching across vast amounts of data and documents, e.g., database of legal documents or records, published articles or papers, etc., professionals and other users increasingly rely on mathematical models and algorithms in making professional and business determinations. Existing methods for applying search terms across large databases of documents have room for considerable improvement as they frequently do not adequately focus on the key information of interest to yield a focused and well ranked set of documents to most closely match the expressed searching terms and data. Although such computer-based systems have shortcomings, there has been significant advancement over searching, identifying, filtering and grouping documents by hand, which is prohibitively time-intensive, costly, inefficient, and inconsistent.
Search engines are used to retrieve documents in response to user-defined queries or search terms. To this end, search engines may compare the frequency of terms that appear in one document against the frequency of those terms as they appear in other documents within a database or network of databases. This aids the search engine in determining the respective “importance” of the different terms within the document, and thus determining the best matching documents to the given query. One method for comparing terms appearing in a document against a collection of documents is called Term Frequency-Inverse Document Frequency (TFIDF or TF-IDF). In this method, the term frequency (the count of a term as a fraction of all terms within a subject document) is multiplied by the inverse document frequency (the logarithm of the inverse of the fraction of documents in the corpus in which that term appears). More specifically, TFIDF assigns a weight as a statistical measure used to evaluate the importance of a word to a document in a collection of documents or corpus. The relative “importance” of the word increases proportionally to the number of times or “frequency” such word appears in the document. The importance is offset or compared against the frequency of that word appearing in documents comprising the corpus. The inverse document frequency component of TFIDF is expressed as log(N/n(q)), where q is the query term, N is the number of documents in the collection and n(q) is the number of documents containing q. TFIDF and variations of this weighting scheme are typically used by search engines, such as Google, as a way to score and rank a document's relevance given a user query. Generally, for each term included in a user query, the document may be ranked in relevance based on summing the scores associated with each term. The documents responsive to the user query may be ranked and presented to the user based on relevancy as well as other determining factors.
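By way of illustration only, the weighting and per-term summation described above may be sketched as follows. The function name tf_idf_scores, the whitespace tokenization, the natural logarithm and the three-document toy corpus are assumptions made for this sketch and are not drawn from any particular search engine.

```python
import math
from collections import Counter


def tf_idf_scores(query_terms, documents):
    """Score each document by summing the TF-IDF weight of every query term."""
    n_docs = len(documents)
    # Assumption: lowercase whitespace tokenization stands in for real text analysis.
    tokenized = [doc.lower().split() for doc in documents]
    # n(q): the number of documents containing query term q.
    doc_freq = {q: sum(1 for toks in tokenized if q in toks) for q in query_terms}

    scores = []
    for toks in tokenized:
        counts = Counter(toks)
        score = 0.0
        for q in query_terms:
            if doc_freq[q] == 0 or not toks:
                continue  # a term absent from the corpus contributes nothing
            tf = counts[q] / len(toks)             # term count over all terms in the document
            idf = math.log(n_docs / doc_freq[q])   # log(N / n(q))
            score += tf * idf
        scores.append(score)
    return scores


corpus = [
    "the court granted the motion",
    "the motion was denied",
    "the weather was mild",
]
scores = tf_idf_scores(["motion", "court"], corpus)
# Document indices ordered from most to least relevant.
ranking = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

In this toy corpus the first document contains both query terms and ranks first, while the third contains neither and receives a score of zero.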
With regard to training an SVM, Published Pat. App. US2007/0282766 (Hartman et al.) entitled Training a Support Vector Machine With Process Constraints, which is hereby incorporated herein in the entirety, describes a system and method for training a support vector machine (SVM) and particularly a model (primal or dual formulation) implemented with an SVM and representing a plant or process with one or more known attributes. Process constraints that correspond to the known attributes are specified, and the model is trained subject to the one or more process constraints. The model includes one or more inputs and one or more outputs, as well as one or more gains, each a respective partial derivative of an output with respect to a respective input. In the manner described, the trained model may be used to control or manage the plant or process.
More particularly in Natural Language Processing (NLP) pursuits, the rhetorical relations that hold between clauses in discourse 1) minimally index temporal and event information, and 2) contribute to a discourse's pragmatic coherence (Andrew Kehler, Coherence, Reference, and the Theory of Grammar, CSLI Publications, Stanford, Calif., 2002; Jerry R. Hobbs, On The Coherence and Structure of Discourse, CSLI Technical Report, CSLI-85-37, 1985). From an NLP perspective, being able to recover the discourse structure of a text has been motivated by the improvement it affords to discourse processing tasks such as natural language generation (Eduard H. Hovy, Automated Discourse Generation Using Discourse Structure Relations, Artificial Intelligence 63, 341-385, 1993) and text summarization (Daniel Marcu, Improving Summarization Through Rhetorical Parsing Tuning, Proceedings of The 6th Workshop on Very Large Corpora, 206-215, 1998). In a 2002 paper, Schilder describes a simple discourse parsing and analysis algorithm that combines a formal under-specification utilizing discourse grammar with Information Retrieval (IR) techniques (Frank Schilder, Robust Discourse Parsing via Discourse Markers, Topicality and Position, Natural Language Engineering, 2002, Vol. 8, Issue 2-3, pages 235-255). The Kehler, Hobbs, Hovy, Marcu and Schilder papers, articles and publications cited hereinabove are incorporated herein by reference in the entirety.
As described at the http://www.seas.upenn.edu/~pdtb website, the Penn Discourse Treebank (PDTB) is a large-scale corpus annotated with information related to discourse structure and discourse semantics. While there are many aspects of discourse that are crucial to a complete understanding of natural language, the PDTB focuses on encoding discourse relations. The annotation methodology follows a lexically-grounded approach. The PDTB has strived to maintain a theory-neutral approach with respect to the nature of high-level representation of discourse structure, in order to allow the corpus to be usable within different theoretical frameworks. Theory-neutrality is achieved by keeping annotations of discourse relations “low-level”: each discourse relation is annotated independently of other relations, that is, dependencies across relations are not marked.
The PDTB is a project aimed at supporting the extraction of a range of inferences associated with discourse relations, for a wide range of NLP applications, such as parsing, information extraction, question-answering, summarization, machine translation, generation, as well as corpus based studies in linguistics and psycholinguistics. The PDTB project also aims to conduct empirical research with the PDTB corpus, for NLP as well as theoretical linguistics. Discourse relations in the current version of the PDTB are taken to be triggered by explicit phrases or by structural adjacency. Each relation is further annotated for its two abstract object arguments, the sense of the relation, and the attributions associated with the relation and each of its two arguments. The annotations in the PDTB are aligned with the syntactic constituency annotations of the Penn Treebank.
Two documents that describe the PDTB-2.0 corpus and PDTB annotation guidelines, annotation format, and summary distributions are: 1) Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind Joshi and Bonnie Webber, The Penn Discourse Treebank 2.0, Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC), Marrakech, Morocco; and 2) The PDTB Research Group. 2008, The PDTB 2.0. Annotation Manual, Dec. 17, 2007, both available at the http://www.seas.upenn.edu/~pdtb website and incorporated herein by reference in the entirety.
Focusing on the PDTB, the ability to predict rhetorical relations explicitly cued with a discourse marker (45% of the annotated relations in the PDTB) is very straightforward from a machine learning perspective. For example, Emily Pitler, Mridhula Raghupathy, Hena Mehta, Ani Nenkova, Alan Lee and Aravind Joshi, Easily Identifiable Discourse Relations, Proceedings of the 22nd International Conference on Computational Linguistics (COLING-08), 2008, achieved a 93.09% four-way accuracy based on the explicit marker alone (predicting the four rhetorical relation classes TEMPORAL, EXPANSION, COMPARISON and CONTINGENCY). The Pitler (2008) paper cited hereinabove is incorporated herein by reference in the entirety.
Consider (1):
    Example (1)
        a. Pascale finished Fox in Sox.
        b. Then she walked to the bookcase to get The Cat in the Hat,
        c. which is her favorite book.
        d. But the book was too high to reach.
        e. So she grabbed Green Eggs and Ham.
In (1), the NARRATION (or TEMPORAL.SYNCHRONOUS.SUCCESSION in the PDTB) relation holds between the actions in (1a-b) as (1b) follows (1a) at event time. The EXPANSION relation, providing more information about Pascale and The Cat in the Hat, holds between (1b-c). (1c) is temporally inclusive (subordinated) with (1b); there is no temporal progression at event time. The CONTRAST relation (1c-d) is temporally inclusive as well and sets an expectation for a RESULT relation which holds between (1d-e), temporally following the event progression in (1a-b).
The correspondence of these relations to the explicit discourse markers (e.g., then (1b), which (1c), but (1d) and so (1e)) is both obvious (i.e., part of the pragmatic system of English) and systematic. However, in the absence of an explicit marker, rhetorical relations must be inferred either from the content of clauses themselves (e.g., what is described and how) or some pragmatic phenomenon (e.g., clause position relative to other clauses, variance in specificity of reference, etc.). To illustrate, consider (2):
    Example (2)
        a. Pascale finished Fox in Sox.
        b. She walked to the bookcase to get The Cat in the Hat,
        c. Her favorite book.
        d. The book was too high to reach.
        e. She grabbed Green Eggs and Ham.
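The contrast between Examples (1) and (2) may be sketched, purely for illustration, as a lookup keyed on a clause-initial marker. The marker table holds only the four marker and relation pairs discussed for Example (1); the names MARKER_RELATIONS and explicit_relation are hypothetical, and a practical system would need a far larger marker inventory and would have to disambiguate markers that carry more than one sense.

```python
import re

# Marker-to-relation pairs taken from the discussion of Example (1).
# Illustrative only: this is nowhere near a complete marker inventory.
MARKER_RELATIONS = {
    "then": "NARRATION",
    "which": "EXPANSION",
    "but": "CONTRAST",
    "so": "RESULT",
}


def explicit_relation(clause):
    """Return the relation cued by a clause-initial marker, or None if there is none."""
    match = re.match(r"[A-Za-z]+", clause.strip())
    if match is None:
        return None
    return MARKER_RELATIONS.get(match.group(0).lower())


example_1 = [
    "Pascale finished Fox in Sox.",
    "Then she walked to the bookcase to get The Cat in the Hat,",
    "which is her favorite book.",
    "But the book was too high to reach.",
    "So she grabbed Green Eggs and Ham.",
]
example_2 = [
    "Pascale finished Fox in Sox.",
    "She walked to the bookcase to get The Cat in the Hat,",
    "Her favorite book.",
    "The book was too high to reach.",
    "She grabbed Green Eggs and Ham.",
]

relations_1 = [explicit_relation(clause) for clause in example_1]
relations_2 = [explicit_relation(clause) for clause in example_2]
print(relations_1)
print(relations_2)
```

With the markers present, every clause after the first maps to a relation; with the markers removed, the lookup returns nothing for any clause, which is the situation in which the relation has to be inferred by other means.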
If markers are missing, the rhetorical structure (progression of relations) between (1) and (2) is arguably similar and open to wider interpretation, but recoverable. In the PDTB, the ability to predict implicit relations (39% of the annotated relations) has proven to be quite difficult compared to their explicit counterparts. For example, (Emily Pitler, Annie Louis and Ani Nenkova. 2009. Automatic Sense Prediction for Implicit Discourse Relations in Text. In Proceedings of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL-IJCNLP-09) 683-691; hereinafter Pitler (2009)) and (Zhi-Min Zhou and Yu Xu and Zheng-Yu Niu and Man Lan and Jian Su and Chew Lim Tan. 2010. Predicting Discourse Connectives for Implicit Discourse Relation Recognition. In Proceedings of the 2010 International Conference on Computational Linguistics, Poster Volume, 1507-1514; hereinafter Zhou (2010)) achieve between a 36.24 and 40.88 macro-F1 for four rhetorical relation classes based on 10-12 features. This is a significant increase in complexity for mediocre performance. Both Pitler (2009) and Zhou (2010) are incorporated herein by reference in the entirety.
The following is background on discourse structure, the PDTB and the current state of implicit relation prediction. There are several different theories of rhetorical relations and the structure of texts (e.g., Discourse Structure Theory (Grosz and Sidner, 1986), Rhetorical Structure Theory (“RST”) (Mann and Thompson, 1987) and Segmented Discourse Representation Theory (“SDRT”) (Asher and Lascarides, 2003)). Depending on the theory, there can be a range of theoretically informed predetermined relations (e.g., RST contains roughly 30 relations whereas SDRT contains only about 12). However, any given inventory of rhetorical relations covers the same type of pragmatic phenomenon with varying degrees of specificity and generality. For example, RST contains VOLITIONAL and NON-VOLITIONAL CAUSE relations whereas SDRT only has CAUSE. Previous machine learning tasks related to these theories report a wide range of prediction accuracy for all target rhetorical relations combined: 49.70% (6-way classifier) (Daniel Marcu and Abdessamad Echihabi. 2002. An Unsupervised Approach to Recognizing Discourse Relations. In Proceedings of the Association of Computational Linguistics (ACL-02) 2002, 368-375; hereinafter Marcu (2002)); 57.55% (5-way) (Caroline Sporleder and Alex Lascarides. 2005. Exploiting Linguistic Cues to Classify Rhetorical Relations. In Proceedings of Recent Advances in Natural Language Processing (RANLP-05), 532-539; hereinafter Sporleder (2005)); and 70.70% (8-way, sentence-internal relations) (Mirella Lapata and Alex Lascarides. 2004. Inferring Sentence Internal Temporal Relations. In Proceedings of the North American Association of Computational Linguistics (NAACL-04) 2004, 153-160; hereinafter Lapata (2004)). Accuracy has also been reported for individual relations, e.g., CONTRAST (43.64%), CONDITION (69%) and ELABORATION (82%) (Sporleder (2005)). The Grosz et al., Mann et al., Asher et al., Marcu (2002), Sporleder et al., and Lapata et al. papers, articles and publications cited hereinabove are incorporated herein by reference in the entirety.
For purposes of describing the background efforts, “rhetorical relations” may be used interchangeably with “sense” (and indicated with SMALL CAPS) as this is the preferred term in the PDTB. The PDTB draws inspiration from the previously mentioned theories of discourse, but does not adopt a specific framework. Rather, the PDTB centrally relies upon the ability of humans to recognize (and agree to) senses whether indexed explicitly with a discourse marker or not (implicit).
There are over 40 senses assignable in the PDTB which exist in a collapsible hierarchy. At the highest (Class) level, there are 4 senses: TEMPORAL, CONTINGENCY, COMPARISON and EXPANSION. One level down (Type), there are 16 additional senses. At the lowest (Subtype) level, there are 23 additional senses. For sake of space, the full hierarchy is not presented here (see generally Prasad et al., 2008), but the hierarchy is expressed in the sense name as CLASS.TYPE.SUBTYPE. An example PDTB annotation from WSJ0790 is in Example (3):
    Example (3)
        a. Explicit, but, COMPARISON.CONTRAST
           As a critique of middle-class mores, the story is heavy-handed but its unsentimental sketches of Cairo life are vintage Mahfouz
        b. . . .
        c. Implicit, because, CONTINGENCY.CAUSE.REASON
           The prose is closer to Balzac's “Pere Goriot” than it is to “Arabian Nights” (because) Mahfouz began writing when there was no novelistic tradition in Arabic
In Example (3), each PDTB annotation, which holds between two spans of text (Arg1, Arg2), indicates: whether the relation is Explicit (3a) or Implicit (3c); the actual discourse marker if the relation is explicit or, if it is implicit, an adjudicated marker that captures the relation (because in (3c)); the sense label to its appropriate Class, Type or Subtype level; and the related text spans. Alternative Lexicalizations (AltLex), No Relations (NoRel) and Entity Relations (EntRel) are also annotated in the PDTB but are not considered in this description, as it is assumed that there is always a relation between clauses and that entity relations are part and parcel of the pragmatic determination of the rhetorical relation. The Source, Type, Determinacy and Scopal Polarity attributions of the arguments are also given in the PDTB annotation but are not included in the description herein.
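For illustration only, the annotation fields just described and the collapsing of a CLASS.TYPE.SUBTYPE sense name to a coarser level may be sketched as follows. The class RelationAnnotation and the function collapse_sense are hypothetical names chosen for this sketch and do not reflect the PDTB distribution format or any PDTB tooling.

```python
from dataclasses import dataclass

LEVELS = ("class", "type", "subtype")


def collapse_sense(sense, level):
    """Truncate a CLASS.TYPE.SUBTYPE sense name to the requested hierarchy level."""
    depth = LEVELS.index(level) + 1
    # A sense annotated only to a coarser level is returned unchanged.
    return ".".join(sense.split(".")[:depth])


@dataclass
class RelationAnnotation:
    relation_type: str  # "Explicit" or "Implicit"
    marker: str         # the actual marker, or the adjudicated marker if implicit
    sense: str          # sense label at its annotated level
    arg1: str           # first text span
    arg2: str           # second text span


# Fields mirror Example (3c); the span text is abbreviated for the sketch.
annotation = RelationAnnotation(
    relation_type="Implicit",
    marker="because",
    sense="CONTINGENCY.CAUSE.REASON",
    arg1="The prose is closer to Balzac's Pere Goriot than it is to Arabian Nights",
    arg2="Mahfouz began writing when there was no novelistic tradition in Arabic",
)

print(collapse_sense(annotation.sense, "class"))
print(collapse_sense(annotation.sense, "type"))
```

Collapsing in this way is what allows a four-way Class-level prediction task to be derived from annotations made at the Type or Subtype level.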
As mentioned in Section 1.0, Pitler et al. (2008) report results for the four PDTB Class senses and, based solely on the type of explicit marker, achieve a 93.09% four-way accuracy. The fact that there is a highly systematic relationship between discourse markers and the conveyed pragmatic relationship suggests that being able to determine a rhetorical relation in the absence of the marker, i.e., based on the surface content coupled with an individual's ability to draw inferences and make assumptions about discourse structure, is a computationally difficult task.
Pitler et al.'s (2009) system relies on ten different feature sets:
(1) Sentiment polarity tags between spans of text (hereinafter “Arg1” and “Arg2”);
(2) “Inquirer” tags from the General Inquirer lexicon (Philip J. Stone and Dexter C. Dunphy and Marshall S. Smith and Daniel M. Ogilvie. 1996. The General Inquirer: A Computer Approach to Content Analysis. MIT Press, Cambridge, Mass.; hereinafter Stone et al. (1996)), which provides finer-grained distinctions for polarity and some semantic and pragmatic characterizations;
(3) Reference to money, percentages or numbers, potentially indicating a comparison;
(4) Ranked text unigrams and bigrams most likely associated with a given relation from the PDTB implicit training set;
(5) Ranked text unigrams and bigrams most likely associated with a given relation from an explicitly marked training set, the TextRels corpus (Sasha Blair-Goldensohn and Kathleen R. McKeown and Owen C. Rambow. 2007. Building and Refining Rhetorical-Semantic Relation Models. In Proceedings of NAACL-HLT (NAACL 2007), 428-435; hereinafter Blair-Goldensohn et al. (2007));
(6) Verb classifications (Beth Levin. 1993. English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press, Chicago, Ill.; hereinafter Levin (1993)) and their association with different relations;
(7) The first and last words of a relation's arguments as well as the first three words (following Ben Wellner and James Pustejovsky and Catherine Havasi and Anna Rumshisky and Roser Sauri. 2006. Classification of Discourse Coherence Relations: An Exploratory Study using Multiple Knowledge Sources. In Proceedings of the 7th SIGdial Workshop on Discourse and Dialogue, 117-125; hereinafter Wellner et al. (2006));
(8) The presence or absence of a modal verb, specific modal verbs and their cross-product;
(9) Whether or not the implicit relation immediately follows or precedes an explicit relation (following Pitler et al. (2008)); and
(10) Different variations of word pair models trained on the TextRels, PDTB implicit and explicit training sets. For example, among the word pairs contributing the highest information gain for a given relation, the-but, of-but and to-but strongly associate with COMPARISON, whereas the-and and a-and strongly associate with CONTINGENCY.
The Stone, Blair-Goldensohn, Levin, and Wellner papers are hereby incorporated herein by reference in the entirety.
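Two of the surface feature sets listed above, the first, last and first three words of each argument (7) and cross-argument word pairs (10), may be sketched as follows. This is a simplified illustration and not the cited authors' implementation: the function names, the whitespace tokenization and the hyphenated pair format are assumptions, and the ranking of pairs by information gain is omitted.

```python
from itertools import product


def first_last_first3(arg1, arg2):
    """Feature set (7): first word, last word and first three words of each argument."""
    features = {}
    for name, text in (("arg1", arg1), ("arg2", arg2)):
        tokens = text.lower().split()
        features[f"{name}_first"] = tokens[0] if tokens else ""
        features[f"{name}_last"] = tokens[-1] if tokens else ""
        features[f"{name}_first3"] = " ".join(tokens[:3])
    return features


def word_pairs(arg1, arg2):
    """Word pairs formed from one token of Arg1 and one token of Arg2."""
    tokens1 = arg1.lower().split()
    tokens2 = arg2.lower().split()
    return {f"{w1}-{w2}" for w1, w2 in product(tokens1, tokens2)}


# Arguments adapted from clauses (d) and (e) of Example (2), punctuation removed.
arg1 = "The book was too high to reach"
arg2 = "She grabbed Green Eggs and Ham"

features = first_last_first3(arg1, arg2)
pairs = word_pairs(arg1, arg2)
print(features)
print(len(pairs))
```

Even for two short clauses the word-pair set is large (every token of one argument paired with every token of the other), which is one reason such features are typically ranked or pruned before being used for classification.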
All of these features are designed to get at pragmatic information via surface text and associated semantic information. In four binary classification tasks (i.e., COMPARISON or not, etc.), the best feature combination is the use of first and last words as well as the first three words (Naive Bayes). The macro-F1 for the four binary classifiers based on this feature is 34.23. Individual relation F1s are: COMPARISON=21.01; CONTINGENCY=36.75; EXPANSION=63.22; TEMPORAL=15.93. By adding different combinations of word-pair relations, performance improved for different relations in the binary classification tasks, raising the macro-F1 by about 6 points to 40.56.
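The reported macro-F1 is consistent with an unweighted mean of the four per-relation F1 scores, as the following short check illustrates. The per-relation values and the 40.56 figure are those quoted above; the variable and function names are chosen for this sketch only.

```python
# Per-relation F1 scores quoted above for the first/last/first-three-words feature.
f1_by_relation = {
    "COMPARISON": 21.01,
    "CONTINGENCY": 36.75,
    "EXPANSION": 63.22,
    "TEMPORAL": 15.93,
}


def macro_f1(per_class_f1):
    """Macro-F1: the unweighted mean of the per-class F1 scores."""
    return sum(per_class_f1.values()) / len(per_class_f1)


baseline = macro_f1(f1_by_relation)
# Improvement reported after adding word-pair features (macro-F1 of 40.56).
gain = 40.56 - baseline

print(round(baseline, 2))
print(round(gain, 2))
```

Because every class carries equal weight, the comparatively strong EXPANSION score does not mask the weak TEMPORAL and COMPARISON scores, which is why the macro-F1 stays in the mid-30s.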
Lin et al. (2009) rely on more consolidated features: (1) Contextual features focused on argument embedding between the previous, current and next arguments; (2) Syntactic constituent parses; (3) Dependency parses (using the Stanford parser (de Marneffe et al., 2006)); and (4) Stemmed word pairs from Arg1 and Arg2 in the PDTB. Both the Class and Type levels of relations are predicted using these features. The best individual feature performance (OpenNLP MaxEnt) at the Class level is 30.3-32.9% for the word pairs. Combining all features returns 35.0-40.2% accuracy at the Class level. At the Type level, Lin et al.'s system was able to predict 7 of 11 relations. While the prediction of the 7 of 11 Type relations averages to a 40% micro-average, the macro-F1 is 20.36. Zhou et al. (2010) use a combination of features from Pitler et al. (2009), Lin et al. (2009) and intra-argument word pairs (Saito et al., 2006). Zhou et al.'s system makes predictions at the Class level using four linear SVMs from LibSVM (Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2(3), 27:1-27:27; hereinafter Chang et al. (2011)). Macro-F1 is similar: 40.88, which is 4% better than Pitler et al.'s best single-feature classifier (34.23-36.24), and 42.34, which is 2% better than Pitler et al.'s best combined system (40.56). The Lin, de Marneffe, Chang, and Zhou papers are hereby incorporated herein by reference in the entirety.
In sum, for predicting implicit relations in the PDTB, the state-of-the-art research returns macro-F1s that top out at a little more than 40% if different feature and classifier performances are combined, and in the mid-30% range for single feature set results. Further, all of the features are based on detecting semantic (and some syntactic) information on the assumption that it systematically co-varies with pragmatic rhetorical relations. Like many tasks attempting to predict the same, sensibly relying on the available text shows small incremental improvement over time, but within a window that, overall, runs counter to being able to actually use discourse structure information in downstream NLP tasks (Lin et al., 2009). The next section presents the methodology for our experiments, which duplicate (and in some cases exceed) these results with significantly fewer (but higher-dimensional) features, both in terms of amount and processing effort.