In general, natural language processing systems implement various techniques to analyze a natural language text sentence to achieve some level of machine understanding of text input. For example, natural language processing applications typically employ automated morphological, syntactic, and semantic analysis techniques to extract and process grammatical/linguistic features of a natural language text sentence based on rules that define the grammar of the target language. A grammar of a given language defines rules that govern the structure of words (morphology), rules that govern the structure of sentences (syntax), and rules that govern the meanings of words and sentences (semantics).
More specifically, morphological rules of grammar are rules that define the syntactic roles, or POS (parts of speech), that a word may have such as noun, verb, adjective etc. In addition, morphological rules dictate the manner in which words can be modified by adding affixes (i.e., prefixes or suffixes) to generate different, related words. For example, a word can have one of several possible inflections within a given POS category, where each inflection marks a distinct use such as gender, number, tense, person, mood or voice.
The syntax rules of grammar govern proper sentence structure, i.e., the correct sequences of the syntactic categories (POSs). Syntactic analysis is a process by which syntax rules of grammar are used to combine the words of an input text sentence into phrases and combine the phrases (constituents) into a complete sentence. Syntactic analysis is typically performed by constructing one or more hierarchical trees called syntax parse trees. For instance, FIG. 6A depicts an exemplary syntax parse tree for the English language sentence “The man broke the glass” and FIG. 6B depicts an exemplary syntax parse tree for the English language sentence “The man with black hair broke the glass”. Each syntax parse tree includes leaf nodes that represent each word of the input sentence, a single root node (S) that represents the complete sentence, and intermediate-level nodes, such as NP (noun phrase), VP (verb phrase), and PP (prepositional phrase) nodes, between the root and leaf nodes, which are hierarchically arranged and connected based on the syntax rules of grammar.
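As a minimal sketch of the hierarchical structure described above, a syntax parse tree can be modeled as nested tuples whose first element is the node label. The bracketing below is an assumed conventional constituency analysis of “The man broke the glass” and is not reproduced from FIG. 6A; the POS tags DT, NN, and VBD and the helper names `leaves` and `phrases` are illustrative.

```python
# Assumed constituency analysis of "The man broke the glass".
# S, NP, and VP follow the text; DT/NN/VBD are illustrative POS tags.
TREE = ("S",
        ("NP", ("DT", "The"), ("NN", "man")),
        ("VP", ("VBD", "broke"),
               ("NP", ("DT", "the"), ("NN", "glass"))))

def leaves(node):
    """Return the words at the leaf nodes, left to right."""
    label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [children[0]]          # pre-terminal node: (POS, word)
    words = []
    for child in children:
        words.extend(leaves(child))
    return words

def phrases(node, label):
    """Collect the word span of every constituent carrying the given label."""
    found = []
    if node[0] == label:
        found.append(" ".join(leaves(node)))
    for child in node[1:]:
        if isinstance(child, tuple):
            found.extend(phrases(child, label))
    return found

print(leaves(TREE))
print(phrases(TREE, "NP"))
```

The root node S spans the whole sentence, while the two NP nodes group “The man” and “the glass” as constituents.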
The semantic rules of grammar govern the meanings of words and sentences. Semantic analysis is a process by which semantic rules are used to identify the “semantic roles” of a particular syntactic category within the sentence. For example, “subjects” are generally assigned the role of “who” (agent, actor, doer, or cause of the action, and the like), direct objects are assigned the role of “what” (patient, affected, done-to, or effect of the action, and the like), and modifiers can have a variety of roles such as source, goal, time, and the like. Semantic role labeling (SRL) generally refers to a process of assigning appropriate semantic roles to the arguments of a verb, where for a target verb in a sentence, the goal is to identify constituents that are arguments of the verb and then assign appropriate semantic roles to the verb arguments. In linguistics, the “arguments” of a verb are those phrases that are needed in a clause (sentence) to make the clause semantically complete. For example, the verb “give” requires three arguments: (i) a giver, (ii) a taker, and (iii) an object given. In the English text sentence “John gave the book to Mary”, the verb arguments are (i) John (the giver), (ii) Mary (the taker), and (iii) the book (the object given).
Semantic role information of sentence constituents is a crucial component in natural language processing (NLP) and natural language understanding (NLU) applications in which semantic parsing of sentences is needed to understand the grammatical relations between the arguments of natural language predicates and resolve syntactic ambiguity. Indeed, the ability to recognize and label semantic arguments is a key task for answering “Who”, “When”, “What”, “Where”, “Why”, etc., questions in applications such as machine translation, information extraction, natural language generation, question answering, text summarization, etc., which require some form of semantic interpretation.
In general, conventional SRL systems were configured to extract semantic features and assign semantic roles by analyzing the syntactic structure of sentences output from a syntactic parser or other shallow parsing systems trained using syntactic constituent data. The syntactic annotation of a parsed corpus makes it possible to properly identify the subjects and objects of verbs in sentences because certain semantic roles tend to be realized by certain syntactic categories and verb-argument structures. For instance, in the syntax parse tree of FIG. 6A, semantic analysis may identify the noun man as the “subject” of the verb broke and identify “the glass” as the object of the verb broke.
However, conventional methods of semantic role labeling based on pure syntactic parsing are problematic and not capable of representing the full meaning of a sentence. These problems are due to the fact that there can be significant variation in the syntactic structure of the arguments of predicates in a language such as English. In other words, one predicate may be used with different argument structures and one semantic representation may represent different syntactic derivations of surface syntax. In short, identifying semantic roles is difficult because there is no direct mapping between syntax and semantics.
By way of example, consider the following sentences (1) “John broke the window” and (2) “The window broke”. A syntactic analysis will represent “the window” as the direct object of the verb “broke” in sentence (1) and will represent “the window” as the subject in sentence (2). In this regard, the syntactic analysis would not indicate that the window plays the same underlying semantic role of the verb broke in both sentences. Note that both sentences (1) and (2) are in the active voice, and that this alternation between transitive and intransitive uses of the verb does not always occur.
For example, consider the following sentences: (3) “The sergeant played taps” and (4) “The sergeant played”. In sentences (3) and (4), the subject “sergeant” has the same semantic role of the verb “played” in both instances. However, the same verb “played” can also undergo syntactic alternation, as in the following sentence: (5) “Taps played quietly in the background”. Moreover, the role of the verb's direct object can differ even in transitive uses, such as in the following example sentences: (6) “The sergeant played taps” and (7) “The sergeant played a beat-up old bugle.” This alternation in the syntactic realization of semantic arguments is widespread, affecting most verbs in some way, and the patterns exhibited by specific verbs vary widely.
In this regard, while the syntactic annotation of any parsed corpus makes it possible in some instances to identify the subjects and objects of verbs in sentences such as the above examples, and while the parsed corpus may provide semantic function tags such as temporal and locative for certain constituents (generally syntactic adjuncts), the parsed corpus does not necessarily distinguish the different roles played by a verb's grammatical subject or object in the above examples. Again, this is because the same verb used with the same syntactic sub-categorization can assign different semantic roles. As such, semantic role labeling is difficult using pure syntactic parsers, as these parsers are not capable of representing the full, deep semantic meaning of a sentence.
Recently, semantic role labeling systems have been implemented using supervised machine learning techniques to train syntactic parsers using a corpus of words annotated with semantic role labels for each verb argument. For instance, the well-known Proposition Bank (PropBank) project provides a human-annotated corpus of semantic verb-argument relations, where for each verb appearing in the corpus, a set of semantic roles is defined for purposes of providing task-independent semantic representations that do not depend on a given application. With this annotated corpus, the possible labels of arguments are core argument labels ARG[0-5] and modifier argument labels such as ARGM-LOC and ARGM-TMP, for location and temporal modifiers, respectively.
As an example, the entry-specific roles for the verb offer are given as:
Arg0: entity offering
Arg1: commodity
Arg2: price
Arg3: benefactive or entity offered to
The roles are then annotated for every instance of the verb appearing in the corpus, including the following examples:
[ARG0 the company] to offer [ARG1 a 15% to 20% stake] [ARG2 to the public];
[ARG0 Sotheby's] . . . offered [ARG2 the Dorrance heirs] [ARG1 a money-back guarantee];
[ARG1 an amendment] offered by [ARG0 Rep. Peter DeFazio]; and
[ARG2 Subcontractors] will be offered [ARG1 a settlement].
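Bracketed annotations of this kind can be read mechanically. The following is a minimal sketch, assuming every labeled span has the form `[LABEL text]`, that labels contain no spaces, and that spans are not nested; the helper name `parse_roles` is hypothetical and is not part of any PropBank tooling.

```python
import re

# Matches "[ARG0 some text]". Assumes un-nested spans and space-free labels.
SPAN = re.compile(r"\[(\S+)\s+([^\]]+)\]")

def parse_roles(annotated):
    """Map each role label in an annotated sentence to its constituent text."""
    return {label: text.strip() for label, text in SPAN.findall(annotated)}

examples = [
    "[ARG0 the company] to offer [ARG1 a 15% to 20% stake] [ARG2 to the public]",
    "[ARG1 an amendment] offered by [ARG0 Rep. Peter DeFazio]",
    "[ARG2 Subcontractors] will be offered [ARG1 a settlement]",
]

for sentence in examples:
    print(parse_roles(sentence))
```

Because the mapping is keyed by role label rather than by position, the active and passive examples yield directly comparable structures.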
A variety of additional roles are assumed to apply across all verbs. These secondary roles can be considered as adjuncts, rather than arguments. The secondary roles include: Location, Time, Manner, Direction, Cause, Discourse, Extent, Purpose, Negation, Modal, and Adverbial, which are represented in PropBank as “ArgM” with an additional function tag, for example ArgM-TMP for temporal.
A set of roles corresponding to a distinct usage of a verb is called a roleset, and can be associated with a set of syntactic frames indicating allowable syntactic variations in the expression of that set of roles. The roleset with its associated frames is called a Frameset. A polysemous verb may have more than one Frameset, when the differences in meaning are distinct enough to require different sets of roles, one for each Frameset. This lexical resource provides consistent argument labels across different syntactic realizations of the same verb. For example, in the following sentences:
[ARG0 John] broke [ARG1 the window]
[ARG1 The window] broke,
the arguments of the verbs are labeled as numbered arguments (Arg0, Arg1, and so on) according to their specific roles, despite the different syntactic positions of the labeled phrases (the words between brackets). In particular, in the above example, it is recognized that each argument plays the same role (as indicated by the numbered label Arg) in the meaning of the particular sense of the verb broke. These phrases are called “constituents” of semantic roles. In this example, the constituent [the window] is recognized as the verb's object in both sentences.
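The contrast between surface syntax and numbered argument labels can be sketched as follows. The grammatical functions and labels are taken from the discussion above; the data layout and the helper name `functions_and_labels` are assumptions made only for illustration.

```python
# Each constituent is paired with its surface grammatical function and
# its PropBank-style numbered argument label, as described in the text.
ANALYSES = {
    "John broke the window": [
        ("John", "subject", "ARG0"),
        ("the window", "direct object", "ARG1"),
    ],
    "The window broke": [
        ("The window", "subject", "ARG1"),
    ],
}

def functions_and_labels(phrase):
    """Collect every grammatical function and role label observed for a phrase."""
    target = phrase.lower()
    functions, labels = set(), set()
    for constituents in ANALYSES.values():
        for text, function, label in constituents:
            if text.lower() == target:
                functions.add(function)
                labels.add(label)
    return functions, labels

functions, labels = functions_and_labels("the window")
print(sorted(functions))  # grammatical function varies across the two sentences
print(sorted(labels))     # the numbered argument label stays the same
```

The phrase occupies two different syntactic positions but carries a single argument label, which is the consistency the annotated corpus is intended to provide.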
In the following example sentence, “Mr. Bush met him privately, in the White House, on Thursday”, functional tags are assigned to all modifiers of the verb “met”, such as manner (MNR), locative (LOC), temporal (TMP):
Rel: met
Arg0: Mr. Bush
Arg1: him
ArgM-MNR: privately
ArgM-LOC: in the White House
ArgM-TMP: on Thursday
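The labeled proposition above can be held in a simple mapping, from which core arguments and modifiers are easily separated. This is a minimal sketch; the dictionary layout and the helper name `split_labels` are assumptions, and the label spellings follow the listing above.

```python
SENTENCE = "Mr. Bush met him privately, in the White House, on Thursday"

# Predicate ("Rel"), numbered arguments, and ArgM modifiers with function tags.
PROPOSITION = {
    "Rel": "met",
    "Arg0": "Mr. Bush",
    "Arg1": "him",
    "ArgM-MNR": "privately",
    "ArgM-LOC": "in the White House",
    "ArgM-TMP": "on Thursday",
}

def split_labels(proposition):
    """Separate numbered core arguments from ArgM modifiers keyed by function tag."""
    core, modifiers = {}, {}
    for label, span in proposition.items():
        if label.startswith("ArgM-"):
            modifiers[label.split("-", 1)[1]] = span   # e.g. "TMP"
        elif label.startswith("Arg"):
            core[label] = span
    return core, modifiers

core, modifiers = split_labels(PROPOSITION)
print(core)
print(modifiers)
```

Every labeled span is a contiguous substring of the input sentence, so the annotation can be checked against the text with a plain substring test.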
Recently, techniques have been proposed for automatic semantic role labeling on English and Chinese texts using parsers trained on a corpus of manually annotated semantic role labels. For English language text, the input to the SRL system is a sequence of white-space delimited words, where each verb is represented by a white-space delimited word and a constituent is represented as a sequence of white-space delimited words, and where punctuation marks and special characters are assumed to be separated from the words. The proposed SRL systems are configured to predict a semantic role label for each white-space delimited verb and each constituent (sequence of white-space delimited words). For Chinese text sentences, the proposed SRL systems are configured to process the input text sentence at the character level.
The ability to implement automated semantic role labeling systems for languages with high morphology such as Hebrew, Maltese, German, Arabic, etc., is highly problematic. For instance, Arabic is a Semitic language with rich templatic morphology where an Arabic word may be composed of a stem (consisting of a consonantal root and a template), or a stem plus one or more affixes (prefix or suffix) attached to the beginning and/or end of the stem. These affixes include inflectional markers for tense, gender, and/or number, as well as prepositions, conjunctions, determiners, possessive pronouns and pronouns, for example. In this regard, Arabic white-space delimited words may be composed of zero or more prefixes, followed by a stem and zero or more suffixes.
The complex morphology of Arabic and other languages presents challenges with respect to natural language processing applications, and SRL approaches employed for English and Chinese texts, which process input text at the word or character level, are not necessarily extendable to such morphologically complex languages. Indeed, since Arabic white-space delimited words, for example, may be composed of multiple prefixes, a stem, and multiple suffixes, important morphological information can be missed if Arabic text is processed at the word or character level, as is done for English and Chinese, resulting in poor performance.
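The prefix-stem-suffix composition described above can be illustrated with a deliberately naive sketch. It uses the Buckwalter-transliterated word `wsyktbwnhA`, commonly glossed as “and they will write it”. The tiny affix lists, the minimum stem length, and the greedy peeling strategy are all assumptions made for illustration; this is not a real morphological analyzer and would mis-segment many words.

```python
from dataclasses import dataclass

# Toy affix inventories in Buckwalter transliteration (assumed, far from complete).
PREFIXES = ["w", "s", "y"]   # conjunction, future marker, imperfective prefix
SUFFIXES = ["hA", "wn"]      # object pronoun, masculine plural marker

@dataclass
class Segmentation:
    prefixes: list
    stem: str
    suffixes: list

    def tokens(self):
        """Render segments with '+' marking the side attached to the stem."""
        return ([p + "+" for p in self.prefixes] + [self.stem]
                + ["+" + s for s in self.suffixes])

def segment(word, min_stem=3):
    """Greedily peel known prefixes, then suffixes, keeping a minimal stem."""
    prefixes, suffixes, rest = [], [], word
    peeled = True
    while peeled:
        peeled = False
        for p in PREFIXES:
            if rest.startswith(p) and len(rest) - len(p) >= min_stem:
                prefixes.append(p)
                rest = rest[len(p):]
                peeled = True
                break
    peeled = True
    while peeled:
        peeled = False
        for s in SUFFIXES:
            if rest.endswith(s) and len(rest) - len(s) >= min_stem:
                suffixes.insert(0, s)      # keep suffixes in surface order
                rest = rest[:-len(s)]
                peeled = True
                break
    return Segmentation(prefixes, rest, suffixes)

word = "wsyktbwnhA"
print(word.split())            # white-space tokenization sees a single word
print(segment(word).tokens())  # segment-level view exposes the affixes
```

White-space tokenization yields one opaque token, whereas the segment-level view separates the conjunction, tense and agreement markers, and the attached pronoun from the stem.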