The background description provided herein is for the purpose of generally presenting the context of the disclosure. Work of the presently named inventors, to the extent it is described in this background section, as well as aspects of the description that may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against the present disclosure.
Dependency parsing is typically modeled as a pipeline of independent tasks: (1) tokenization, (2) part-of-speech (POS) tagging, and (3) parsing. Tokenization involves partitioning an input string of characters into a set of tokens, e.g., words. POS tagging involves assigning a POS tag to each token. Parsing involves determining a syntactic head of each token and building a parse tree that represents relationships between tokens. Tokenization can be difficult for languages that do not use white space to separate words (Chinese, Japanese, Korean, etc.) and for languages that have a rich morphology (Arabic, Hebrew, Turkish, etc.). Moreover, if a fixed POS tagging is selected and then treated as ground truth by the parser, the parser cannot correct any errors in that tagging. Instead, these errors propagate through the pipeline and cause errors at the parser.
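By way of illustration only, the following minimal Python sketch shows how a tagging error can propagate through such a pipeline. It is not an implementation from the present disclosure. The lexicon, the function names (`tokenize`, `tag`, `parse`, `pipeline`), and the single-rule parser are all hypothetical and deliberately simplistic: the tagger assigns one fixed tag per word, and the parser treats those tags as ground truth.

```python
from typing import List, Tuple

# Hypothetical toy lexicon: each word maps to a single fixed tag.
# "man" is listed only as a noun, although it can also be a verb.
LEXICON = {"sailors": "NOUN", "man": "NOUN", "the": "DET", "boats": "NOUN"}


def tokenize(text: str) -> List[str]:
    """Stage 1: split on white space (unsuitable for unsegmented scripts)."""
    return text.lower().split()


def tag(tokens: List[str]) -> List[str]:
    """Stage 2: assign one fixed POS tag per token; unknown words default to NOUN."""
    return [LEXICON.get(token, "NOUN") for token in tokens]


def parse(tokens: List[str], tags: List[str]) -> List[int]:
    """Stage 3: return the head index of each token, with -1 marking the root.

    Toy rule: the first VERB is the root and every other token attaches to it.
    If no VERB was tagged, the parser falls back to the first token as root.
    The tags are trusted as given and never revised.
    """
    root = tags.index("VERB") if "VERB" in tags else 0
    return [-1 if i == root else root for i in range(len(tokens))]


def pipeline(text: str) -> Tuple[List[str], List[str], List[int]]:
    """Run the three stages in sequence, each consuming the previous output."""
    tokens = tokenize(text)
    tags = tag(tokens)
    heads = parse(tokens, tags)
    return tokens, tags, heads


if __name__ == "__main__":
    tokens, tags, heads = pipeline("Sailors man the boats")
    print(tokens, tags, heads)
    # Same parser, but given the tags a human annotator would assign.
    print(parse(tokens, ["NOUN", "VERB", "DET", "NOUN"]))
```

In this sketch, the tagger labels "man" as a noun, so the parser finds no verb and selects "sailors" as the root. Supplying the correct tags to the same parser yields "man" as the root, which suggests that the parsing error originates in the tagging stage rather than in the parser itself.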