In corpus linguistics, part-of-speech (POS) tagging is the process of marking up each word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context, i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph. It is a necessary pre-processing step for many natural language processing (NLP) tasks. Because POS tags augment the information contained within words by explicitly indicating some of the structures inherent in language, their accuracy is often critical to downstream NLP applications. For example, in concatenative text-to-speech (TTS) synthesis, prosody modeling relies heavily on POS tags, which greatly influence how natural synthetic speech sounds. It is therefore crucial that they be correct.
With the growing availability of NLP training resources in recent years, POS tagging has increasingly involved some form of data-driven processing. State-of-the-art models based on conditional random fields (CRFs), for instance, are trained to identify the most likely sequence of tags for the observed sequence of words in a given sentence. These models rely on feature functions acting as marginal constraints to ensure that important characteristics of the empirical training distribution are reflected in the trained model. With well-chosen functions covering sufficiently rich features of the training data, and given adequate initial conditions, CRF taggers can achieve a very high level of tag accuracy on general NLP corpora.
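As a rough illustration of what such feature functions observe, the following Python sketch extracts a few per-word features from a tokenized sentence. The function name word_features and the particular features chosen (lowercased form, three-letter suffix, capitalization, neighboring words) are assumptions made for this example only; they are not the feature set of any specific tagger.

```python
def word_features(words, i):
    """Return illustrative feature values for the word at position i.

    The feature names below are hypothetical; a real tagger would define
    its own set and learn one weight per feature/tag combination.
    """
    word = words[i]
    return {
        "lower": word.lower(),
        "suffix3": word[-3:].lower(),
        "is_capitalized": word[:1].isupper(),
        # Context features: the adjacent words, with boundary markers.
        "prev_word": words[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": words[i + 1].lower() if i < len(words) - 1 else "<EOS>",
    }


sentence = ["She", "is", "coming", "tomorrow"]
features = [word_features(sentence, i) for i in range(len(sentence))]
```

Each dictionary describes one word together with its immediate context, which is the kind of evidence a trained model would weigh when choosing a tag.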
In some specific applications, however, such taggers may be too generic to fit the problem requirements. Most tasks involve slightly different sets of feature functions, whose extraction may be impossible to perform on standard NLP collections if they have not been annotated to support it. This is the case for TTS synthesis, for which the features typically considered in mainstream NLP are not sufficient. Conventional POS tagging for TTS therefore tends to rely on rule-based systems, which can easily be developed from smaller, special-purpose databases. Such rule-based taggers tend to be more brittle than statistical models trained on large collections.
Given a natural language sentence comprising L words, POS tagging aims at assigning to each observed word w_i some suitable POS p_i, 1 ≤ i ≤ L. Representing the overall sequence of words by W and the corresponding sequence of POS by P, CRF taggers directly maximize the conditional probability Pr(P|W) over all possible POS sequences P. This is done via log-linear modeling of feature functions expressing important aspects of the empirical training distribution, as observed on a large annotated corpus. The size and pertinence of the training corpus are thus critical to the quality of the resulting models.
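The maximization of Pr(P|W) described above can be sketched in Python on a toy scale. The tag inventory, the EMISSION and TRANSITION weight tables, and the function names are all hypothetical and hand-set for illustration; a real CRF would estimate its weights from an annotated corpus, and would use dynamic programming rather than the exhaustive enumeration shown here, which grows exponentially with sentence length.

```python
import itertools
import math

TAGS = ["PRP", "VBZ", "VBG"]

# Hypothetical hand-set weights; a trained CRF would learn these.
EMISSION = {
    ("she", "PRP"): 2.0,
    ("is", "VBZ"): 2.0,
    ("coming", "VBG"): 2.0,
}
TRANSITION = {
    ("PRP", "VBZ"): 1.0,
    ("VBZ", "VBG"): 1.0,
}


def score(words, tags):
    """Sum of weighted feature functions for one (W, P) pair."""
    total = 0.0
    for i, (word, tag) in enumerate(zip(words, tags)):
        total += EMISSION.get((word.lower(), tag), 0.0)
        if i > 0:
            total += TRANSITION.get((tags[i - 1], tag), 0.0)
    return total


def tag_distribution(words):
    """Return Pr(P|W) for every candidate P by explicit normalization."""
    candidates = list(itertools.product(TAGS, repeat=len(words)))
    scores = [score(words, c) for c in candidates]
    # Log-linear model: Pr(P|W) = exp(score(W, P)) / Z(W).
    log_z = math.log(sum(math.exp(s) for s in scores))
    return {c: math.exp(s - log_z) for c, s in zip(candidates, scores)}


words = ["She", "is", "coming"]
dist = tag_distribution(words)
best = max(dist, key=dist.get)  # the POS sequence maximizing Pr(P|W)
```

With these weights, the highest-probability sequence is the one in which every word receives the tag its emission weight favors and every adjacent tag pair is rewarded by a transition weight.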
There is, however, an inherent trade-off between size and pertinence. Standard NLP corpora tend to be suitably extensive, but fairly generic in terms of supported tag set and associated annotation. Most of them use the default Penn Treebank POS tag set, which is not optimal for a TTS synthesis application. For example, consider the sentence:
    She is coming tomorrow, she is, she really is!
The three instances of the word “is” would normally be assigned the same tag (e.g., VBZ). Yet, they are realized in three different ways. The first instance is unaccented and reduced; the second one is accented; and the third one is unaccented but with full vowel quality. Any synthetic version not respecting these rendition patterns would not sound natural. It thus stands to reason that a TTS system would benefit from a POS assignment system which reflects such distinctions. At the very least, the first instance of “is” should be assigned a POS that typically carries no accent, such as auxiliary, and the second a POS that typically carries an accent, such as (non-modal) verb.
The problem is that special-purpose corpora created with such a specific application in mind tend to be too small for the reliable estimation of CRF parameters. This is why POS tagging for speech synthesis typically relies on rule-based taggers. They can easily take into account the kind of distinctions exemplified above, which a typical statistical POS tagger does not capture, including the case of the third instance of “is”, which is clearly very specific to the application at hand. On the other hand, they suffer from several potential drawbacks, including lack of portability, maintenance difficulties, and the risk of over-generalization from a small number of exemplars.
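To make the rule-based approach concrete, the following Python sketch hand-codes rules that separate the three instances of “is” in the example sentence. The labels AUX, VERB, and VERB_FULL, the function name tag_is_instances, and the suffix-based rules themselves are assumptions invented for this illustration; they are not taken from any actual TTS tagger.

```python
import re


def tag_is_instances(sentence):
    """Label each occurrence of "is" with a hypothetical prosody-aware tag."""
    # Split into word tokens and single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    labels = []
    for i, token in enumerate(tokens):
        if token.lower() != "is":
            continue
        prev_tok = tokens[i - 1].lower() if i > 0 else ""
        next_tok = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
        if next_tok.endswith("ing"):
            # Followed by an apparent participle: unaccented, reduced.
            labels.append("AUX")
        elif prev_tok.endswith("ly"):
            # Preceded by an apparent adverb: unaccented, full vowel quality.
            labels.append("VERB_FULL")
        else:
            # Otherwise treated as an accented main verb.
            labels.append("VERB")
    return labels


labels = tag_is_instances("She is coming tomorrow, she is, she really is!")
```

Rules of this kind are quick to write from a handful of exemplars, but they also show the over-generalization risk noted above: the first rule would label “is” as AUX in “She is interesting”, simply because the following word ends in “ing”.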