The exemplary embodiment relates to document processing and finds particular application in connection with a system and method for automatic generation of text sequences.
Automatic text generation finds application in assisting authors to create text by proposing a next word or phrase based on the text which has already been generated and historical data. This could save the author time by reducing the amount of typing needed. However, such systems may propose words or phrases which are not what the author intends or which do not fit the author's style of writing. Thus, they may be frustrating to the author or provide limited or no reduction in the typing time.
Automatic generation of whole natural language documents has been proposed using a global, rule-based planner which decides on the general outline of the text to be produced. A purely statistical approach, based on a Markovian assumption, is to sample the next word from the distribution conditioned on the last n−1 produced words, p(wn|w1 . . . wn−1). See, for example, Zach Solan, et al., “Unsupervised learning of natural languages,” Proc. Nat'l Acad. Sci. (PNAS), 102(33): 11629-11634 (2005).
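The Markovian sampling described above can be sketched in a few lines. The following is a minimal illustration, not the method of any cited reference; the function names and corpus are illustrative. It counts, for every (n−1)-word left context observed in a training corpus, the words that follow it, then generates text by repeatedly sampling a next word from the counts for the current context:

```python
import random
from collections import defaultdict

def build_ngram_model(words, n):
    """Map each (n-1)-word left context to the list of words observed after it.

    Repeated followers are kept, so random.choice over the list samples
    in proportion to the empirical distribution p(w | last n-1 words).
    """
    model = defaultdict(list)
    for i in range(len(words) - n + 1):
        context = tuple(words[i:i + n - 1])
        model[context].append(words[i + n - 1])
    return model

def sample_text(model, n, length, seed=None):
    """Generate up to `length` words by repeated next-word sampling."""
    rng = random.Random(seed)
    # Start from an arbitrary observed context.
    context = rng.choice(list(model.keys()))
    output = list(context)
    for _ in range(length - len(output)):
        followers = model.get(tuple(output[-(n - 1):]))
        if not followers:  # dead end: this context was never continued
            break
        output.append(rng.choice(followers))
    return " ".join(output)
```

Generation stops early when it reaches a context with no observed continuation, which is common for small corpora.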
Such an approach does not generally produce coherent text, except in a few rare cases, and is mostly used for parody.
The text obtained has the property that, locally, the phrases seem to make sense, but the overall result is nonsensical. As in other n-gram approaches, a potential solution is to enlarge the left context n. This makes the incoherence less acute, but has the drawback that the resulting text is almost an exact copy of one of the sequences in the original corpus: as the context grows, the corpus from which the substring statistics were computed offers fewer and fewer alternative continuations for each context.
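The effect of enlarging n can be made concrete with a small count. The snippet below is an illustrative sketch (the corpus and names are invented for the example): it counts, for each (n−1)-word context, how many distinct next words were observed. With a short context a word like "the" has several possible continuations; with a longer context almost every context pins down a single next word, so sampling can only replay the corpus:

```python
from collections import defaultdict

def continuation_counts(words, n):
    """For each (n-1)-word context, count the distinct next words observed."""
    followers = defaultdict(set)
    for i in range(len(words) - n + 1):
        followers[tuple(words[i:i + n - 1])].add(words[i + n - 1])
    return {ctx: len(nexts) for ctx, nexts in followers.items()}

words = "the cat sat on the mat and the dog sat on the rug".split()
# Short contexts leave choices; long contexts mostly force a unique next word.
bigram = continuation_counts(words, 2)    # context of 1 word
fourgram = continuation_counts(words, 4)  # context of 3 words
```

Here the bigram context ("the",) has four distinct continuations, while all but one of the 4-gram contexts have exactly one, which is the lack of alternatives noted above.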
There remains a need for a system and method which, given an input text sequence, generate textual documents with more useful outputs.