In natural language processing applications, such as translation between languages, one recurring task is breaking the input text into sentences. The general rule for breaking sentences is typically very simple: for example, a word that ends with a period is the end of a sentence. However, the general rule is often swamped by exceptions that can be both numerous and complex. For example, in English the string “Dr.” ends with a period but often is not the end of a sentence, because it is the abbreviation for the title “Doctor.” Strings like “U.S.”, “etc.”, and “A.D.” might or might not be the last word in a sentence. In some cases, determining whether a string is the end of a sentence involves looking at the words that precede and/or follow the word in question.
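The effect of applying only the general rule can be illustrated with a minimal Python sketch. The pattern and the function name `naive_breaks` are assumptions made for illustration; the pattern treats any whitespace-delimited word ending in a period as a sentence end, which is one possible reading of the general rule described above.

```python
import re

# Assumed encoding of the general rule: a run of non-space characters
# ending in a period, followed by whitespace or the end of the text.
WORD_ENDING_IN_PERIOD = re.compile(r"\S+\.(?=\s|$)")

def naive_breaks(text):
    """Return the offset just past every word that ends with a period."""
    return [m.end() for m in WORD_ENDING_IN_PERIOD.finditer(text)]

def split_at(text, breaks):
    """Cut the text at the given offsets and trim surrounding whitespace."""
    pieces, start = [], 0
    for end in breaks:
        pieces.append(text[start:end].strip())
        start = end
    return pieces

text = "Dr. Smith arrived. He sat down."
sentences = split_at(text, naive_breaks(text))
# The title "Dr." is wrongly split off as its own sentence:
# ["Dr.", "Smith arrived.", "He sat down."]
```

Under these assumptions, the rule alone yields three pieces for a two-sentence input, which is the kind of error the exceptions are meant to prevent.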
The rules to recognize the end of a sentence, as well as the exceptions, are often specified as regular expressions. Thus, in English the regular expression for the general end-of-sentence rule could specify that a matching string is one or more characters followed by a period (e.g., “.+\.”, in a typical regular expression syntax). Additionally, regular expressions can be used to specify the exceptions, e.g., “Dr\.” to match the abbreviation for “Doctor”, or “etc\. (?=[a-z])” to match “etc.” when followed by a word beginning with a lowercase letter. One approach to using the general rule and the exceptions together is to find a match for the general rule, and then evaluate the match to determine whether it also matches any of the exceptions. If there is a match for the general rule and for an exception, then the matching string is not the end of a sentence. If there is a match for the general rule but not for any exception, then the matching string is the end of a sentence. A problem with this technique is that it involves evaluating all of the exceptions for each string that matches the general rule. Thus, the number of operations involved in finding the sentence breaks may be proportional to the number of matches on the general rule multiplied by the number of exceptions, which becomes inefficient as the list of exceptions grows.
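The match-then-check approach described above can be sketched in Python as follows. This is an illustrative sketch, not a definitive implementation: the function name `split_sentences`, the two-entry exception list, and the word-level form of the general rule (`\S+\.` followed by whitespace or end of text, rather than the looser “.+\.”) are assumptions. The counter makes the cost visible: every general-rule match triggers an evaluation of every exception.

```python
import re

# Assumed word-level form of the general rule.
GENERAL_RULE = re.compile(r"\S+\.(?=\s|$)")

# Hypothetical exception list, using the two example patterns from the text.
EXCEPTIONS = [
    re.compile(r"Dr\."),
    re.compile(r"etc\. (?=[a-z])"),
]

def split_sentences(text):
    """Return (sentences, number of exception evaluations performed)."""
    sentences, start, exception_checks = [], 0, 0
    for m in GENERAL_RULE.finditer(text):
        is_exception = False
        # Evaluate all exceptions at the start of the matched word.
        for exc in EXCEPTIONS:
            exception_checks += 1
            if exc.match(text, m.start()):
                is_exception = True
        if not is_exception:
            sentences.append(text[start:m.end()].strip())
            start = m.end()
    tail = text[start:].strip()
    if tail:  # keep any trailing text that lacks a final period
        sentences.append(tail)
    return sentences, exception_checks

text = "Dr. Smith likes apples, pears, etc. and plums. He left."
sentences, checks = split_sentences(text)
# sentences == ["Dr. Smith likes apples, pears, etc. and plums.", "He left."]
# checks == 8  (4 general-rule matches x 2 exceptions)
```

Stopping at the first matching exception would reduce the count for words that are exceptions, but a word that is a true sentence end still has to be tested against every exception, so the worst case remains the product of the two counts.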