1. Field of the Invention
The present invention generally relates to proofreading. More specifically, the present invention relates to detecting real-word typos.
2. Background of the Related Art
Finding and fixing contextually incorrect and contextually inappropriate words is an important task in publishing, optical character recognition (OCR), end user word processing, translation, speech recognition, and other industries. Contextually incorrect and contextually inappropriate words are referred to as real-word typos, or typos for short. The use of word processing software allows such errors in typing and data entry to arise. Such errors may be exacerbated when word processing takes place in conjunction with speech recognition, OCR, translation software, etc.
Typos can still be found in newspapers, magazines, and books that are published nowadays. Emails, Wikipedia articles, and other electronic documents produced by end user word processing systems often contain typos, sometimes regrettable and embarrassing ones (e.g., “Thank you for your massage” instead of “Thank you for your message”). Documents produced by OCR may be prone to typos. Manually and automatically translated documents are prone to typos as well. The Internet is also full of typo collections from printed and electronic media.
Apart from causing potential embarrassment, loss of reputation, and inaccuracy and difficulty for readers, typos may result in monetary expenses for publishing, translation, and OCR service providers. These expenses may include hiring human proofreaders, editors, and/or style correctors. Expenses may further include issuing/publishing errata and apology/explanation letters. Typos also impose additional timing constraints on time-critical services, such as newspapers, where time must be allotted for proofreading articles already written on short deadlines. While online articles may be edited after publication, such articles (and their typos) may have already been copied and propagated by other news channels and blogs, which makes it extremely difficult, if not impossible, to correct all instances of the typos.
For humans, proofreading is a time-consuming task because the entire document or book has to be read even if few typos are present. In addition to being time-consuming, proofreading can also be very hard for the human eye and brain. Certain “stealth” typos may be especially hard to detect and therefore frequently escape notice.
In addition to human editors/proofreaders, computerized spell-checkers and grammar-checkers may also be used. Computerized spell-checkers and grammar-checkers are generally dependent on dictionaries or rules stored in a database. Such checkers, however, may fail to detect errors in word choice or usage. For example, a word may be spelled correctly according to some dictionary entry and used properly with respect to the rules of grammar, yet still represent an error. As such, spell- and grammar-checkers are unable to detect such an error. While human editors may be able to correct errors, they are generally unable to search for errors as quickly or as comprehensively as computerized checkers.
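The limitation described above can be illustrated with a minimal sketch of a dictionary-based check. The names `DICTIONARY` and `misspelled_words` and the tiny word list are hypothetical, chosen only for illustration; an actual spell-checker would use a far larger dictionary and additional rules.

```python
# Hypothetical miniature dictionary; a real checker would load a full word list.
DICTIONARY = {"thank", "you", "for", "your", "message", "massage"}


def misspelled_words(text):
    """Return the words of `text` that are absent from DICTIONARY."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return [w for w in words if w and w not in DICTIONARY]


# A non-word error is flagged because "mesage" is not in the dictionary.
print(misspelled_words("Thank you for your mesage"))

# A real-word typo passes unnoticed because "massage" is a valid entry.
print(misspelled_words("Thank you for your massage"))
```

Because the check consults only the dictionary, the second sentence yields no flagged words even though “massage” is contextually wrong.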
One presently available method for detecting such real-word typos is context-sensitive spelling correction (CSSC), which focuses on computer-fixable errors and relies on stored “confusion sets” of words. “Confusion sets” are sets of words that have been previously identified by a human editor as being commonly mistaken for one another. For example, “there,” “their,” and “they're” may be included in a confusion set. Such terms may be inadvertently or deliberately interchanged by a typist or a speech recognition program. CSSC identifies the words in the confusion sets that are present in a text. CSSC then determines, based on context, whether or not the correct word in the confusion set is used in the text. When a phrase belonging to a confusion set is found in the text, the phrase is suspected of being a typo. The phrases from the related confusion set may be automatically sorted by their likelihood of being a typo in the specific context in which the suspected phrase appears. The phrase from the confusion set that has the lowest likelihood of being a typo is selected. If the selected phrase is not the one that is currently being parsed, a typo is declared to be found in the parsed phrase. The typo is automatically corrected by replacing the parsed phrase with the selected phrase.
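The CSSC procedure described above may be sketched as follows. This is a simplified illustration under stated assumptions, not a definitive implementation: the confusion sets, the cue-word table `CONTEXT_CUES`, the scoring function `typo_likelihood`, and the window size are all hypothetical stand-ins for whatever context model a particular CSSC system uses.

```python
# Hypothetical confusion sets, as would be prepared in advance by a human editor.
CONFUSION_SETS = [
    frozenset({"message", "massage"}),
    frozenset({"piece", "peace"}),
]

# Hypothetical toy context model: words that tend to appear near each candidate.
CONTEXT_CUES = {
    "message": {"email", "reply", "thank", "send"},
    "massage": {"spa", "back", "muscle", "therapist"},
    "piece": {"cake", "paper", "puzzle"},
    "peace": {"war", "treaty", "world"},
}


def typo_likelihood(candidate, context):
    """Toy score in (0, 1]: fewer supporting cue words means more likely a typo."""
    hits = len(CONTEXT_CUES.get(candidate, set()) & set(context))
    return 1.0 / (1 + hits)


def cssc_correct(tokens, window=4):
    """Replace each confusion-set word with the least typo-like set member."""
    corrected = list(tokens)
    for i, token in enumerate(tokens):
        word = token.lower()
        for confusion_set in CONFUSION_SETS:
            if word not in confusion_set:
                continue  # only words stored in a confusion set are examined
            neighbors = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            context = [t.lower() for t in neighbors]
            # Sort by likelihood of being a typo; ties keep the word already used.
            ranked = sorted(
                confusion_set,
                key=lambda c: (typo_likelihood(c, context), c != word, c),
            )
            if ranked[0] != word:
                corrected[i] = ranked[0]  # typo declared and auto-corrected
    return corrected


print(cssc_correct("Thank you for your massage".split()))
```

In this sketch, “massage” is replaced by “message” because the nearby word “Thank” supports the latter. A word that belongs to no stored confusion set is never examined at all.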
CSSC therefore has limitations as a model. Firstly, it focuses on typo correction in addition to typo finding. As a result, CSSC solutions inherently focus on and are designed for finding only those typos that can be corrected by computer. This in turn narrows the number of typo types to be found. Secondly, CSSC implies and relies upon the concept of confusion sets, which further limits the number of typo types that can be found.
There is therefore a need for improved systems and methods for detecting real-word typos that do not rely on confusion sets and that find not only a limited set of confusion typos but almost any type of typo.