Computer word processing, using computer programs commonly called text editors or word processors, is a part of most people's daily lives. Many people write in multiple computer programs, such as e-mail clients, instant messengers, chats, etc. People commonly make writing mistakes, which can be purely technical mistakes, like typing a wrong letter, or mistakes originating from poor language knowledge, deficient literacy, or learning disabilities like dyslexia.
Spelling correction, either automatic or explicitly requested by the user, is a common event in word processing. Spellers (spelling software programs or program components, servers, hardware devices, etc.) either correct the misspelled words or suggest one or several correction candidates by testing each written word against a dictionary of known words. If a speller finds a written word that is not in the dictionary, it tries to suggest candidate words taken from the dictionary that are “closest” to the written word and normally differ from it by 1-2 letters. The most advanced spellers also suggest as candidates words that are spelled very differently from the misspelled word but are pronounced very similarly, thereby also performing so-called phonetic spell checking.
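The dictionary-based suggestion step described above can be sketched as follows. This is a minimal illustration only, assuming that "closest" means the Levenshtein (edit) distance and that the dictionary is a small in-memory set; the names `edit_distance`, `suggest` and `DICTIONARY` are hypothetical and do not come from any particular speller.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-letter
    insertions, deletions or substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def suggest(word: str, dictionary: set, max_distance: int = 2) -> list:
    """Return dictionary words within max_distance letters of word,
    closest first. A word found in the dictionary needs no suggestion."""
    if word in dictionary:
        return []
    scored = sorted((edit_distance(word, w), w) for w in dictionary)
    return [w for d, w in scored if d <= max_distance]


# Toy dictionary, assumed for illustration only.
DICTIONARY = {"friend", "fried", "meet", "meat", "like", "would"}

print(suggest("frend", DICTIONARY))  # ['friend', 'fried']
```

A written word that differs from every dictionary word by more than `max_distance` letters receives no suggestions at all in this sketch.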
Conventional spellers fail to correct, or to propose the right correction, when a written word contains several mistakes that make it too “distant” or “un-recognizable” from any word in the dictionary. This is what happens when conventional spellers attempt to work on texts written by people with dyslexia, for example. Yet another problem arises when a word is spelled correctly but is wrong in the context of the specific sentence, i.e. it is a confused word. Phonetic spellers are also not helpful, as they do not detect those confused words that are “homophones” (words that are pronounced similarly but have completely different meanings). For example, in the sentence “I would like to meat a friend” the word “meat” appears instead of “meet”; conventional spellers and spelling techniques will not recognize the “homophone” problem and will not fix it or propose any corrections.
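The homophone problem can be illustrated with a minimal sketch of a dictionary-only check. The function `unknown_words` and the toy `DICTIONARY` are assumptions made for illustration; the point is only that a check of each word against a dictionary has nothing to flag when the confused word is itself a valid word.

```python
# Toy dictionary, assumed for illustration only.
DICTIONARY = {"i", "would", "like", "to", "meat", "meet", "a", "friend"}


def unknown_words(sentence: str, dictionary: set) -> list:
    """Return the words of the sentence that are not in the dictionary,
    which is all a dictionary-only speller can detect."""
    return [w for w in sentence.lower().split() if w not in dictionary]


# "meat" is a valid dictionary word, so the confused word goes unnoticed.
print(unknown_words("I would like to meat a friend", DICTIONARY))  # []

# Only non-words are flagged.
print(unknown_words("I would like to mete a frend", DICTIONARY))  # ['mete', 'frend']
```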
Many commercial text-processing tools, editors and word processors have been examined and reviewed in detail in a recent review [1], where all the tools were found to be of little assistance in correcting severe spelling errors, e.g. those of children and adults with dyslexia. This review does not cover the rather recent Microsoft Office 2007, which claims “contextual spelling” capabilities. Since no data exploring this facility have been published yet, we have examined MS-Word-2007 with contextual spelling ourselves, using sentences mostly from the public sources [1]. The MS-Word-2007 speller corrects about 50% of the errors, whereas the approach invented by us brings the correction level to 90%, as can be seen with the speller embedded at www.ghotit.com when following the web-site's spelling instructions. No information has been published yet about the techniques used by Microsoft in MS-Office 2007 contextual spelling.
In order to provide a real solution to the above problems, a combination of spellers with the contextual meaning of the text, commonly known as context-sensitive spelling or context spelling [2], is required. Classical prior art like U.S. Pat. No. 5,956,739 or references [2] and [3] suggests context-spelling approaches that require prior or in-time training of the spellers on a so-called text corpus, which is a large and structured set of documents on the topics used by the writer. Such a corpus is commonly used by context-sensitive spellers to generate an index of words that are commonly used within the same context. The technology fails on rather short texts or on a text with a novel subject that is not part of the corpus documents used to train the speller; the technology also requires huge processing time and consumes a lot of computing resources like memory and CPU.
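A corpus-trained index of this kind can be sketched roughly as below. This is an assumed, simplified form (sentence-level co-occurrence counts over a three-sentence toy corpus) and not the method of any cited reference; `build_cooccurrence`, `context_score` and `CORPUS` are hypothetical names.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(corpus: list) -> Counter:
    """Count how often two words occur in the same sentence of the corpus."""
    index = Counter()
    for sentence in corpus:
        words = sorted(set(sentence.lower().split()))
        for pair in combinations(words, 2):
            index[pair] += 1
    return index


def context_score(candidate: str, context: list, index: Counter) -> int:
    """Sum the co-occurrence counts of the candidate with each context word."""
    return sum(index[tuple(sorted((candidate, w)))]
               for w in context if w != candidate)


# Toy training corpus, assumed for illustration only.
CORPUS = [
    "we meet a friend for lunch",
    "they meet a friend at school",
    "she cooks meat for dinner",
]
index = build_cooccurrence(CORPUS)

context = ["i", "would", "like", "to", "a", "friend"]
print({c: context_score(c, context, index) for c in ("meet", "meat")})
# {'meet': 4, 'meat': 0}

# A subject absent from the corpus gives no evidence for either candidate.
novel = ["quantum", "entanglement"]
print({c: context_score(c, novel, index) for c in ("meet", "meat")})
# {'meet': 0, 'meat': 0}
```

The second call shows the limitation noted above: for a context not covered by the training corpus, every candidate scores zero and the index cannot discriminate between them.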
People of different ages, professions and hobbies write texts on different subjects and in different contexts, using different words and even slang. Even a very large corpus cannot encompass the full spectrum of human life or be dynamic enough to cover all types of contexts. The WWW, in contrast, contains all varieties of texts, represents all types of subjects and slang, and is constantly updated.
U.S. Pat. Nos. 6,401,084, 7,050,992, 7,194,684, 7,254,774, and US Patent Applications 2005/0257146, 2006/0161520 and 2007/0106937 describe the usage of user search query logs for spelling correction of the queries and even of word-processing documents. In many cases the number of occurrences in the logs can help to propose candidate words for correction and to provide a certain context for spelling. The deficiency of this prior art is that user queries are mainly limited to only two or three words, which in many cases may be insufficient to provide context depth for a regular sentence and does not enable the adoption of powerful context spelling approaches like N-grams, grammar binding and POS-tagging.
U.S. Pat. Nos. 6,618,697, 6,349,282 and many others teach the usage of N-grams for spelling correction. The techniques may be successful for speech recognition or OCR, but fail on heavily dyslexic texts, where each word of a phrase may be misspelled or confused, so that the phrase actually consists only of illegal N-grams. Such techniques of N-gram usage potentially push out correct but statistically less commonly used words and phrases, making spelling correction in many cases less successful than the corrections of conventional spell checkers. Yet another deficiency of many other techniques involving N-grams is the usage of an insufficiently large N-gram database, which seriously deteriorates the quality of text correction.
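The failure on phrases consisting only of illegal N-grams can be seen in a minimal bigram sketch. The scoring rule (sum of the counts of the two bigrams around a candidate) and the toy corpus are assumptions for illustration and are not taken from the cited patents.

```python
from collections import Counter


def bigram_counts(corpus: list) -> Counter:
    """Count adjacent word pairs (bigrams) over all corpus sentences."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.lower().split()
        counts.update(zip(tokens, tokens[1:]))
    return counts


def score(prev_word: str, candidate: str, next_word: str, counts: Counter) -> int:
    """Score a candidate by the bigrams it forms with its two neighbours."""
    return counts[(prev_word, candidate)] + counts[(candidate, next_word)]


# Toy corpus, assumed for illustration only.
COUNTS = bigram_counts([
    "i would like to meet a friend",
    "we like to meet a colleague",
    "he likes to eat meat",
])

# Correctly spelled neighbours: the context separates the candidates.
print(score("to", "meet", "a", COUNTS), score("to", "meat", "a", COUNTS))    # 4 0

# Misspelled neighbours: every bigram is unseen, so the scores tie at zero.
print(score("too", "meet", "ay", COUNTS), score("too", "meat", "ay", COUNTS))  # 0 0
```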
Prior art like DE 102005026352 and US Patent Application 2002/0194229 describes the usage of the Internet for correcting the spelling of a word by generating a number of possible candidates for a misspelled word, performing an Internet search with a search engine for the word and its candidates, and taking the candidates with the maximum number of Internet occurrences as the best candidates for spelling correction. The art also describes the usage of a network database to store the results, and their caching with timely updates. The art does not teach any context-aware spell checking using an N-grams database, grammar binding or POS-tagging, nor the usage of these methods in combination with editing and phonetic distances.
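The occurrence-based ranking with cached results might look roughly like the following sketch. No real search engine is queried: `FAKE_HIT_COUNTS` is a hypothetical stand-in with invented numbers, and `hit_count` and `best_candidates` are illustrative names, not an API of the cited art.

```python
from functools import lru_cache

# Invented occurrence counts standing in for search-engine results.
FAKE_HIT_COUNTS = {"recieve": 1200, "receive": 950000, "relieve": 310000}


@lru_cache(maxsize=None)
def hit_count(term: str) -> int:
    """Hypothetical lookup of how often a term occurs on the Internet.
    lru_cache plays the role of the result cache described above."""
    return FAKE_HIT_COUNTS.get(term, 0)


def best_candidates(word: str, candidates: list, top: int = 1) -> list:
    """Rank the written word and its candidates by occurrence count."""
    ranked = sorted([word] + candidates, key=hit_count, reverse=True)
    return ranked[:top]


print(best_candidates("recieve", ["receive", "relieve"]))  # ['receive']
```

The ranking considers each word in isolation, so the surrounding sentence has no influence on which candidate is chosen.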
Other prior art, described in WO 2007/094684, presents the usage of context-sensitive spell checking in an Optical Character Recognition (OCR) process for confirming uncertainly recognized words by submitting the correction alternatives of the words, in combination with one or more previous and subsequent words, as queries to an Internet search engine. The word candidates whose text strings have the highest number of occurrences on the Internet are the best candidates to be used. This art likewise does not teach any context-aware spell checking using an N-grams database, grammar binding or POS-tagging, nor the usage of these methods in combination with editing and phonetic distances.
The art describes performing a look-up against an Internet search engine directly from a client, as a near real-time process, without any caching of results, database or in-memory database involved. Such a process may be acceptable for OCR, where uncertainly recognized words are rather rare, whereas the usage of such an approach for word-processing spelling, with a large number of possible mistakes and a high number of potential candidate words to consider, is practically not possible due to the following factors:
Search engine abuse;
Heavy network load;
High response time, where near-instant spelling correction is required.
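The scale of the problem can be made concrete with a small sketch. It assumes one search-engine query per candidate, each formed from the candidate and its neighbouring words; `context_queries` and `lookups_needed` are hypothetical helpers, and the word and candidate counts in the example are invented for illustration only.

```python
def context_queries(prev_word: str, candidates: list, next_word: str) -> list:
    """Build one quoted phrase query per candidate, using the previous
    and subsequent word as context."""
    return [f'"{prev_word} {c} {next_word}"' for c in candidates]


def lookups_needed(uncertain_words: int, candidates_per_word: int) -> int:
    """Uncached search-engine lookups required for one document."""
    return uncertain_words * candidates_per_word


print(context_queries("to", ["meet", "meat"], "a"))
# ['"to meet a"', '"to meat a"']

# Invented figures: an OCR page with few uncertain words versus a
# heavily misspelled text with many candidates per word.
print(lookups_needed(3, 5))    # 15
print(lookups_needed(40, 20))  # 800
```

Under these assumed figures, the word-processing case issues dozens of times more uncached queries per document than the OCR case, which is what leads to the load and latency factors listed above.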