This disclosure relates generally to the field of language globalization, and more particularly to real-time web content correction.
Word corruption is one of the most expensive and challenging problems in managing web-based content that includes double-byte or multi-byte characters, for example Chinese (Simplified and Traditional), Japanese, and Korean (CJK), encoded in different coded character sets. Word corruption is a text presentation problem that occurs when web applications or other text rendering applications, such as those running on eBook readers, tablets, or smartphones, render text under an incorrect character encoding. Word corruption is often seen when text is moved between computers that have different default encodings. If the encoding is not specified, it is up to the software, for example the operating system or the application, to use another means to attempt to render the text correctly. Such means may include an operating system setting or charset detection, which uses statistical analysis of byte patterns to determine the character encoding.
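The effect of rendering text under an incorrect encoding can be illustrated with a minimal Python sketch. The sample string, the choice of UTF-8 as the stored encoding, and the choice of Latin-1 as the wrongly assumed encoding are assumptions made for illustration only; they are not taken from this disclosure.

```python
# Illustrative sketch only: sample text and encodings are assumptions.
text = "中文"                     # two CJK characters
raw = text.encode("utf-8")        # the bytes as they might be stored or transmitted

# Decoding with the encoding that was actually used recovers the text.
correct = raw.decode("utf-8")

# Decoding with a wrong single-byte encoding maps every byte to its own
# character, so the two CJK characters become six unrelated characters.
garbled = raw.decode("latin-1")

print(raw.hex())                  # the underlying bytes are unchanged
print(len(text), len(raw), len(garbled))
print(correct == text, garbled == text)
```

The bytes themselves are never modified in this sketch. Only the interpretation differs, which is why this kind of corruption can be reversed once the correct encoding is known.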
Word corruption falls into two major categories. The first category, unreadable webpage content, is caused by incorrect or inconsistent lang, script, and charset settings in the header and metadata of the webpages. The second category, corrupted data, occurs when invalid bytes change the hexadecimal byte sequence of a string, such as data in a file, while the data is being processed, transferred, or stored. In the corrupted data category, an invalid byte is an additional byte or a missing byte in a double-byte or multi-byte character. The first category of word corruption may temporarily inconvenience a user viewing a webpage or using a mobile device. However, in the corrupted data category, invalid bytes in hex data strings may alter the meaning of the contents of a file, such as a text document.
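The second category can be sketched in Python as follows. The helper name `is_valid`, the sample string, and the use of GBK as the double-byte encoding are hypothetical choices for illustration and are not part of this disclosure. The sketch shows that a single missing byte shifts the pairing of every later byte, so the remaining bytes decode to different characters rather than simply losing one character.

```python
# Illustrative sketch only: helper name, sample text, and GBK are assumptions.
def is_valid(data: bytes, encoding: str) -> bool:
    """Return True if data decodes cleanly under the given encoding."""
    try:
        data.decode(encoding)     # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False

text = "中文字符"                  # four CJK characters
raw = text.encode("gbk")          # two bytes per character, eight bytes in total

# Simulate corrupted data: the first byte is lost in transfer or storage.
damaged = raw[1:]

# A lenient decode does not fail; it pairs the surviving bytes differently
# and substitutes U+FFFD where no valid character can be formed.
shifted = damaged.decode("gbk", errors="replace")

print(raw.hex(), is_valid(raw, "gbk"))
print(damaged.hex(), is_valid(damaged, "gbk"))
print(shifted == text)
```

A strict decode, as used in `is_valid`, is one simple way to flag such data. It detects the damage but does not by itself indicate which byte was added or lost.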