The invention relates generally to a method for determining an encoding used for a sequence of bytes. The invention relates further to an encoding detection system for determining an encoding used for a sequence of bytes, and a related computer program product.
Since the beginning of the computer age, many encoding schemes for characters and symbols have been created to represent the characters of various writing scripts in computerized data. With the advent of globalization and the development of the Internet, information exchange across both language and regional boundaries is becoming ever more important. Besides Unicode, which is designated as the default encoding to provide convenient and unified communication, many other character sets or code pages co-exist, each for its own purpose. With a known code page or character set, information and content can be processed properly.
However, there is still a large amount of content with unknown or incorrect code page or character set indicators. The value of this content can be discovered when it is processed with the proper code page. Several approaches are available to detect the correct encoding for documents with unknown or incorrect encoding indicators. They all have various strengths and weaknesses. One family of approaches uses machine learning, but requires training and has limited quality when it comes to detecting differences between related encodings. Other approaches use dictionaries to test every possible code page or character set for given code points. Although it is possible to find an appropriate character set in this way, it is expensive. In addition, some multi-byte encodings, such as EUC-CN and EUC-KR, share almost identical code points, and it is very hard to distinguish among such encodings with this method.
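By way of illustration only, the ambiguity between EUC-CN and EUC-KR can be seen in a minimal Python sketch that tries a list of candidate encodings on a byte sequence. The function name `decodable_candidates`, the candidate list, and the sample bytes are assumptions chosen for this example and do not reproduce any particular known approach; the sketch relies only on the codecs shipped with the Python standard library, where `gb2312` is the codec name corresponding to EUC-CN.

```python
def decodable_candidates(data: bytes, candidates):
    """Return {encoding: decoded text} for every candidate that decodes without error.

    Hypothetical helper for illustration; a successful decode only shows that
    the bytes are valid in that encoding, not that the encoding is correct.
    """
    results = {}
    for name in candidates:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            # The byte sequence is not valid in this encoding; rule it out.
            continue
    return results


# A two-byte sequence that is a valid code point in both EUC-KR and EUC-CN.
sample = b"\xb0\xa1"
hits = decodable_candidates(sample, ["euc_kr", "gb2312", "utf_8"])

for encoding, text in hits.items():
    print(encoding, repr(text))
```

In this sketch the sample is rejected as UTF-8 but decodes cleanly as both EUC-KR and EUC-CN, yielding a different character in each case, so validity checking alone cannot decide between the two encodings.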
There are several such families of code pages whose members differ only in a small set of often rarely used characters (e.g., the family of the Latin encodings). While many algorithms exist that can identify code pages or character sets, they often identify only the family of the code page correctly and make systematic errors in determining the exact family member.
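To make the notion of closely related family members concrete, the following Python sketch lists the single-byte values that decode differently under two members of the Latin family, ISO-8859-1 and ISO-8859-15. The choice of these two encodings and the helper name `differing_bytes` are assumptions made for this illustration only; the sketch uses the standard Python codecs `latin_1` and `iso8859_15`.

```python
def decode_or_none(value: int, encoding: str):
    """Decode one byte value in the given encoding, or return None if undefined."""
    try:
        return bytes([value]).decode(encoding)
    except UnicodeDecodeError:
        return None


def differing_bytes(enc_a: str, enc_b: str):
    """Return (byte value, char in enc_a, char in enc_b) for bytes that decode differently.

    Hypothetical helper for illustration of how small the difference
    between two related single-byte encodings can be.
    """
    diffs = []
    for value in range(256):
        char_a = decode_or_none(value, enc_a)
        char_b = decode_or_none(value, enc_b)
        if char_a != char_b:
            diffs.append((value, char_a, char_b))
    return diffs


diffs = differing_bytes("latin_1", "iso8859_15")
for value, char_a, char_b in diffs:
    print(f"0x{value:02X}: {char_a!r} vs {char_b!r}")
```

Only a handful of the 256 byte values differ between these two encodings, for example 0xA4, which is the currency sign in ISO-8859-1 and the euro sign in ISO-8859-15. A text that contains none of the differing bytes decodes identically under both, which is why a detector may find the family but not the exact member.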