A character encoding may determine how raw data is represented as textual data for purposes of processing, rendering, and/or analyzing the data as text (e.g., by mapping data values to characters). Ordinarily, in order to correctly process textual data, traditional systems for reading textual data may either assume a specified character encoding by convention (e.g., assume that any input text is encoded using a predetermined character encoding) or identify metadata attached to the textual data that specifies a character encoding for the textual data.
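As a minimal illustration of the mapping described above, the following Python sketch decodes the same byte value under two standard encodings and shows a reader that relies on either attached metadata or a conventional default. The helper name `read_text`, its `declared` parameter, and the choice of UTF-8 as the default are assumptions made for illustration only and are not taken from any particular system.

```python
from typing import Optional

# The same raw byte maps to different characters under different encodings.
raw = bytes([0xE9])
as_latin1 = raw.decode("latin-1")  # Latin small letter e with acute
as_cp1251 = raw.decode("cp1251")   # Cyrillic small letter short i


def read_text(data: bytes, declared: Optional[str] = None,
              default: str = "utf-8") -> str:
    """Hypothetical reader: use the declared encoding (metadata) if one is
    present, otherwise fall back to an encoding assumed by convention."""
    return data.decode(declared or default)


print(repr(as_latin1), repr(as_cp1251))
print(read_text(b"caf\xc3\xa9"))                      # decoded as UTF-8 by convention
print(read_text(b"caf\xe9", declared="latin-1"))      # decoded using the metadata
```

If the declared encoding is missing and the conventional default is wrong for the data, `read_text` either raises a decoding error or silently produces incorrect characters, which is the failure mode that motivates detection.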
Unfortunately, in some cases metadata that specifies a character encoding for textual data may be absent or incorrect. Traditional character encoding detection systems may analyze the raw textual data for patterns (e.g., for recurring byte sequences used in text with known character encodings) in order to guess the correct character encoding for the textual data from among hundreds of standardized character encodings. However, these traditional character encoding detection systems may operate with substantial limitations. For example, some character encodings may use the same or similar mappings for some characters, potentially causing false positives. In some cases, traditional character encoding detection systems may mistake a textual document using a character encoding that includes two character sets (e.g., Latin characters and Han characters) for a textual document using a character encoding that includes only one of the character sets (e.g., only Latin characters) but that uses the same mapping for that character set. As a result, these traditional character encoding detection systems may frequently fail to detect the correct character encoding for a textual document with multiple languages, especially when the majority of the document uses only one of the languages. Accordingly, the instant disclosure identifies and addresses a need for additional and improved systems and methods for detecting character encodings of text streams.
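The false-positive scenario described above can be sketched in Python. The example assumes a mostly Latin document containing a few Han characters, encoded as GBK, and uses Latin-1 as the competing single-character-set candidate; the sample text and variable names are hypothetical. Because the Latin portion is byte-identical under both encodings, and Latin-1 assigns a character to every byte value, a check that only asks whether the data decodes without error would accept the wrong encoding.

```python
# Hypothetical sample: mostly Latin text with two Han characters at the end.
text = "Quarterly report for the Shanghai office: \u4e0a\u6d77"
raw = text.encode("gbk")

# The Latin (ASCII-range) portion is byte-identical under both encodings.
latin_part = "Quarterly report"
same_prefix = latin_part.encode("gbk") == latin_part.encode("latin-1")

# Decoding with the correct encoding round-trips the original text.
as_gbk = raw.decode("gbk")

# Decoding with Latin-1 never raises, so it "succeeds", but the Han
# characters come out as unrelated accented letters and symbols.
as_latin1 = raw.decode("latin-1")

# Only a small fraction of the bytes distinguishes the two candidates.
high_bytes = sum(1 for b in raw if b >= 0x80)

print(same_prefix)
print(as_gbk == text, as_latin1 == text)
print(f"{high_bytes} of {len(raw)} bytes fall outside the ASCII range")
```

The fewer non-Latin characters the document contains, the smaller the share of bytes that can reveal the mismatch, which is consistent with the observation that detection tends to fail when most of a document uses only one language.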