1. Field
The present invention relates generally to computer security functions, such as authentication and encryption in computers and networks, and more particularly, to an apparatus and method for cryptographic operations using enhanced knowledge factor credentials.
2. Description of the Related Art
Network devices and other types of computers typically authenticate the identity of a user by verifying one or more of three factors: knowledge, possession, and inherence. Knowledge factor authentication relies on something the user knows, such as account credentials, to authenticate the user before authorizing access to computer and/or network resources. Such credentials, in the form of a username and password, have been used extensively since the early 1960s. Since the advent of credential-based security, interlopers have exploited various weaknesses in these systems, a problem which has reached epic proportions today. Because the vast majority of systems today are protected by single-factor authentication of credentials, the flaws inherent in this scheme have left many systems vulnerable to attack by hackers and criminals.
Typical credential-based security has competing requirements that the password be (1) sufficiently complex and long to prevent guessing and (2) memorable enough so that a unique password is used for each system. In practice, users have difficulty meeting both requirements, and tend to use the shortest and simplest password they can remember for most of their accounts. Hackers have caused incalculable damage by exploiting this behavior.
A major shortcoming of credential-based security is the inability to distinguish an attacker who possesses the credentials from the real user. A hacker who can guess or otherwise discover the credentials can compromise the system and freely access the protected data and/or resources, often without detection. Therefore, attackers have spent considerable effort developing creative ways to determine a user's credentials. Many users simply choose a password that is too easily guessed, and tend to use the same password in many systems, compounding the danger of compromise.
Some systems known in the art use a knowledge factor credential to generate or derive an encryption key used to encrypt data in transit, at rest, or both. Such systems are also vulnerable because compromise of the credential can permit an attacker to decrypt the encrypted data and thereby gain unfettered access to the underlying data. As used herein, credential-based security systems include any computer or network security function, e.g., authentication and encryption, that relies at least in part on knowledge factor credentials for its security.
One drawback of conventional credential-based security systems is the relatively small set of characters that are allowed in each credential, which is referred to as a “password” herein. Historically, passwords have been limited to a subset of the American Standard Code for Information Interchange (ASCII), which was created in the 1960s for teletypes. ASCII was based on the English language and originally encoded 33 control characters and 95 printable characters as 7-bit integer numbers. Computers, however, typically store data in 8-bit units called bytes or octets (and multiples thereof, e.g., 16-bit words, 32-bit double words, and 64-bit quad words). An ASCII character is conventionally encoded in a byte using the least significant seven bits, where the most significant bit (MSB) is zero. For example, the number ‘5’ is encoded as 00110101 in binary, or 53 decimal (0x35 hexadecimal). By convention, hexadecimal numbers herein will be prefixed with “0x” or otherwise denoted as such. FIG. 1 is the ASCII table, with the decimal and hexadecimal value for each ASCII character. The ASCII control characters have the code values 0 through 31 and 127 (decimal) and are not usable in many conventional credentials.
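The byte layout described above can be checked with a short Python sketch. The variable names are illustrative only and are not part of any described system.

```python
# Illustrative sketch: how the ASCII character '5' occupies one byte,
# and how the 128 ASCII codes split into control and printable sets.
code = ord("5")              # numeric value of the character
bits = format(code, "08b")   # 8-bit binary string; the MSB is zero

# Control characters are codes 0 through 31 and 127; the rest are printable.
control_codes = [c for c in range(128) if c < 32 or c == 127]
printable_codes = [c for c in range(128) if 32 <= c <= 126]

print(code, hex(code), bits)
print(len(control_codes), len(printable_codes))
```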
Some credential-based security systems known in the art allocate one byte (8 bits) for each password character and allow only a subset of the 95 printable ASCII characters in the password. For example, many systems do not permit the space (ASCII character 32) and certain other punctuation characters. Therefore, an attacker need only test permutations of the allowed characters, disregarding the bytes representing the ASCII control characters and the disallowed printable ASCII characters. There are n^m possible passwords that are m characters in length formed from a set of n characters, absent restrictions on reuse or repetition of characters. Accordingly, each impermissible character reduces n, and with it the number of possible passwords, by a factor that compounds with password length, thereby reducing the strength of conventional security systems. Even those security systems that permit all 95 printable ASCII characters in the password are still significantly disadvantaged because a byte can represent 2^8, or 256, values, and in these systems, there are 161 byte values that never occur in a password (0-31, 127, and 128-255 decimal). In other words, most passwords known in the art contain at most only 37.1% (95/256) of the possible byte values. Hackers take advantage of this relative paucity of values to mount efficient “brute force” password attacks, in which computers try every permutation of passwords containing the permitted password characters. Another common attack relies on “rainbow tables” of precomputed hashes. If an attacker can obtain an account database of password hashes, and the cryptographic hash function used in the security system is known, a table of hashes using the same algorithm can be generated from billions of passwords, which attackers compare to the purloined hashes. If a match is found, the cryptographic hash function has been effectively reversed and the password discovered.
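The arithmetic in the preceding paragraph can be reproduced with a minimal Python sketch. The helper names `keyspace` and `entropy_bits` are hypothetical, and the eight-character length is chosen purely for illustration.

```python
import math

def keyspace(n_chars: int, length: int) -> int:
    """Number of passwords of the given length over n_chars symbols (n^m)."""
    return n_chars ** length

def entropy_bits(n_chars: int, length: int) -> float:
    """Equivalent size of the search space in bits, i.e. log2(n^m)."""
    return length * math.log2(n_chars)

# Byte values that never occur when only printable ASCII is allowed:
# 0-31, 127, and 128-255.
unused_byte_values = [b for b in range(256) if b < 32 or b >= 127]
printable_share = round(95 / 256 * 100, 1)

print(len(unused_byte_values), printable_share)
# Dropping a single allowed character shrinks the 8-character space noticeably.
print(keyspace(95, 8), keyspace(94, 8), keyspace(256, 8))
print(round(entropy_bits(95, 8), 2), entropy_bits(256, 8))
```

Under these assumptions, an eight-character password drawn from all 256 byte values would span 64 bits of search space, against roughly 52.6 bits for the 95 printable ASCII characters.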
Other state-of-the-art credential-based security systems expand the set of allowable password characters to those in a “code page.” A code page is a set of single-byte characters in which the lower set of 128 characters (MSB of zero) is usually the same as in ASCII, and additional characters occupy the upper set of 128 characters (MSB of one). These upper values are typically used for accented characters found in languages other than English and for other punctuation, currency, and mathematical symbols. Microsoft® Windows® is an example of an authentication system that allows password characters from several sources, including the active code page and the Windows 1252 code page. FIG. 2 shows the characters in code page 437, which was included with the original IBM PC, and is the active code page when the Microsoft Windows locale is set to United States. There are hundreds of code pages, particularly to represent languages other than English, each of which is incompatible with the others. Although the use of characters from code page 437 for Windows passwords theoretically increases the strength of Windows passwords, in practice very few users take advantage of the additional allowable characters for several reasons. First, virtually no Windows users are aware of this capability, and few users are familiar with the “Alt-codes” that are necessary to enter characters not found on the keyboard. For example, to include ‘è’ in a code page 437-based password, the user presses the ‘Alt’ key while entering the number 138 on the numeric keypad. To use a character from the Windows 1252 code page, the user can use an alternate form of the Alt-code, in which a leading zero precedes the number. For example, to select the euro sign €, the user presses the Alt key while entering the number 0128 on the numeric keypad.
This is cumbersome because the user has to memorize the Alt-codes for two different code pages, and as a result, this facility is rarely used. Another reason why Windows passwords do not typically contain extended characters from code pages is that they are only usable on computers with the same active code page. Moreover, such passwords are incompatible with other systems that use a different code page or yet another character encoding scheme. In addition, not all credential-based security systems will accept characters with the numeric values 0 to 31 and 127 because of their assignment as nonprintable control codes in ASCII.
A major problem of the hundreds of code pages and disparate character encoding schemes in use is incompatibility. For 8-bit character encoding schemes, a particular character, especially in the upper set of 128 characters, might be encoded differently, or worse yet, not be encoded at all. The result is that the vast majority of conventional credential-based security systems only allow the 95 printable ASCII characters, and more commonly only a subset thereof, because those characters are generally universally available as the lower set of 128 characters in the code page, even in countries or locales where the native language is not English.
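The incompatibility between code pages can be illustrated with a brief Python sketch that uses the standard `cp437` and `cp1252` codecs. The byte values are the ones behind the Alt-codes 138 and 0128; the variable names are illustrative.

```python
# Illustrative sketch: the same byte decodes to different characters
# depending on which code page is assumed.
raw = bytes([138])
as_cp437 = raw.decode("cp437")     # character at position 138 in code page 437
as_cp1252 = raw.decode("cp1252")   # character at position 138 in Windows 1252

euro_byte = bytes([128])
euro_cp1252 = euro_byte.decode("cp1252")
same_byte_cp437 = euro_byte.decode("cp437")

# A character may not be encodable at all in another code page.
try:
    "\u20ac".encode("cp437")
    euro_in_cp437 = True
except UnicodeEncodeError:
    euro_in_cp437 = False

print(as_cp437, as_cp1252, euro_cp1252, same_byte_cp437, euro_in_cp437)
```

A password typed as byte 138 on one system would therefore be read as a different character, or rejected, on a system that assumes another code page.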
In the late 1980s, the Unicode Consortium developed a “unique, unified, universal” character encoding system called Unicode®, which endeavored to encode every character in all scripts used in the world's writing systems. When the Unicode standard was first published in October 1991 (version 1.0), its designers believed that every modern script could be encoded in fewer than 65,536 characters and thus chose to encode Unicode characters as 16-bit integer values called “code points,” with a code space of 0x0000-FFFF. The first 256 code points (0x0000-00FF) correspond to the encodings in the ISO 8859-1 8-bit code page, of which the first 128 characters (0x00-7F) are identical to ASCII. The Universal Coded Character Set (UCS) defined in International Standard ISO/IEC 10646 contains the same character encodings as Unicode. Later versions of Microsoft Windows permit the inclusion of certain Unicode characters in the Windows password through a third type of Alt-code, in which the user presses the Alt key, then the + key on the numeric keypad, then the four-digit hexadecimal value of the code point using the numeric keypad. Unfortunately, Windows down-converts some Unicode characters selected in this manner to ASCII characters, making this input method inadequate. Because this technique does not display Unicode characters for the user to select, it is not user-friendly and is even more rarely used than selecting characters from code pages with other forms of Alt-codes.
In 1996, Unicode 2.0 expanded the code space to the range 0x0000-10FFFF, with 1,114,112 available code points, divided into seventeen planes, numbered 0 to 16, each containing 65,536 code points. The original set of 16-bit Unicode characters is Plane 0, and was renamed the Basic Multilingual Plane (BMP). Presently, only five other planes are in use: Plane 1 (Supplementary Multilingual Plane, 0x10000-1FFFF), Plane 2 (Supplementary Ideographic Plane, 0x20000-2FFFF), Plane 14 (Supplementary Special-purpose Plane, 0xE0000-EFFFF), and Planes 15 and 16 (Supplementary Private Use Areas A and B, 0xF0000-FFFFF and 0x100000-10FFFF, respectively). Each plane is divided into blocks, which are always a multiple of 16 code points in size, and are uniquely named. For example, the first block in the BMP, containing 128 code points from 0x0000 to 007F, is named Basic Latin. The most recent version, Unicode 8.0, contains 262 blocks and encodes over 120,000 characters, with capacity for over 800,000 more.
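Because each plane holds 0x10000 code points, the plane number of a code point is simply its value divided by 0x10000. A minimal Python sketch follows; the function name `plane_of` is hypothetical.

```python
PLANE_SIZE = 0x10000    # 65,536 code points per plane
PLANE_COUNT = 17        # planes 0 through 16
MAX_CODE_POINT = 0x10FFFF

def plane_of(code_point: int) -> int:
    """Return the plane number (0-16) that contains the code point."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("outside the Unicode code space")
    return code_point >> 16  # same as integer division by 0x10000

print(PLANE_SIZE * PLANE_COUNT)        # total size of the code space
for cp in (0x0041, 0x1F600, 0x20000, 0xE0001, 0x10FFFF):
    print(f"U+{cp:04X} is in plane {plane_of(cp)}")
```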
A key design principle of Unicode is that code points uniquely represent abstract characters, not glyphs. Glyphs are visual graphic forms suitable for rendering or printing, with many possible variants and styles for the same character. For example, within a typeface, e.g., Times New Roman, there are many fonts, such as 12 point regular, where each font is a set of glyphs. Encoding the English alphabet as characters requires 26 uppercase and 26 lowercase characters, but there are thousands of glyphs within the many fonts that represent those characters in print and electronic display. Unicode defines abstract characters that are rendered into glyphs, but does not define the appearance of the glyphs themselves, which can vary by typeface and other factors.
The Unicode repertoire also contains thousands of symbols used in mathematics, science, music, and games as well as characters for emoji, a relatively new form of non-verbal communication that uses small, iconic graphics to convey emotions (emoticons) and other symbols. Unicode contains many characters encoding other elemental glyphs that alter the spelling, meaning, pronunciation, or accentuation of base characters, such as Latin letters. For example, diacritical marks (characters) are glyphs that are added to a letter character, such as the ´ (acute), ` (grave), and ¨ (diaeresis) accents. Diacritical marks are but one type of combining mark (also called combining characters and non-spacing marks) in Unicode. Nearly all of the characters in Unicode are graphic characters, that is, characters that either represent a glyph, or cause a visible spacing between glyphs. There are 120,520 graphic characters in Unicode 8.0.
There are 152 format characters in Unicode 8.0, which are invisible but affect the appearance or combination behavior of adjacent graphic characters. For example, the ZERO WIDTH JOINER (ZWJ) format character (by convention, code points are referenced by “U+” followed by their hexadecimal value, here U+200D) is used to indicate that two characters should be connected when ZWJ is placed between them, particularly in Arabic and Indic scripts. Conversely, the ZERO WIDTH NON-JOINER (ZWNJ) format character (U+200C) when placed between two characters indicates that those characters should not be connected, e.g., to break a cursive connection in Arabic scripts. ZWNJ can also be used to prevent ligatures from forming, for example, in certain German words. Many of the Unicode format characters affect the behavior of word, line and paragraph breaks, and other formatting.
There are 65 control codes in Unicode, inherited from the encodings in ISO 8859-1, ranging from U+0000 to U+001F and from U+007F to U+009F. Additionally, there are 66 values never used for encoding characters in Unicode: from U+FDD0 to U+FDEF in the BMP and any code point ending in the values 0xFFFE or 0xFFFF.
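The counts of control codes and of values never used for encoding characters follow directly from the ranges given above, as the following Python sketch shows. The list names are illustrative; the check against the `Cc` general category relies on the standard `unicodedata` module.

```python
import unicodedata

CODE_SPACE = range(0x110000)  # U+0000 through U+10FFFF

# Control codes: U+0000-001F and U+007F-009F.
controls = [cp for cp in CODE_SPACE if cp <= 0x1F or 0x7F <= cp <= 0x9F]

# Values never used for characters: U+FDD0-FDEF, plus the last two
# code points of every plane (low 16 bits equal to 0xFFFE or 0xFFFF).
noncharacters = [
    cp for cp in CODE_SPACE
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE
]

# Every control code carries the Unicode general category "Cc".
all_cc = all(unicodedata.category(chr(cp)) == "Cc" for cp in controls)

print(len(controls), len(noncharacters), all_cc)
```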
Unicode also defines three Private Use Areas (PUA): 6400 code points in the range U+E000-F8FF in the BMP and the entirety of planes 15 and 16, for private-use characters, in which software developers and end-users can define their own code points. In some cases, private-use characters defined within the PUAs are used by a single organization, but they may be adopted more broadly by the Unicode community. Because Unicode is an evolving character set, widely-used private-use characters can be formally adopted as an assigned code point.
There is not a one-to-one correspondence between Unicode characters and glyphs because what a user perceives as a single glyph can be composed of several elemental glyphs, each encoded as a separate Unicode character. What a user perceives as a single character is known as a grapheme, defined by Unicode as “a minimally distinctive unit of writing in the context of a particular writing system.” One of the guiding precepts of Unicode was the ability to combine multiple abstract characters into one grapheme, which will generally be rendered as one glyph image. In general, any base character can be combined with any arbitrary sequence of combining marks into a combining character sequence (CCS). A CCS is formally defined as a base character followed by one or more (1) combining marks; (2) ZWJs; or (3) ZWNJs, which are combined into one glyph image by the rendering engine. Unicode also contains many precomposed forms, in which a single code point encodes a base character (or two, a digraph) with one or more diacritical marks for the most common combinations, such as ǚ (LATIN SMALL LETTER U WITH DIAERESIS AND CARON, U+01DA). It is always possible to represent the same glyph using the code points for the base character and combining marks. For many glyphs, there are two or more sets of code points that are canonically equivalent, that is, even though the constituent code points and/or their sequence differ, they represent the same displayed glyph. For example, there are three canonically equivalent code point sequences for the glyph Å: (1) U+00C5 (LATIN CAPITAL LETTER A WITH RING ABOVE); (2) U+212B (ANGSTROM SIGN); and (3) a combining sequence of two code points: U+0041 (LATIN CAPITAL LETTER A) and U+030A (COMBINING RING ABOVE). The latter sequence is an example of the decomposed form of a glyph, i.e., a CCS that contains the maximal length character sequence for a given glyph.
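The three spellings of Å can be compared in a short Python sketch. It is illustrative only and shows that, although the glyph looks the same, the strings differ at the code point and byte level, which matters wherever a credential is compared or hashed byte for byte.

```python
import unicodedata

precomposed = "\u00C5"        # single precomposed code point
angstrom = "\u212B"           # ANGSTROM SIGN
decomposed = "\u0041\u030A"   # base letter followed by a combining mark

forms = [precomposed, angstrom, decomposed]
for s in forms:
    points = " ".join(f"U+{ord(c):04X}" for c in s)
    print(points, "->", s.encode("utf-8").hex())

# The strings are distinct even though they render as the same glyph.
distinct = len(set(forms)) == 3

# The second code point of the decomposed form is a combining mark:
# its canonical combining class is nonzero.
ring_class = unicodedata.combining("\u030A")
print(distinct, ring_class)
```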
The combining character sequence is one form of grapheme cluster, which is an atomic, user-perceived character (grapheme) comprised of a base character followed by one or more non-spacing marks. Other examples of grapheme clusters are sequences of Hangul Jamo characters to represent Hangul syllables in Korean, and Indic consonant clusters.
The Unicode Standard defines two types of character equivalence: canonical equivalence, and compatibility equivalence. As described above, canonical equivalence is exact equivalence between characters or sequences of characters, whereas compatibility equivalence is weaker and may result in subtle visual differences between the resulting glyphs. Unicode further provides formally-defined normalization forms for both composed and decomposed sequences of code points, and rules to guarantee canonical equivalence between them. Normalization Form D (NFD) performs a canonical decomposition of each user-perceived character in a Unicode string by expanding it into its decomposed character components, and placing any combining marks into a well-defined order. Normalization Form C (NFC) performs a canonical composition by first performing NFD, then processing the fully decomposed and canonically ordered string by searching for pairs of characters that can be replaced by canonically equivalent composite characters, resulting in a fully composed but canonically equivalent string. Unicode also provides normalization forms for compatibility equivalence: Normalization Form KC (NFKC), which performs compatibility decomposition followed by canonical composition, and Normalization Form KD (NFKD), which performs compatibility decomposition. The Unicode Standard includes normalization charts, which define the NFC, NFD, NFKC, and NFKD normalizations.
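The four normalization forms are available in Python through `unicodedata.normalize`. The sketch below is illustrative; the sample strings are chosen here and are not drawn from any particular system.

```python
import unicodedata

# Three canonically equivalent spellings of the same glyph.
forms = ["\u00C5", "\u212B", "\u0041\u030A"]

# Canonical normalization maps all three to a single representation.
nfc = {unicodedata.normalize("NFC", s) for s in forms}
nfd = {unicodedata.normalize("NFD", s) for s in forms}
print(nfc, nfd)

# NFD fully decomposes a precomposed letter carrying two marks.
u_diaeresis_caron = unicodedata.normalize("NFD", "\u01DA")
print([f"U+{ord(c):04X}" for c in u_diaeresis_caron])

# Compatibility equivalence is weaker: the ligature U+FB01 survives
# NFC but is replaced by two plain letters under NFKC and NFKD.
ligature = "\uFB01"
print(unicodedata.normalize("NFC", ligature) == ligature,
      unicodedata.normalize("NFKC", ligature),
      unicodedata.normalize("NFKD", ligature))
```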
Unicode defines three Character Encoding Forms (CEF), which map from the set of integers (code points) encoding characters to code units, which are integers occupying a specified binary width in a computer architecture, such as an 8-bit byte or a 16-bit word. The supported CEFs are Unicode Transformation Format (UTF)-8, -16, and -32. UTF-8 (8-bit) and UTF-16 (16-bit) are variable-length encoding forms because one or more code units are necessary to represent the full range of code points. UTF-8 requires one, two, or three code units to represent code points in the BMP, and four code units to encode code points in the supplementary planes. UTF-16 requires one code unit to encode code points in the BMP, and two code units, called a surrogate pair, to encode code points in the supplementary planes. Surrogate pairs are formed from 2,048 reserved surrogate code points, where the high surrogate H and low surrogate L define the code point according to the following equation: 0x10000+(H−0xD800)×0x400+(L−0xDC00)=code point value. High surrogates range from 0xD800-DBFF and low surrogates range from 0xDC00-DFFF. Surrogate pairs are another form of grapheme cluster. UTF-32 is a fixed-length encoding form with a 32-bit code unit that encodes any code point directly in a single code unit.
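The surrogate equation and the code unit counts can be exercised with a minimal Python sketch. The function names are hypothetical, and the sample characters are chosen for illustration.

```python
def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split a supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary-plane code points use surrogates")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

def from_surrogates(high: int, low: int) -> int:
    """Recombine a surrogate pair using the equation in the text."""
    return 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)

def utf8_units(ch: str) -> int:
    return len(ch.encode("utf-8"))           # one code unit per byte

def utf16_units(ch: str) -> int:
    return len(ch.encode("utf-16-be")) // 2  # two bytes per code unit

high, low = to_surrogates(0x1F600)
print(hex(high), hex(low), hex(from_surrogates(high, low)))
for ch in ("A", "\u00E9", "\u20AC", "\U0001F600"):
    print(f"U+{ord(ch):04X}", utf8_units(ch), utf16_units(ch))
```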
Unicode defines an additional layer, the Character Encoding Scheme (CES), which is a reversible transformation of sequences of code units (one of the UTF-8, UTF-16, or UTF-32 CEFs) to serialized sequences of bytes. There are seven supported character encoding schemes in Unicode: UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE. The BE and LE variants refer to Big Endian and Little Endian, respectively, defining how the processor architecture maps multi-byte integers to memory locations. In Big Endian architectures, the most-significant byte is stored at the lower address, while in Little Endian architectures, the least-significant byte is stored at the lower address. The UTF-16 and UTF-32 character encoding schemes use a byte order mark (BOM) (code point U+FEFF) at the head of the byte stream to unambiguously define the byte order of the code units that follow.
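The effect of byte order on the serialized bytes can be seen in a short Python sketch using the standard codecs. The sample string is chosen for illustration.

```python
import codecs

text = "A\u20AC"  # two BMP characters, one UTF-16 code unit each

big_endian = text.encode("utf-16-be")      # most-significant byte first
little_endian = text.encode("utf-16-le")   # least-significant byte first
print(big_endian.hex(), little_endian.hex())

# The plain "utf-16" codec prepends a byte order mark in the platform's
# native order so that a reader can recover the byte order.
with_bom = text.encode("utf-16")
starts_with_bom = with_bom.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE))

# A decoder that sees the BOM picks the matching byte order.
round_trip = (codecs.BOM_UTF16_BE + big_endian).decode("utf-16")
print(starts_with_bom, round_trip == text)
```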
Conventional knowledge factor security schemes typically store the username or account in plaintext form (i.e., not encrypted) in a database field configured as the primary key for a table of users or accounts. In such systems, the password, composed of 8-bit characters, is typically input to a cryptographic hash function, such as MD5 or SHA-1. The message digest, or hash, output from the hash function is compared to the stored hash in the account database. If they match, the user is authenticated and access is granted. As is known in the art, cryptographic hash functions used for credential-based security systems have several desirable properties: pre-image resistance, second pre-image resistance, and collision resistance. The state of the art in conventional credential-based security is the application of a key-stretching algorithm, such as RSA's Password-Based Key Derivation Function 2 (PBKDF2), published as RFC 2898 by the Internet Engineering Task Force. PBKDF2 has five input parameters: (1) a pseudo-random function taking two input parameters with a fixed output length (e.g., HMAC built on a cryptographic hash function); (2) a password or passphrase; (3) a cryptographically random salt; (4) the number of iterations of the algorithm; and (5) the desired length of the derived key output by the algorithm. The output of the PBKDF2 function is a derived key usable for symmetric encryption, but it can also be considered a message digest and stored for comparison to a generated digest in authenticating a user.
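A conventional PBKDF2-based check of the kind described above might be sketched as follows, using `hashlib.pbkdf2_hmac` from the Python standard library. The choice of HMAC-SHA-256, the iteration count, the salt length, the UTF-8 password encoding, and the function names `derive`, `enroll`, and `verify` are all assumptions made for illustration and are not taken from any particular system.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # assumed value for illustration only
KEY_LENGTH = 32       # desired derived key length in bytes (assumed)

def derive(password: str, salt: bytes) -> bytes:
    """Run PBKDF2 with its five inputs: PRF, password, salt, count, length."""
    return hashlib.pbkdf2_hmac(
        "sha256",                  # hash underlying the HMAC pseudo-random function
        password.encode("utf-8"),  # password bytes (encoding is an assumption)
        salt,
        ITERATIONS,
        KEY_LENGTH,
    )

def enroll(password: str) -> tuple[bytes, bytes]:
    """Create a random salt and the digest to store for the account."""
    salt = os.urandom(16)
    return salt, derive(password, salt)

def verify(password: str, salt: bytes, stored: bytes) -> bool:
    """Recompute the digest and compare it in constant time."""
    return hmac.compare_digest(derive(password, salt), stored)

salt, stored = enroll("correct horse")
print(verify("correct horse", salt, stored), verify("wrong guess", salt, stored))
```

The stored digest doubles as a derived key in this sketch, matching the observation that the PBKDF2 output can serve either purpose.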
There exists a need, not met by the state of the art, or recognized by those of skill in the art, to replace existing knowledge factor security schemes by enhanced security systems incorporating any defined multi-byte characters such as Unicode characters, and in particular, Unicode grapheme clusters and combining character sequences, in knowledge factor credentials, e.g., usernames and passwords. Further, there is a need to address various usability issues that arise when users express credentials in encodings with relatively large character sets. These and/or other shortcomings of traditional techniques are addressed below.