Encryption with International Character Sets

To understand the principles of encryption with international character sets, it is important first to grasp a fundamental principle about encryption and decryption.

Encryption and decryption functions operate on bit strings, not characters

All encryption and decryption operations with modern encryption algorithms like Triple DES, AES, Blowfish, IDEA and so forth operate on a bit string, not on characters.

Two fundamental factors make life difficult. First, almost all encryption functions expect you to pass the data as a contiguous sequence (array) of 8-bit bytes (more accurately known as octets). Second, if you only use the US-ASCII character set on your system, then one character is equivalent to one byte and the distinction between characters and bit strings is further blurred. You probably never even noticed the difference.

In the C programming language, you can usually just treat character strings of type char identically to arrays of bytes of type unsigned char (or a BYTE type in the Windows world). Thus we can do the following and get away with it:-

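#include <string.h>   /* strlen, size_t */

int encryption_func(unsigned char *, size_t);

char str[] = "Hello, world...!";
encryption_func((unsigned char *)str, strlen(str));

Most compilers will let you miss out the cast to unsigned char with nothing worse than a warning, but keeping it makes the intent explicit.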

The important thing is that the encryption function is expecting a bit string represented as an array of bytes. In hexadecimal format we would represent the US-ASCII string above as the 16 bytes:-

48 65 6c 6c 6f 2c 20 77 6f 72 6c 64 2e 2e 2e 21
where the character 'H' is represented as 0x48 or 01001000 in binary, 'e' is 0x65 (01100101), and so on. Rather tediously for us humans, the final bit string to be encrypted is this:-
01001000011001010110110001101100
01101111001011000010000001110111
01101111011100100110110001100100
00101110001011100010111000100001

Change one single bit in that sequence and the encrypted ciphertext will be completely different. If that happens, then, when someone else tries to decrypt it, they will not understand the characters that the result is meant to represent.
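To see the bit string for yourself, a minimal sketch along the following lines may help. The function name bytes_to_bits is our own invention for illustration and is not part of any library.

```c
#include <assert.h>
#include <stddef.h>

/* Write the bits of data[0..len-1], most-significant bit first, as the
   characters '0' and '1' into out. The caller must provide room for
   8*len + 1 chars (the extra one is the terminating NUL).
   bytes_to_bits is a hypothetical helper name used for illustration. */
void bytes_to_bits(const unsigned char *data, size_t len, char *out)
{
    size_t i;
    int bit;

    assert(data != NULL && out != NULL);
    for (i = 0; i < len; i++) {
        for (bit = 7; bit >= 0; bit--)
            *out++ = ((data[i] >> bit) & 1) ? '1' : '0';
    }
    *out = '\0';
}
```

Calling it on the two bytes of "He" should give the 16 characters 0100100001100101, which is the start of the bit string shown above. Flip any one bit in the input (for example, XOR the first byte with 0x01) and the printed bit string differs in exactly one place, but the ciphertext from a real cipher would differ almost everywhere.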

The Full Process of Encryption and Decryption

The full process of encrypting a string of textual characters to ciphertext and then decrypting is as follows:-

  1. Convert the string of textual characters to a bit string, perhaps with padding. This is the 'plaintext' input to the encryption function.
  2. Use the encryption function to encrypt the plaintext bit string to another bit string of ciphertext.
  3. Store the ciphertext bit strings somewhere, perhaps as raw binary data (in a file of bytes) or as a base64-encoded string.
  4. The recipient retrieves the stored message and extracts the ciphertext bit strings - these must be in exactly the same order bit-by-bit as those created in step 2.
  5. Use the decryption function to decrypt the ciphertext bit string to a plaintext bit string
  6. Convert the resulting bit string to a readable string of textual characters.
It is important to appreciate the significance of each of the above steps. Steps 4, 5 and 6 must be the exact reverse of steps 3, 2, and 1 or the recipient will just end up with garbage. Many cryptographic packages do some or all of the above steps for you automatically, which can often make matters even more confusing.

In our "Hello, world...!" example above, the result of step 5 should be a bit string that looks identical to the example shown above. Step 6 involves the implicit knowledge that the plaintext is to be represented as US-ASCII characters and so the recipient would break the bit string into blocks of 8 bits and hope that the resulting bytes come out as valid, readable ASCII characters. The sender and recipient must have agreed beforehand that this is the convention they are both going to use, or that fact must be passed along with the encrypted message.
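The six steps can be sketched in C for a US-ASCII message as follows. This is only an outline under loud assumptions: toy_cipher is a placeholder (XOR with a repeating key) standing in for a real encryption function so that the sketch is self-contained, and it offers no security whatsoever. The names toy_cipher and round_trip are hypothetical, and storage and retrieval (steps 3 and 4) are represented by a simple copy of the bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* PLACEHOLDER ONLY: XOR with a repeating key stands in for a real cipher.
   Applying it twice with the same key restores the input. Not secure. */
void toy_cipher(unsigned char *buf, size_t len,
                const unsigned char *key, size_t keylen)
{
    size_t i;

    assert(buf != NULL && key != NULL && keylen > 0);
    for (i = 0; i < len; i++)
        buf[i] ^= key[i % keylen];
}

/* Run steps 1 to 6 on a US-ASCII string of at most 256 characters.
   The recovered text is written to out (outsize bytes available).
   Returns 1 if the recovered text equals the original, otherwise 0. */
int round_trip(const char *text, char *out, size_t outsize)
{
    static const unsigned char key[] = { 0xfe, 0xdc, 0xba, 0x98 };
    unsigned char work[256];
    unsigned char stored[256];
    size_t len = strlen(text);

    if (len > sizeof(work) || len + 1 > outsize)
        return 0;
    memcpy(work, text, len);                    /* 1. text to bytes       */
    toy_cipher(work, len, key, sizeof(key));    /* 2. "encrypt"           */
    memcpy(stored, work, len);                  /* 3-4. store, retrieve   */
    toy_cipher(stored, len, key, sizeof(key));  /* 5. "decrypt"           */
    memcpy(out, stored, len);                   /* 6. bytes back to text  */
    out[len] = '\0';
    return strcmp(out, text) == 0;
}
```

The point of the sketch is the symmetry: whatever is done in steps 1 to 3 has to be undone, in reverse order, in steps 4 to 6. With a real cipher, step 1 would also have to deal with padding and step 6 with removing it.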

Strings vs Bytes in Visual Basic

Visual Basic uses different types to store character data (String) and binary data (Byte) but it is easy to get confused, especially as Strings are so much easier to handle than arrays of Bytes. For some hints on the difference between Strings and Byte arrays and how to handle them see Using Byte Arrays in Visual Basic.

All operations on 'binary' data should be carried out using the Visual Basic Byte type, and all textual data (original text input, password and base64-encoded data) should be stored in the VB String type. It is important to make sure you differentiate between these two types of data - broadly speaking 'text' and 'binary' - when doing cryptographic operations in Visual Basic.

'Text' in the particular sense we use here consists of readable, printable characters we expect to see on our computer screen or in a book. It might consist of simple US-ASCII/ANSI characters or it could be Unicode or DBCS oriental character strings. 'Binary' data is a string of bits that we conventionally store as bytes or octets.

Binary data must be exactly the same literally bit-for-bit in all systems. Change one bit and the results of any cryptographic operation on it will be completely different. The way to ensure our binary data is always the same is to use the Byte type to store binary data, not the String type.

Text data may be stored differently depending on the particular system we are using. On an ANSI system, each character is stored in one byte. On a Unicode system each character is stored in two bytes; and on a DBCS system, a character may be stored in one or two bytes. It is important to make sure we always convert our 'text' data into exactly the same 'binary' form before we do any encryption.

The input to an encryption process must be 'binary' data. We need to convert the text we want to encrypt into 'binary' format first and then encrypt it. The results of encryption are always binary. Do not attempt to treat raw ciphertext as 'text' or put it directly into a String type. Store ciphertext either as a raw binary file or convert it to base64 or hexadecimal format. You can safely put data in base64 or hexadecimal format in a String.

When you decrypt, always start with binary data, decrypt to binary data, and then and only then, convert back to text, if that is what you are expecting. You can devise your own checks to make sure the decrypted ciphertext is what you expect before you do the final conversion.

On a US-English system set up for ANSI characters, you can probably get away with using a String type to carry out 'binary' operations. However you will encounter problems on a system set up for Chinese/Japanese/Korean/Arabic characters.

Unicode and DBCS

Windows handles its international character sets in two ways: DBCS or Unicode.

In Unicode, every character on the system is stored as a 16-bit value*, that is, always using 2 bytes for each character. The values of these characters are defined uniquely in the Unicode standard (see Unicode Code charts). The first 128 characters 0x0000 to 0x007F are the same as the ANSI character set 0x00 to 0x7F (decimal 0 to 127); see C0 Controls and Basic Latin Range: 0000-007F.

Unicode characters are represented using the codepoint U+xxxx where xxxx is the value expressed in hexadecimal format. For example, U+0041 is LATIN CAPITAL LETTER A and U+3053 is HIRAGANA LETTER KO.

* Actually, Unicode can use more than 16 bits. In the Windows world, Visual Basic, COM, and Windows NT/2000/XP use what is strictly known as UCS-2 or UTF-16 Unicode as their native string type. UCS-2 always uses 16 bits or two bytes for each character. UTF-16 uses the same 16-bit values for those characters and represents anything bigger as a pair of 16-bit values (a surrogate pair). Only really obscure characters require anything bigger - don't worry about them for practical purposes for now.

DBCS stands for Double-Byte Character Set and in the Windows world is defined as

A character set that uses 1 or 2 bytes to represent a character, allowing more than 256 characters to be represented
The DBCS system in Windows works by using single bytes for the 'usual' ANSI characters but then extends this by using special leading bytes that signify that the following byte represents a certain extended character in another character set. Thus you can only parse a DBCS character string by reading each byte in order - if you start part-way through you have no way of knowing whether there was a preceding leading byte that changes the meaning of your next byte.
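As an illustration of why a DBCS string has to be read from the start, here is a minimal sketch that counts the characters in a Shift-JIS (Windows code page 932) byte string. We assume the lead-byte ranges 0x81-0x9F and 0xE0-0xFC as commonly documented for that code page; on Windows itself you would more likely ask the system using the IsDBCSLeadByteEx function. The function names below are our own.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed lead-byte ranges for code page 932 (Shift-JIS).
   Other DBCS code pages use different ranges. */
int is_sjis_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Count the characters in the len bytes at s, scanning from the first
   byte. Returns (size_t)-1 if the data ends straight after a lead byte.
   The trailing byte is skipped without being validated. */
size_t sjis_char_count(const unsigned char *s, size_t len)
{
    size_t i = 0, count = 0;

    assert(s != NULL);
    while (i < len) {
        if (is_sjis_lead_byte(s[i])) {
            if (i + 1 >= len)
                return (size_t)-1;   /* truncated two-byte character */
            i += 2;                  /* lead byte plus trailing byte */
        } else {
            i += 1;                  /* ordinary single-byte character */
        }
        count++;
    }
    return count;
}
```

For example, the three bytes 41 82 B1 count as two characters: the single byte 'A' followed by one two-byte character. Note that the scan only gives the right answer if s really does point at the start of a character.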

The Windows definition of DBCS should not be confused with the IBM definition of DBCS that always uses 16 bits to represent a character. Strictly the Windows set should be called a Multi-Byte Character Set (MBCS).

UTF-8 (Unicode Transformation Format-8) is an encoding of Unicode that uses from one to four bytes for each character (earlier definitions allowed up to six). It has the big advantage that a plain ASCII string is also a valid UTF-8 string and it doesn't break existing C programs that expect ASCII values. UTF-8 is described in detail in RFC 3629. Like UCS-2 and UTF-16, UTF-8 is a way of encoding "Unicode" characters, but the bytes it produces are different from those of the UCS-2, UTF-16 and DBCS systems described above. It's probably the most versatile and useful format going around. Your text editor may or may not support it. Your operating system may or may not be able to cope with UTF-8 data. UTF-8 strings can be fairly reliably recognized as such by a simple algorithm.
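One possible form of such a recognition algorithm is sketched below. It checks that every byte sequence is well-formed according to the rules in RFC 3629 (correct lead byte, correct number of continuation bytes, no overlong forms or surrogates). The function name looks_like_utf8 is our own, and a result of 1 only means the bytes are valid UTF-8, not that the author intended them as such: any pure ASCII string passes.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if the len bytes at s form well-formed UTF-8, otherwise 0.
   looks_like_utf8 is a hypothetical name used for illustration. */
int looks_like_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL);
    while (i < len) {
        unsigned char b = s[i];
        unsigned char lo = 0x80, hi = 0xBF;  /* range allowed for 2nd byte */
        size_t extra, k;

        if (b <= 0x7F) {                     /* plain ASCII */
            i++;
            continue;
        } else if (b >= 0xC2 && b <= 0xDF) { /* two-byte sequence */
            extra = 1;
        } else if (b >= 0xE0 && b <= 0xEF) { /* three-byte sequence */
            extra = 2;
            if (b == 0xE0) lo = 0xA0;        /* reject overlong forms */
            if (b == 0xED) hi = 0x9F;        /* reject surrogates */
        } else if (b >= 0xF0 && b <= 0xF4) { /* four-byte sequence */
            extra = 3;
            if (b == 0xF0) lo = 0x90;        /* reject overlong forms */
            if (b == 0xF4) hi = 0x8F;        /* reject values > U+10FFFF */
        } else {
            return 0;                        /* 80-C1 and F5-FF never lead */
        }

        if (len - i <= extra)
            return 0;                        /* sequence is cut short */
        if (s[i + 1] < lo || s[i + 1] > hi)
            return 0;
        for (k = 2; k <= extra; k++) {
            if (s[i + k] < 0x80 || s[i + k] > 0xBF)
                return 0;
        }
        i += extra + 1;
    }
    return 1;
}
```

Text in a DBCS encoding such as Shift-JIS will usually fail this check quickly, which is what makes the test useful as a first guess at how some unknown bytes are encoded.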

ANSI vs Unicode

The first 128 characters of the Unicode character set are the original US-ASCII characters that western computer programmers have come to know and love with values 0x00 to 0x7F. The letter capital 'A' is represented in ANSI as the 8-bit value 0x41 and in Unicode as the 16-bit value 0x0041, the letter capital 'H' is represented in ANSI as 0x48 and in Unicode as 0x0048.

In the C programming language, you use the wchar_t type for Unicode characters. Here is some code that shows us the 'raw' bytes in plain-old-char strings and wchar_t strings.

#include <stdio.h>
#include <string.h>
#include <wchar.h>

int main(void)
{
	const char *str = "Hello, world...!";
	const wchar_t *wchello = L"Hello, world...!";
	const unsigned char *ptr;
	size_t len, i;

	/* Display single-byte chars */
	printf("%s\n", str);
	len = strlen(str);
	ptr = (const unsigned char *)str;
	for (i = 0; i < len; i++)
		printf("%02x ", ptr[i]);
	printf("\n");

	/* Display (machine-dependent-size-and-order) wide chars.
	   %ls is the standard C conversion for a wchar_t string; using
	   printf throughout avoids mixing byte and wide output on stdout. */
	printf("%ls\n", wchello);
	len = wcslen(wchello) * sizeof(wchar_t);
	ptr = (const unsigned char *)wchello;
	for (i = 0; i < len; i++)
		printf("%02x ", ptr[i]);
	printf("\n");
	return 0;
}
This produces the following on our W2000 system:
Hello, world...!
48 65 6c 6c 6f 2c 20 77 6f 72 6c 64 2e 2e 2e 21
Hello, world...!
48 00 65 00 6c 00 6c 00 6f 00 2c 00 20 00 77 00 
6f 00 72 00 6c 00 64 00 2e 00 2e 00 2e 00 21 00

If you run this program on a little-endian computer like we did, the two bytes of each Unicode character are actually stored in reverse order, so the Unicode character 0x0048 is converted to individual bytes with its least-significant byte 0x48 first and the most-significant byte 0x00 second. That's why the result of the conversion in the second example above gave the bytes in the order 48, 00, 65, 00, etc. On a system where wchar_t is larger than 16 bits, you will see more than two bytes for each character.

Confused? You should be. It gets worse...

'Hello' in Japanese

In Hiragana Japanese characters, the word 'Hello' is

こんにちは

These are the five characters KO-N-NI-TI-HA. These will only display on your browser if it supports internationalised characters. If you just see 5 little box characters, your browser doesn't support it. See hello in Hiragana for a full graphics version.

If your Japanese system is set up for DBCS characters, these five characters will be stored in Visual Basic as the 10 bytes:

82 B1 82 F1 82 C9 82 BF 82 CD
This uses the Shift-JIS system. The lead byte 0x82 indicates that the character is from the Hiragana set and can be found in the Microsoft Windows codepage 932 with lead byte 82. See Microsoft Windows Codepage:932(Japanese Shift-JIS) and Lead Byte=0x82.

In Unicode, however, these five characters will be represented by the five 16-bit Unicode values U+3053, U+3093, U+306B, U+3061, U+306F. A straight conversion to a byte array on a little-endian machine will result in the 10 bytes:

53 30 93 30 6B 30 61 30 6F 30

If you want to encrypt your Japanese characters, you need to convert your String to a Byte array first and then use an encryption function that expects an array of Bytes. After decryption you need to carry out exactly the same conversion back to a String that your system will display correctly.
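One way to get a byte array that does not depend on the machine is to agree on UTF-8 as the binary form of the text. The sketch below converts 16-bit Unicode values to UTF-8 bytes. It is only an illustration under stated assumptions: the name ucs2_to_utf8 is our own, and it deliberately refuses surrogate values, so it only copes with characters that fit in a single 16-bit value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode n 16-bit Unicode values as UTF-8 into out (outsize bytes).
   Returns the number of bytes written, or 0 on failure (buffer too
   small, or a surrogate value, which this sketch does not handle).
   An empty input also gives 0. ucs2_to_utf8 is a hypothetical name. */
size_t ucs2_to_utf8(const uint16_t *in, size_t n,
                    unsigned char *out, size_t outsize)
{
    size_t i, o = 0;

    assert(in != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        uint16_t c = in[i];

        if (c >= 0xD800 && c <= 0xDFFF)
            return 0;                       /* surrogates not supported */
        if (c < 0x80) {                     /* one byte: 0xxxxxxx */
            if (o + 1 > outsize) return 0;
            out[o++] = (unsigned char)c;
        } else if (c < 0x800) {             /* two bytes: 110xxxxx 10xxxxxx */
            if (o + 2 > outsize) return 0;
            out[o++] = (unsigned char)(0xC0 | (c >> 6));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        } else {                            /* three bytes: 1110xxxx 10xxxxxx 10xxxxxx */
            if (o + 3 > outsize) return 0;
            out[o++] = (unsigned char)(0xE0 | (c >> 12));
            out[o++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    return o;
}
```

Because the conversion works on the numeric values and not on the bytes in memory, it gives the same output on big-endian and little-endian machines. For the five Hiragana characters above it produces the 15 bytes E3 81 93 E3 82 93 E3 81 AB E3 81 A1 E3 81 AF. The recipient must of course know that UTF-8 was used and do the reverse conversion after decrypting.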

But how can I encrypt on one system and decrypt on one that uses a different character set?

You can't; well, not directly. Think about it simply in terms of passing straightforward binary data between two systems. We have some text (that we save as a bit string) that means something useful on a system that understands, say, Arabic and then we copy the bit string to another system where we use, say, Chinese. Would we expect to find something the Chinese system will understand? Probably not. A bit string that only means something in Arabic will not mean anything useful when we try to read it using a Chinese system. Then why expect a bit string to be understandable across different systems just because we encrypted it first on one system and then decrypted it on another? Don't blame the encryption system when there is something more fundamental at fault.

Unfortunately this holds even where we expect our systems to be using the same language but they use different character sets, unless we are very careful to make sure we encode the data in a format that makes sense to both systems. The simple word 'hello' in Hiragana can be stored either as the array of bytes (82 B1 82 F1 82 C9 82 BF 82 CD) or as (53 30 93 30 6B 30 61 30 6F 30) depending on whether our system is configured to use DBCS or Unicode. If you have a big-endian computer, then the 16-bit Unicode values would be stored differently as (30 53 30 93 30 6B 30 61 30 6F). Do not expect the Unicode system to make any useful sense of the bit string 0x82B182F182C982BF82CD or the DBCS system to understand 0x533093306B3061306F30.

If you are having problems passing encrypted data between different systems, try this simple test. Save your unencrypted plaintext string as a binary file on the first system, copy the file to the other system, then see if you can reproduce the original string when you read it in from the binary file. If this does not work, you cannot expect the data to be transportable when it has to be encrypted on one system and decrypted on the other without some sort of normalisation first.

Example code

On the first system do this:
    Dim strData as String
    Dim hFile As Integer
    
    strData = "Hello, world!"
    Debug.Print strData
    hFile = FreeFile
    Open "MyFile.dat" For Binary Access Write As #hFile
    Put #hFile, , strData
    Close #hFile
Copy the file MyFile.dat to the other system and run this:
    Dim strData As String
    Dim hFile As Integer
 
    hFile = FreeFile
    Open "MyFile.dat" For Binary Access Read As #hFile
    strData = Input(LOF(hFile), #hFile)
    Close #hFile
    Debug.Print strData
Did you get the same result on both systems? If so, you should be able to encrypt the string on the first system, transfer the encrypted data to the other, decrypt the ciphertext on the new system, and make good use of the results. If the result of simply transferring a file from one system to the other did not result in anything useful, don't bother trying to send encrypted data. You must be able to normalise your plain data on both systems so it can be read on both of them.

Transmitting data between systems

When you pass ciphertext data between systems, we strongly recommend that you do not pass the data as raw 'binary' data. Binary data can easily be corrupted in transit, and different systems may interpret some of the bytes in the binary data as control characters with unintended results. This is before we deal with the endian-ness of the two systems.

You should always encode the ciphertext data using, say, base64 or hexadecimal encoding. The resulting text can be easily transferred without loss of integrity and can then be decoded back into binary on the destination system before decryption.

We usually use hexadecimal encoding for short messages because it is easy to see the results and count the bytes. It is virtually impossible to decode base64 data in your head, even though it's shorter, and this is a major disadvantage when debugging. Just make sure that both parties agree to do this.
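Hexadecimal encoding and decoding take only a few lines of C. The following is a minimal sketch, not a definitive implementation; the names hex_encode and hex_decode are our own.

```c
#include <assert.h>
#include <stddef.h>

/* Encode len bytes as upper-case hex digits. out must have room for
   2*len + 1 chars (the extra one is the terminating NUL). */
void hex_encode(const unsigned char *data, size_t len, char *out)
{
    static const char digits[] = "0123456789ABCDEF";
    size_t i;

    assert(data != NULL && out != NULL);
    for (i = 0; i < len; i++) {
        *out++ = digits[data[i] >> 4];
        *out++ = digits[data[i] & 0x0F];
    }
    *out = '\0';
}

/* Value of one hex digit (either case), or -1 if c is not a hex digit. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Decode a NUL-terminated hex string into out (outsize bytes available).
   Returns the number of bytes written, or (size_t)-1 if the string has
   an odd length, contains a non-hex character, or does not fit. */
size_t hex_decode(const char *hex, unsigned char *out, size_t outsize)
{
    size_t n = 0;

    assert(hex != NULL && out != NULL);
    while (hex[0] != '\0') {
        int hi = hex_value(hex[0]);
        int lo = (hi < 0) ? -1 : hex_value(hex[1]);

        if (hi < 0 || lo < 0 || n >= outsize)
            return (size_t)-1;
        out[n++] = (unsigned char)((hi << 4) | lo);
        hex += 2;
    }
    return n;
}
```

The decoder is strict on purpose: a single stray space or missing digit means the ciphertext has been damaged, and it is better to find that out before attempting to decrypt.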

How do I handle byte ordering?

Systems based on IBM and Motorola processors store multi-byte data in what is known as big-endian order. The 16-bit Unicode character 0x0048 is stored internally as 00 48. Intel and MIPS processors store data "backwards" in little-endian order where 0x0048 would be stored as 48 00. If we use a high-level language this internal ordering should not matter. But when we want to do encryption on multi-byte strings by treating our data as a bit string where order does matter, we run into severe problems.

The terms big-endian and little-endian come from Jonathan Swift's satire Gulliver's Travels written in 1726 in which the Big Endians were a political faction that broke their eggs at the large end and rebelled against the Lilliputian King who required his subjects (the Little Endians) to break their eggs at the small end.

If, back in the 21st century, we store the wide-character version of the string "Hello..!" on a little-endian computer such as an Intel or MIPS processor, we will have the bytes

48 00 65 00 6c 00 6c 00 6f 00 2e 00 2e 00 21 00 
If we encrypt this with DES in ECB mode using the key 0xfedcba9876543210, we will get the ciphertext
8CE18992E3558713496097A3A385A5DC
Now if we do exactly the same on a big-endian computer such as a Motorola processor, we start with the bytes
00 48 00 65 00 6c 00 6c 00 6f 00 2e 00 2e 00 21 
and performing exactly the same encryption, we get
BFB8CF02A0E0D0112D9B8658A2B2E963
Notice any similarity between the two ciphertexts? Mmm... well, the 8th, 15th and 25th nibbles happen to be the same. Create the ciphertext on your Intel processor, send it to your mate who tries to decrypt on a big-endian Motorola, and he won't get anything useful. Even worse, do all your testing using the same type of computers and you'll never even notice until your major client rings up with a problem.
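One way to avoid the mismatch is never to encrypt the in-memory form of a wide-character string at all, but to write the 16-bit values out as bytes in an agreed order first. The sketch below uses shifts to produce big-endian bytes whatever the processor; the function name units_to_be_bytes is our own, and choosing big-endian rather than little-endian is just an assumption both parties would have to agree on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write n 16-bit values to out as 2*n bytes, most-significant byte
   first. The shifts act on the numeric value, so the result is the
   same on big-endian and little-endian machines.
   units_to_be_bytes is a hypothetical name used for illustration. */
void units_to_be_bytes(const uint16_t *in, size_t n, unsigned char *out)
{
    size_t i;

    assert(in != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        out[2 * i]     = (unsigned char)(in[i] >> 8);
        out[2 * i + 1] = (unsigned char)(in[i] & 0xFF);
    }
}
```

Encrypt the resulting byte array and both machines will produce identical ciphertext from the same text. The recipient reverses the process after decryption by combining each pair of bytes as (first << 8) | second.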

If you don't know beforehand what byte order you are going to have to deal with, try prefixing a byte-order mark to your Unicode text data before you encrypt. Prefix your plaintext Unicode data with the byte-order mark character 0xFEFF. This is the convention for Unicode text files. If your application finds the bytes FF FE in that order at the beginning of the decrypted plaintext, then it knows that the following data is little-endian Unicode text. If it finds FE FF as the first bytes, then the data is big-endian Unicode and should be dealt with accordingly. Make sure you agree to this convention beforehand and write your decryption code to look for and remove it.
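The check on the decrypted plaintext might look something like this. It is a sketch only: the name utf16_bytes_to_units is our own, and we have chosen to treat data without a byte-order mark as an error, where another application might prefer to fall back to an agreed default.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert decrypted bytes that start with a byte-order mark into 16-bit
   values in the host's own representation, removing the mark.
   Returns the number of values written to out (outcount available), or
   (size_t)-1 if there is no mark, the length is odd, or out is too small.
   utf16_bytes_to_units is a hypothetical name used for illustration. */
size_t utf16_bytes_to_units(const unsigned char *in, size_t len,
                            uint16_t *out, size_t outcount)
{
    int little;
    size_t i, n = 0;

    assert(in != NULL && out != NULL);
    if (len < 2 || (len % 2) != 0)
        return (size_t)-1;
    if (in[0] == 0xFF && in[1] == 0xFE)
        little = 1;                     /* FF FE: little-endian data */
    else if (in[0] == 0xFE && in[1] == 0xFF)
        little = 0;                     /* FE FF: big-endian data */
    else
        return (size_t)-1;              /* no byte-order mark found */

    for (i = 2; i < len; i += 2) {
        if (n >= outcount)
            return (size_t)-1;
        out[n++] = little ? (uint16_t)(in[i] | (in[i + 1] << 8))
                          : (uint16_t)((in[i] << 8) | in[i + 1]);
    }
    return n;
}
```

Given either FF FE 48 00 65 00 or FE FF 00 48 00 65, the function returns the same two values 0x0048 and 0x0065, so the rest of the program no longer needs to care which kind of machine created the message.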

See Microsoft's page on the Byte-order Mark for more details.

More on Multiple-Byte Characters and Character Encoding

We have deliberately skipped over a lot of complicated concepts in this document, such as strict definitions of character repertoire, character code, character encoding, code point, glyph, and font.

For a far better explanation of ASCII and Unicode characters, see The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets by Joel Spolsky. If you want a more detailed explanation, see Jukka Korpela's A tutorial on character code issues. C programmers wanting to convert between MBCS and Unicode should see Angel Ortega's Unicode, UTF-8 and all that. Another excellent guide is James Brown's Introduction to Unicode on his Catch22 Win32 web site.

Ken Lunde of Adobe Systems, Inc. has prepared an excellent document Perl & Multiple-byte Characters which looks at CJKV character sets and the many different ways they can be encoded. He shows how to convert between the different encoding sets and shows how they can be normalised. The examples are in Perl and make good use of regular expressions - try writing the equivalent code in C or Visual Basic! That document and others can also be downloaded from ftp://ftp.ora.com/pub/examples/nutshell/ujip/perl/perl97.pdf. You can download the first chapter of Ken's book CJKV Information Processing.

A final word

This page is meant as an introduction to some of the issues involved in carrying out encryption on international character sets. It is not meant to be an exhaustive reference. We hope it shows some of the difficulties and gives you some hints on how to solve them. If we've made a mistake, please tell us. If you can suggest a better way of doing something, please let us know. We are always keen to carry out some tests with our existing software on systems set up for CJK character sets. If you are interested in working with us to do such tests and you have a system set up for CJK characters, please contact us.

This document last updated 2 September 2008.
