Previous | Next | Trail Map | Internationalization | Converting Non-Unicode Text

Byte Encodings and Strings

If your non-Unicode text is stored in a byte array, you can convert it to Unicode with one of the String constructors that accepts an encoding. Conversely, you can convert a String object into a byte array of non-Unicode characters with the String.getBytes method. When invoking either the constructor or getBytes, you specify the encoding identifier as one of the parameters.

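In the example that follows, we'll convert characters between UTF-8 and Unicode. UTF-8 is a variable-width encoding that represents each Unicode character as one to four 8-bit bytes; ASCII characters occupy a single byte. In the code, the encoding is named by the identifier "UTF8", which Java accepts for UTF-8. The source code for the example is in the file named StringConverter.java.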
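As a rough illustration of how UTF-8 spreads one character across more than one byte, the sketch below encodes a single character by hand. It handles only the two-byte range (U+0080 through U+07FF), which covers the accented letters used in this lesson. The class and method names are hypothetical and are not part of StringConverter.java.

```java
final class Utf8TwoByteSketch {

    // Encodes a code point in the range U+0080..U+07FF as two UTF-8 bytes.
    // Bit layout: 110xxxxx 10xxxxxx, where the x bits hold the code point.
    static byte[] encodeTwoByte(int codePoint) {
        if (codePoint < 0x80 || codePoint > 0x7FF) {
            throw new IllegalArgumentException("not in the two-byte range");
        }
        byte lead = (byte) (0xC0 | (codePoint >> 6));    // upper 5 bits
        byte trail = (byte) (0x80 | (codePoint & 0x3F)); // lower 6 bits
        return new byte[] { lead, trail };
    }

    public static void main(String[] args) {
        byte[] encoded = encodeTwoByte(0x00EA); // the character ê
        // prints "0xc3 0xaa"
        System.out.printf("0x%02x 0x%02x%n", encoded[0] & 0xFF, encoded[1] & 0xFF);
    }
}
```

In practice String.getBytes performs this conversion; the hand-written version only shows where the two bytes come from.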

First, we create a String containing Unicode characters:

String original = "A" + "\u00ea" + "\u00f1"
                      + "\u00fc" + "C";

When printed, the String named original appears as:

AêñüC

To convert the String object to UTF-8, we invoke the getBytes method and specify the appropriate encoding identifier as a parameter. The getBytes method returns an array of bytes in UTF-8 format. To create a String object from an array of non-Unicode bytes, we invoke the String constructor with the encoding parameter. The code that makes these calls is enclosed in a try block, in case the encoding we've specified is unsupported:

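// Requires: import java.io.UnsupportedEncodingException;
try {
   // Encode the String as UTF-8 and in the platform's default encoding
   byte[] utf8Bytes = original.getBytes("UTF8");
   byte[] defaultBytes = original.getBytes();

   // Decode the UTF-8 bytes back into a String
   String roundTrip = new String(utf8Bytes, "UTF8");
   System.out.println("roundTrip = " + roundTrip);

   System.out.println();
   printBytes(utf8Bytes, "utf8Bytes");
   System.out.println();
   printBytes(defaultBytes, "defaultBytes");
} catch (UnsupportedEncodingException e) {
   // Thrown if the named encoding is not supported
   e.printStackTrace();
}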
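As an aside, on Java 7 and later the same round trip can be written with a Charset constant instead of an encoding name. The getBytes and String constructor overloads that take a Charset do not throw UnsupportedEncodingException, so no try block is needed. This is a minimal sketch, not the code in StringConverter.java; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class CharsetConstantSketch {

    // Encodes the text as UTF-8 and decodes it again.
    static String roundTrip(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "A\u00ea\u00f1\u00fcC";
        // prints "true": the decoded text equals the original
        System.out.println(original.equals(roundTrip(original)));
    }
}
```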

We print out the values in the utf8Bytes and defaultBytes arrays to demonstrate an important point: The length of the converted text may not be the same as the length of the source text. In UTF-8, some Unicode characters translate into single bytes and others into two or more bytes; each accented character in this example becomes a pair of bytes. Our routine for displaying the byte arrays is:

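// Prints each byte of the array in hexadecimal, one per line.
// UnicodeFormatter is a helper class that is not shown in this section.
public static void printBytes(byte[] array, String name) {
   for (int k = 0; k < array.length; k++) {
      System.out.println(name + "[" + k + "] = " + "0x" +
         UnicodeFormatter.byteToHex(array[k]));
   }
}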
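The UnicodeFormatter class is not listed in this section. If it is not at hand, a stand-in for byteToHex might look like the sketch below. This is an assumption about its behavior based on the output format shown in this lesson (two lowercase hex digits per byte), not the original helper; the class name is hypothetical.

```java
final class ByteToHexSketch {

    private static final char[] HEX = "0123456789abcdef".toCharArray();

    // Returns the two-digit lowercase hexadecimal form of one byte.
    static String byteToHex(byte b) {
        char high = HEX[(b >> 4) & 0x0F]; // upper four bits
        char low = HEX[b & 0x0F];         // lower four bits
        return new String(new char[] { high, low });
    }

    public static void main(String[] args) {
        // prints "0xc3"
        System.out.println("0x" + byteToHex((byte) 0xC3));
    }
}
```

Masking with 0x0F after the shift matters because Java bytes are signed; without it, values of 0x80 and above would index outside the digit table.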
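
The output of the printBytes method follows. Note that only the first and last bytes, the "A" and "C" characters, are the same in both arrays, because ASCII characters are encoded as the same single byte in both encodings. The defaultBytes values depend on the platform's default encoding: the listing here reflects a default such as ISO-8859-1 (Latin-1), and on a platform whose default is UTF-8 the two arrays would be identical.
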
utf8Bytes[0] = 0x41
utf8Bytes[1] = 0xc3
utf8Bytes[2] = 0xaa
utf8Bytes[3] = 0xc3
utf8Bytes[4] = 0xb1
utf8Bytes[5] = 0xc3
utf8Bytes[6] = 0xbc
utf8Bytes[7] = 0x43

defaultBytes[0] = 0x41
defaultBytes[1] = 0xea
defaultBytes[2] = 0xf1
defaultBytes[3] = 0xfc
defaultBytes[4] = 0x43
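
The length difference visible in the two listings can also be checked directly. The sketch below assumes that the defaultBytes listing came from a platform whose default encoding is ISO-8859-1 or similar, and names that encoding explicitly so the result does not depend on the machine. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class EncodedLengthSketch {

    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the text occupies when encoded as ISO-8859-1.
    static int latin1Length(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    public static void main(String[] args) {
        String original = "A\u00ea\u00f1\u00fcC";
        System.out.println("chars  = " + original.length());      // 5
        System.out.println("utf8   = " + utf8Length(original));   // 8
        System.out.println("latin1 = " + latin1Length(original)); // 5
    }
}
```

The three accented characters account for the difference: each takes two bytes in UTF-8 but one byte in ISO-8859-1.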

