Converting Non-Unicode Text
If your non-Unicode text is stored in a byte array, you can convert it to Unicode with one of the String constructor methods. Conversely, you can convert a String object into a byte array of non-Unicode characters with the String.getBytes method. When invoking each of these methods, you specify the encoding identifier as one of the parameters.

In the example that follows, we'll convert characters between UTF8 and Unicode. UTF8 is a variable-width encoding that represents each Unicode character as a sequence of one or more 8-bit bytes, so that ASCII characters occupy a single byte each. The source code for the example is in the file named StringConverter.java.
First, we create a String containing Unicode characters:

    String original = "A" + "\u00ea" + "\u00f1" + "\u00fc" + "C";

When printed, the String named original appears as:

    AêñüC

To convert the String object to UTF8, we invoke the getBytes method and specify the appropriate encoding identifier as a parameter. The getBytes method returns an array of bytes in UTF8 format. To create a String object from an array of non-Unicode bytes, we invoke the String constructor with the encoding parameter. The code that makes these calls is enclosed in a try block, in case the encoding we've specified is unsupported:

    try {
        // Encode with an explicit encoding, then with the platform default.
        byte[] utf8Bytes = original.getBytes("UTF8");
        byte[] defaultBytes = original.getBytes();

        // Decode the UTF8 bytes back into a String.
        String roundTrip = new String(utf8Bytes, "UTF8");
        System.out.println("roundTrip = " + roundTrip);
        System.out.println();
        printBytes(utf8Bytes, "utf8Bytes");
        System.out.println();
        printBytes(defaultBytes, "defaultBytes");
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }

We print out the values in the utf8Bytes and defaultBytes arrays to demonstrate an important point: The length of the converted text may not be the same as the length of the source text. Some Unicode characters translate into single bytes, and others into pairs or triplets of bytes. Our routine for displaying the byte arrays is:

    public static void printBytes(byte[] array, String name) {
        for (int k = 0; k < array.length; k++) {
            System.out.println(name + "[" + k + "] = " + "0x" +
                UnicodeFormatter.byteToHex(array[k]));
        }
    }

The output of the two printBytes calls follows. Note that only the first and last bytes, the "A" and "C" characters, are the same in both arrays. The defaultBytes values depend on the platform's default encoding; the ones shown here are what a single-byte Latin-1 style default produces.

    utf8Bytes[0] = 0x41
    utf8Bytes[1] = 0xc3
    utf8Bytes[2] = 0xaa
    utf8Bytes[3] = 0xc3
    utf8Bytes[4] = 0xb1
    utf8Bytes[5] = 0xc3
    utf8Bytes[6] = 0xbc
    utf8Bytes[7] = 0x43

    defaultBytes[0] = 0x41
    defaultBytes[1] = 0xea
    defaultBytes[2] = 0xf1
    defaultBytes[3] = 0xfc
    defaultBytes[4] = 0x43
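
The length difference discussed above can also be checked with a small self-contained program. The following is only a sketch, not part of StringConverter.java: the class name EncodingSketch and the toHex helper are hypothetical names introduced for illustration, and it assumes a Java version that provides java.nio.charset.StandardCharsets, which lets us pass a Charset object instead of an encoding name and so avoid the UnsupportedEncodingException handling. It appends the euro sign (U+20AC) to the example text to show a character that needs three bytes in UTF8, and it decodes a byte array of ISO-8859-1 text with the String constructor.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical class name; not part of the tutorial's source files.
public class EncodingSketch {

    // Formats each byte as two lowercase hex digits, separated by spaces.
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask with 0xff so that negative byte values print as 80..ff.
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The tutorial's text plus the euro sign, which needs three UTF8 bytes.
        String text = "A\u00ea\u00f1\u00fcC\u20ac";

        // Encode to UTF8 using a Charset constant instead of the name "UTF8".
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        System.out.println("characters = " + text.length());
        System.out.println("utf8 bytes = " + utf8.length);
        System.out.println("utf8 hex   = " + toHex(utf8));

        // Decode a byte array of non-Unicode (ISO-8859-1) text into a String.
        byte[] latin1 = {0x41, (byte) 0xea, (byte) 0xf1, (byte) 0xfc, 0x43};
        String decoded = new String(latin1, StandardCharsets.ISO_8859_1);
        System.out.println("decoded matches = "
            + decoded.equals("A\u00ea\u00f1\u00fcC"));
    }
}
```

Here the six characters become eleven UTF8 bytes: one each for "A" and "C", two each for the three accented letters, and three for the euro sign. Passing an explicit Charset for both directions also keeps the result independent of the platform's default encoding.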