
Converting Non-Unicode Text

In the Java programming language, char values represent Unicode text as 16-bit UTF-16 code units. Unicode is a character encoding standard that supports the world's major languages; characters outside its original 16-bit range are represented by a pair of char values. You can learn more about the Unicode standard at the Unicode Consortium web site.

Few text editors currently support Unicode text entry. The text editor we used to write this lesson's code examples supports only ASCII characters, which are limited to 7 bits. To indicate Unicode characters that cannot be represented in ASCII, such as "ö", we used the '\udddd' escape sequence. Each "d" in the escape sequence is a hexadecimal digit. The following example shows how to indicate the "ö" character with an escape sequence:

String str = "\u00F6";
char c = '\u00F6';
Character letter = Character.valueOf('\u00F6');

ASCII characters do not need a Unicode escape sequence. When reading text in the platform's default encoding, the Java runtime environment automatically converts the characters into Unicode; plain ASCII text is handled correctly under most default encodings. However, if you want to convert text from other encodings into Unicode, you must perform the conversions yourself.
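
To make the escape sequence concrete, here is a minimal sketch that checks what the compiler produces from '\u00F6'. The class name UnicodeEscapeDemo and the helper toEscape are hypothetical names chosen for this illustration; they are not part of the lesson's own examples.

```java
class UnicodeEscapeDemo {
    // Hypothetical helper: formats a char as its six-character escape text.
    static String toEscape(char c) {
        // The doubled backslash keeps the compiler from reading this as an escape.
        return String.format("\\u%04X", (int) c);
    }

    public static void main(String[] args) {
        String str = "\u00F6";   // escape for the letter o with diaeresis
        char c = '\u00F6';

        System.out.println(str.length());        // 1: the escape yields a single char
        System.out.println((int) c);             // 246, which is 0xF6
        System.out.println(str.charAt(0) == c);  // true
        System.out.println(toEscape(c));         // the escape text for c
    }
}
```

Because the source file contains only ASCII characters, it compiles the same way regardless of the encoding the compiler assumes for source files.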

In this lesson we discuss the APIs you use to translate non-Unicode text into Unicode. Before using these APIs, you should verify that the character encoding you wish to convert into Unicode is supported. The list of supported character encodings is not part of the Java programming language specification. Therefore, the character encodings supported by the APIs may vary with platform. To see which encodings the Java Development Kit supports, see the "Supported Encodings" section in the Internationalization Overview document.
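
Besides consulting the documentation, support for an encoding can be checked at run time. The sketch below assumes a JDK that includes the java.nio.charset package (added after this lesson was first written); the class name EncodingSupport, the helper isSupported, and the sample encoding names are illustrative choices only.

```java
import java.nio.charset.Charset;

class EncodingSupport {
    // Hypothetical helper: true if this Java runtime can convert the named encoding.
    static boolean isSupported(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            // Thrown for null or syntactically illegal names; treat as unsupported.
            return false;
        }
    }

    public static void main(String[] args) {
        // Sample names only; results for the optional ones vary by platform.
        String[] names = {"US-ASCII", "ISO-8859-1", "UTF-8", "Shift_JIS", "x-no-such-encoding"};
        for (String name : names) {
            System.out.println(name + ": " + isSupported(name));
        }
        // The full list for this runtime is available from Charset.availableCharsets().
        System.out.println("Total supported: " + Charset.availableCharsets().size());
    }
}
```

Every Java platform implementation is required to support US-ASCII, ISO-8859-1, UTF-8, and the UTF-16 variants; anything beyond that should be checked before use.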

The following sections describe two techniques for converting non-Unicode text:

Byte Encodings and Strings

This section shows you how to convert non-Unicode byte arrays into String objects, and vice versa.

Character and Byte Streams

In this section you'll learn how to translate between streams of Unicode characters and byte streams of non-Unicode text.
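
As a preview of both techniques, the following is a minimal sketch rather than the code used in those sections. It assumes a JDK with java.nio.charset.StandardCharsets; the class name ConversionSketch and the helpers decode, encode, and readAll are hypothetical names introduced here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ConversionSketch {
    // Technique 1: convert a byte array in a named encoding into a String.
    static String decode(byte[] bytes, Charset encoding) {
        return new String(bytes, encoding);
    }

    // Technique 1, reversed: convert a String into bytes in a named encoding.
    static byte[] encode(String text, Charset encoding) {
        return text.getBytes(encoding);
    }

    // Technique 2: read a byte stream as characters through an InputStreamReader.
    static String readAll(InputStream in, Charset encoding) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, encoding)) {
            int ch;
            while ((ch = reader.read()) != -1) {
                text.append((char) ch);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // The letter o with diaeresis occupies two bytes in UTF-8.
        byte[] utf8 = {(byte) 0xC3, (byte) 0xB6};

        String fromBytes = decode(utf8, StandardCharsets.UTF_8);
        System.out.println(fromBytes.equals("\u00F6"));   // true

        // The same character is the single byte 0xF6 in ISO Latin-1.
        byte[] latin1 = encode(fromBytes, StandardCharsets.ISO_8859_1);
        System.out.println(latin1.length);                // 1

        String fromStream = readAll(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);
        System.out.println(fromStream.equals(fromBytes)); // true
    }
}
```

Passing the encoding explicitly, rather than relying on the platform default, keeps the result the same on every machine.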

