Previous | Next | Trail Map | Internationalization | Detecting Text Boundaries

About the BreakIterator Class

The BreakIterator class is locale-sensitive, because text boundaries vary with language. For example, the syntax rules for line breaks are not the same for all languages. To determine which locales the BreakIterator class supports, invoke the getAvailableLocales method:
Locale[] locales = BreakIterator.getAvailableLocales();

You can analyze four different kinds of boundaries with the BreakIterator class: character, word, sentence, and potential line break. When instantiating a BreakIterator, you invoke the appropriate creation method: getCharacterInstance, getWordInstance, getSentenceInstance, or getLineInstance.

Each instance of BreakIterator can detect just one type of boundary. If you want to locate both character and word boundaries, for example, you'll need to create two separate instances.
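As a minimal sketch of the point above, the following class creates one iterator for character boundaries and a second one for word boundaries over the same text. The class name TwoIteratorsSketch, the helper countBoundaries, the sample string, and the choice of Locale.US are assumptions made for illustration; they are not taken from BreakIteratorDemo.java.

```java
import java.text.BreakIterator;
import java.util.Locale;

public class TwoIteratorsSketch {

    // Hypothetical helper: counts every boundary the iterator reports
    // for the text, including the first boundary at index 0.
    static int countBoundaries(BreakIterator iterator, String text) {
        iterator.setText(text);
        int count = 0;
        for (int pos = iterator.first();
                 pos != BreakIterator.DONE;
                 pos = iterator.next()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "Hello world"; // sample text, assumed for this sketch

        // Each instance detects only one kind of boundary,
        // so two kinds of analysis need two instances.
        BreakIterator characters = BreakIterator.getCharacterInstance(Locale.US);
        BreakIterator words = BreakIterator.getWordInstance(Locale.US);

        System.out.println("character boundaries: "
                + countBoundaries(characters, text));
        System.out.println("word boundaries: "
                + countBoundaries(words, text));
    }
}
```

The two iterators are independent of each other: calling setText or moving the cursor on one has no effect on the other.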

A BreakIterator has an imaginary cursor that points to the current boundary in a string of text. You can move this cursor within the text with the previous and next methods. For example, if you've created a BreakIterator with getWordInstance, every time you invoke the next method the cursor moves to the next word boundary in the text. The cursor movement methods return an integer indicating the position of the boundary. This position is the index of the character in the text string that would follow the boundary. Like string indexes, the boundaries are zero-based. The first boundary is at 0, and the last boundary is the length of the string.
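The cursor behavior described above can be sketched as follows. The class name BoundaryCursorSketch, the helper wordBoundaries, the sample sentence, and Locale.US are illustrative assumptions. The sketch relies on the first method to place the cursor at the initial boundary and on the BreakIterator.DONE constant, which next returns once the cursor has moved past the last boundary.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class BoundaryCursorSketch {

    // Hypothetical helper: walks the cursor forward through the text
    // and records the position returned at each word boundary.
    static List<Integer> wordBoundaries(String text) {
        BreakIterator iterator = BreakIterator.getWordInstance(Locale.US);
        iterator.setText(text);

        List<Integer> positions = new ArrayList<>();
        int boundary = iterator.first();      // cursor starts at index 0
        while (boundary != BreakIterator.DONE) {
            positions.add(boundary);
            boundary = iterator.next();       // advance to the next boundary
        }
        return positions;
    }

    public static void main(String[] args) {
        String text = "She stopped."; // sample text, assumed for this sketch
        List<Integer> positions = wordBoundaries(text);

        // The first recorded position is 0 and the last equals text.length().
        System.out.println("boundaries: " + positions);
        System.out.println("text length: " + text.length());
    }
}
```

Moving backward works the same way: after positioning the cursor with last, repeated calls to previous return the boundaries in descending order until DONE is returned.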

You should use the BreakIterator class only with natural-language text. Do not use it to tokenize source code written in a programming language.

In the sections that follow, we'll provide examples for each type of boundary analysis. The coding examples are from the source code file named BreakIteratorDemo.java.

