Previous | Next | Trail Map | Internationalization | Detecting Text Boundaries

Word Boundaries

You invoke the getWordInstance method to instantiate a BreakIterator that detects word boundaries:
BreakIterator wordIterator =
   BreakIterator.getWordInstance(currentLocale);
You'll want to create such a BreakIterator when your application needs to perform operations on individual words. These operations might be common word processing functions such as selecting, cutting, pasting, and copying. Or, your application may search for words, and to do so it must be able to distinguish whole words from arbitrary substrings.
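
As an illustration of the selection case, the sketch below finds the word that surrounds a given character offset, which is roughly what a double-click selection needs. It is a minimal sketch, not code from BreakIteratorDemo.java: the class name WordAtOffset and the helper wordAt are hypothetical, and it relies on the following and previous methods of BreakIterator.

```java
import java.text.BreakIterator;
import java.util.Locale;

class WordAtOffset {

    // Hypothetical helper: returns the segment between the two word
    // boundaries that surround the given offset.
    static String wordAt(String text, int offset, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        int end = it.following(offset);   // first boundary after offset
        if (end == BreakIterator.DONE) {
            return "";                    // offset is at the end of the text
        }
        int start = it.previous();        // boundary just before that one
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        // Offset 7 is the 'o' in "world"
        System.out.println(wordAt("Hello world", 7, Locale.US)); // prints "world"
    }
}
```

If the offset falls on a space or punctuation mark, this helper returns that non-word segment instead, so a caller may want to test the first character before treating the result as a word.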

When performing word boundary analysis, a BreakIterator differentiates between words and characters that are not part of words. These characters, which include spaces, tabs, punctuation marks, and some symbols, have word boundaries on both sides.
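
To make the "boundaries on both sides" rule concrete, the following sketch collects the boundary offsets for a short string. The class name BoundaryList and the method boundaries are hypothetical names chosen for this illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BoundaryList {

    // Hypothetical helper: returns every word-boundary offset in the text.
    static List<Integer> boundaries(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<Integer> result = new ArrayList<>();
        for (int b = it.first(); b != BreakIterator.DONE; b = it.next()) {
            result.add(b);
        }
        return result;
    }

    public static void main(String[] args) {
        // The comma (offset 2) and the space (offset 3) each have a
        // boundary on both sides.
        System.out.println(boundaries("Hi, you", Locale.US)); // prints [0, 2, 3, 4, 7]
    }
}
```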

In the example that follows, which is from the program BreakIteratorDemo.java, we want to mark the word boundaries in some text. First, we create the BreakIterator and then call a method we've written called markBoundaries:

Locale currentLocale = new Locale ("en","US");

BreakIterator wordIterator =
   BreakIterator.getWordInstance(currentLocale);

String someText = "She stopped.  " +
                  "She said, \"Hello there,\" and then went on.";

markBoundaries(someText, wordIterator);
The purpose of the markBoundaries method is to mark the word boundaries in a string with a caret ('^'). Every time the BreakIterator detects a word boundary, we place a caret at the corresponding position in the markers buffer. We scan the string in a loop, invoking the next method until it returns BreakIterator.DONE. Here is the code for the markBoundaries routine:
static void markBoundaries(String target, BreakIterator iterator) {

   // One marker slot per character, plus one for the boundary at the end of the text
   StringBuilder markers = new StringBuilder();
   markers.setLength(target.length() + 1);
   for (int k = 0; k < markers.length(); k++) {
      markers.setCharAt(k, ' ');
   }

   // Visit every boundary, starting with the first, until the iterator is exhausted
   iterator.setText(target);
   int boundary = iterator.first();

   while (boundary != BreakIterator.DONE) {
      markers.setCharAt(boundary, '^');
      boundary = iterator.next();
   }

   System.out.println(target);
   System.out.println(markers);
}
The markBoundaries method prints out the target string and the markers buffer. Note where the carets ('^') occur in relation to the punctuation marks and spaces:
She stopped.  She said, "Hello there," and then went on.
^  ^^      ^^ ^  ^^   ^^^^    ^^    ^^^^  ^^   ^^   ^^  ^
The BreakIterator makes it easy to select words from within text. You don't have to write your own routines to handle the punctuation rules of various languages, because the BreakIterator class does this for you. Here is a routine that extracts and prints the words in a given string:
static void extractWords(String target, BreakIterator wordIterator) {

   wordIterator.setText(target);
   int start = wordIterator.first();
   int end = wordIterator.next();

   // Each pass examines the segment between two adjacent boundaries
   while (end != BreakIterator.DONE) {
      String word = target.substring(start, end);
      // Skip segments that begin with a space or punctuation mark
      if (Character.isLetterOrDigit(word.charAt(0))) {
         System.out.println(word);
      }
      start = end;
      end = wordIterator.next();
   }
}
In our sample program, we invoke extractWords, passing it the same target string we used in the previous example. The extractWords method prints out the following list of words:
She
stopped
She
said
Hello
there
and
then
went
on.
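
When the words are needed for further processing rather than printing, a variant of extractWords can return them as a list. The following is a minimal sketch under that assumption; the class name WordList and the method words are hypothetical and are not part of BreakIteratorDemo.java.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class WordList {

    // Hypothetical variant of extractWords: collects the words instead of
    // printing them.
    static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            // Keep only segments that begin with a letter or digit
            if (Character.isLetterOrDigit(text.codePointAt(start))) {
                result.add(text.substring(start, end));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> found = words("Hello there, and then", Locale.US);
        System.out.println(found);        // prints [Hello, there, and, then]
        System.out.println(found.size()); // prints 4
    }
}
```

Returning a list also gives a word count for free through size().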

