Detecting Text Boundaries |
You invoke the getWordInstance method to instantiate a BreakIterator that detects word boundaries:

    BreakIterator wordIterator = BreakIterator.getWordInstance(currentLocale);

You'll want to create such a BreakIterator when your application needs to perform operations on individual words. These operations might be common word processing functions such as selecting, cutting, pasting, and copying. Or your application may search for words, and to do so it must be able to distinguish entire words from simple strings.

When performing word boundary analysis, a BreakIterator differentiates between words and characters that are not part of words. These characters, which include spaces, tabs, punctuation marks, and some symbols, have word boundaries on both sides.

In the example that follows, which is from the program BreakIteratorDemo.java, we want to mark the word boundaries in some text. First, we create the
BreakIterator
and then call a method we've written called markBoundaries:

    Locale currentLocale = new Locale("en", "US");
    BreakIterator wordIterator = BreakIterator.getWordInstance(currentLocale);
    String someText = "She stopped.  " +
                      "She said, \"Hello there,\" and then went on.";
    markBoundaries(someText, wordIterator);

The purpose of the markBoundaries
method is to mark the word boundaries in a string with a caret ('^'). Every time the BreakIterator detects a word boundary, we insert a caret into the markers buffer. We scan the string in a loop, invoking the next method until it returns BreakIterator.DONE. Here is the code for the markBoundaries routine:

    static void markBoundaries(String target, BreakIterator iterator) {
        // One marker slot per character, plus one for the boundary at the end of the text.
        StringBuffer markers = new StringBuffer();
        markers.setLength(target.length() + 1);
        for (int k = 0; k < markers.length(); k++) {
            markers.setCharAt(k, ' ');
        }

        iterator.setText(target);
        int boundary = iterator.first();
        while (boundary != BreakIterator.DONE) {
            markers.setCharAt(boundary, '^');
            boundary = iterator.next();
        }

        System.out.println(target);
        System.out.println(markers);
    }

The markBoundaries
method prints out the target string and the markers buffer. Note where the carets ('^') occur in relation to the punctuation marks and spaces:

    She stopped.  She said, "Hello there," and then went on.
    ^  ^^      ^^ ^  ^^   ^^^^    ^^    ^^^^  ^^   ^^   ^^  ^

The BreakIterator
makes it easy to select words from within text. You don't have to write your own routines to handle the punctuation rules of various languages, because the BreakIterator class does this for you. Here is a routine that extracts and prints the words in a given string:

    static void extractWords(String target, BreakIterator wordIterator) {
        wordIterator.setText(target);
        int start = wordIterator.first();
        int end = wordIterator.next();

        while (end != BreakIterator.DONE) {
            String word = target.substring(start, end);
            // Skip the segments between words (spaces, punctuation).
            if (Character.isLetterOrDigit(word.charAt(0))) {
                System.out.println(word);
            }
            start = end;
            end = wordIterator.next();
        }
    }

In our sample program, we invoke extractWords, passing it the same target string we used in the previous example. The extractWords method prints out the following list of words:

    She
    stopped
    She
    said
    Hello
    there
    and
    then
    went
    on.