Previous | Next | Trail Map | Internationalization | Detecting Text Boundaries

Sentence Boundaries

In many languages the sentence terminator is a period. In English, we also use the period as a decimal separator, in ellipsis marks, and to terminate abbreviations. Because the period has more than one purpose, we can't always determine sentence boundaries accurately.

First, let's look at a case where sentence boundary analysis does work. We start by creating a BreakIterator with the getSentenceInstance method:

BreakIterator sentenceIterator =
   BreakIterator.getSentenceInstance(currentLocale);
To demonstrate sentence boundaries, we'll use the markBoundaries method, which we discussed in the previous section. The markBoundaries method prints carets ('^') beneath a string to indicate boundary positions. In the following example the sentence boundaries are properly identified:
She stopped.  She said, "Hello there," and then went on.
^             ^                                         ^
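For readers who don't have the previous section at hand, here is a minimal sketch of how such a caret line can be built. The class name BoundaryMarker and the helper markerLine are hypothetical names chosen for illustration, and the code is not necessarily identical to the markBoundaries method from the previous section. Only the BreakIterator calls (getSentenceInstance, setText, first, next, and the DONE constant) are standard java.text API.

```java
import java.text.BreakIterator;
import java.util.Arrays;
import java.util.Locale;

class BoundaryMarker {

    // Hypothetical helper: returns a line of spaces with a caret
    // at every boundary offset reported by the iterator.
    static String markerLine(String target, BreakIterator iterator) {
        // One extra slot, because the final boundary sits just past the last character.
        char[] markers = new char[target.length() + 1];
        Arrays.fill(markers, ' ');

        iterator.setText(target);
        for (int boundary = iterator.first();
             boundary != BreakIterator.DONE;
             boundary = iterator.next()) {
            markers[boundary] = '^';
        }
        return new String(markers);
    }

    public static void main(String[] args) {
        String text = "She stopped.  She said, \"Hello there,\" and then went on.";
        BreakIterator sentenceIterator = BreakIterator.getSentenceInstance(Locale.US);
        System.out.println(text);
        System.out.println(markerLine(text, sentenceIterator));
    }
}
```

Returning the marker line as a string, rather than printing it inside the helper, makes the boundary positions easy to check programmatically.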
You can also locate the boundaries of sentences that end with question marks and exclamation points:
He's vanished!  What will we do?  It's up to us.
^               ^                 ^             ^
Using the period as a decimal point does not cause an error:
Please add 1.5 liters to the tank.
^                                 ^
An ellipsis mark (three spaced periods) indicates the omission of text within a quoted passage. Sentence boundary analysis handles ellipses poorly. In the next example, the periods within each ellipsis generate spurious sentence boundaries:
"No man is an island . . . every man . . . "
^                      ^ ^             ^ ^ ^^
Abbreviations may also cause errors. If a period is followed by whitespace and an uppercase letter, the BreakIterator detects a sentence boundary. In the following example, this rule places a false boundary between "Mr." and "Jones":
My friend, Mr. Jones, has a new dog.  The dog's name is Spot.
^              ^                      ^                      ^
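False boundaries like this one are easy to overlook in a caret line, so it can help to print each detected segment on its own line. The following sketch does that. SentenceSplitter and split are hypothetical names chosen for illustration, and whether "Mr." is separated from "Jones" depends on the rules of the BreakIterator implementation in use.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceSplitter {

    // Hypothetical helper: cuts the text at every boundary the iterator reports.
    static List<String> split(String text, Locale locale) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(locale);
        iterator.setText(text);

        List<String> segments = new ArrayList<>();
        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE; end = iterator.next()) {
            segments.add(text.substring(start, end));
            start = end;
        }
        return segments;
    }

    public static void main(String[] args) {
        String text = "My friend, Mr. Jones, has a new dog.  The dog's name is Spot.";
        // Brackets make trailing whitespace in each segment visible.
        for (String segment : split(text, Locale.US)) {
            System.out.println("[" + segment + "]");
        }
    }
}
```

If the abbreviation is treated as a sentence end, the first printed segment stops right after "Mr. " instead of running through "dog.", which makes the error obvious at a glance.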

