Detecting Text Boundaries |
In many languages the sentence terminator is a period. In English, we also use the period as a decimal separator, to indicate an ellipsis mark, and to terminate abbreviations. Because the period has more than one purpose, we can't always determine sentence boundaries with accuracy.

First, let's look at a case where sentence boundary analysis does work. We start by creating a BreakIterator with the getSentenceInstance method:

    BreakIterator sentenceIterator = BreakIterator.getSentenceInstance(currentLocale);

To demonstrate sentence boundaries, we'll use the markBoundaries method, which we discussed in the previous section. The markBoundaries method prints carets ('^') beneath a string to indicate boundary positions. In the following example the sentence boundaries are properly identified:

    She stopped. She said, "Hello there," and then went on.
    ^            ^                                         ^

You can also locate the boundaries of sentences that end with question marks and exclamation points:

    He's vanished! What will we do? It's up to us.
    ^              ^                ^             ^

Using the period as a decimal point does not cause an error:

    Please add 1.5 liters to the tank.
    ^                                 ^

An ellipsis mark (three spaced periods) indicates the omission of text within a quoted passage. In the next example, the ellipses generate sentence boundaries:

    "No man is an island . . . every man . . . "
    ^                      ^ ^             ^ ^ ^^

Abbreviations may also cause errors. If a period is followed by whitespace and an uppercase letter, the BreakIterator detects a sentence boundary:

    My friend, Mr. Jones, has a new dog. The dog's name is Spot.
    ^              ^                     ^                      ^
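The following is a minimal, self-contained sketch of how the sentence boundaries discussed above might be collected and marked. It is not the markBoundaries method from the previous section. The class name SentenceBoundaryDemo and the helpers boundaries and caretLine are hypothetical names introduced only for illustration. The sketch assumes Locale.US in place of currentLocale, and the exact boundaries reported can vary with the rules shipped in a given Java release.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; not part of the original tutorial code.
public class SentenceBoundaryDemo {

    // Collects every offset the sentence iterator reports,
    // starting with first() and ending when next() returns DONE.
    static List<Integer> boundaries(String text, Locale locale) {
        BreakIterator sentenceIterator = BreakIterator.getSentenceInstance(locale);
        sentenceIterator.setText(text);
        List<Integer> offsets = new ArrayList<>();
        for (int pos = sentenceIterator.first();
                 pos != BreakIterator.DONE;
                 pos = sentenceIterator.next()) {
            offsets.add(pos);
        }
        return offsets;
    }

    // Builds a line with a caret under each boundary offset.
    // The line is one character longer than the text because the
    // final boundary sits just past the last character.
    static String caretLine(String text, List<Integer> offsets) {
        char[] marks = new char[text.length() + 1];
        Arrays.fill(marks, ' ');
        for (int offset : offsets) {
            marks[offset] = '^';
        }
        return new String(marks);
    }

    public static void main(String[] args) {
        String[] samples = {
            "She stopped. She said, \"Hello there,\" and then went on.",
            "Please add 1.5 liters to the tank.",
            "My friend, Mr. Jones, has a new dog. The dog's name is Spot."
        };
        for (String text : samples) {
            System.out.println(text);
            System.out.println(caretLine(text, boundaries(text, Locale.US)));
        }
    }
}
```

Running the sketch against the abbreviation example is a quick way to check whether the Java release in use still reports a boundary after "Mr." as described above.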