Chapter 8 Characters and Strings

Principle of enumeration
• Computers tend to be good at working with numeric
data.
• The ability to represent an integer value, however,
also makes it easy to work with other data types as
long as it is possible to represent those types using
integers. For types consisting of a finite set of values,
the easiest approach is simply to number the elements
of the collection.
• Types that are identified by counting off the elements
are called enumerated types.
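As a minimal sketch of this idea, Java's enum facility numbers the elements of a type automatically. The Direction type and its element names below are hypothetical, chosen only for illustration.

```java
class EnumerationDemo {
    /** A hypothetical enumerated type; the element names are illustrative only. */
    enum Direction { NORTH, EAST, SOUTH, WEST }

    public static void main(String[] args) {
        // ordinal() exposes the integer Java assigns to each element, counting from 0.
        System.out.println(Direction.NORTH.ordinal());  // prints 0
        System.out.println(Direction.WEST.ordinal());   // prints 3
        // values() maps a number back to the element it stands for.
        System.out.println(Direction.values()[2]);      // prints SOUTH
    }
}
```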
Characters
• Computers use the principle of enumeration to represent
character data inside the memory. If you assign an integer to
each character, you can use that integer as a code for the
character it represents.
• Character codes, however, are not particularly useful unless
they are standardized.
• The first widely adopted character coding was ASCII:
American Standard Code for Information Interchange.
• With only 128 characters (256 in its 8-bit extensions), ASCII proved
inadequate to represent the many alphabets in use throughout the world.
• ASCII has been superseded by Unicode.
• Figure 8-1, p. 256, table.
Some notes
• The first thing to remember about the Unicode table
is that you don’t actually have to learn the numeric
code for the characters. The important observation is
that a character has a numeric representation, and not
what that representation happens to be.
• A character constant consists of the desired character
enclosed in single quotation marks. Thus, the constant
'A' in a program indicates the Unicode representation
of an uppercase A. That it has the value 101 in octal
(65 in decimal) is an irrelevant detail.
Important properties
• The codes for the digits 0 through 9 are consecutive.
'0' + 9 is '9'
• The codes for the uppercase letters A through Z are
consecutive; the codes for the lowercase letters a
through z are consecutive.
'a' + 2 is 'c'
• The arithmetic operations can be used with character
values just as with integers.
• Avoid using integer constants to refer to Unicode
characters.
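The consecutive-code properties can be sketched as follows. The class and method names are hypothetical; the casts are needed because Java performs the addition in int.

```java
class CharArithmeticDemo {
    /** Returns the character for a single decimal digit in the range 0..9. */
    static char digitToChar(int digit) {
        return (char) ('0' + digit);   // relies on '0'..'9' being consecutive
    }

    /** Returns the character n positions after the given letter. */
    static char shiftLetter(char letter, int n) {
        return (char) (letter + n);    // relies on the letters being consecutive
    }

    public static void main(String[] args) {
        System.out.println(digitToChar(9));        // prints 9
        System.out.println(shiftLetter('a', 2));   // prints c
    }
}
```

Note that the code uses the constants '0' and 'a' rather than their numeric values, in line with the advice above.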
Special characters
• Most of the characters in the Unicode table appear on the
keyboard. They are called printing characters.
• The table also includes special characters. They are indicated
in the Unicode table by an escape sequence, which consists of
a backslash followed by a character or sequence of digits.
\b    Backspace
\f    Form feed (starts a new page)
\n    Newline (moves to the next line)
\r    Return (moves to the beginning of the current line)
\t    Tab (moves to the next tab stop)
\\    The backslash character itself
\'    The character ' (single quote)
\"    The character " (double quote)
\ddd  The character whose Unicode value is the octal number ddd
Conversion
• It is better to make the conversion between int
(Unicode) and char (character) explicit by introducing
type casts.
Example
Randomly generate an uppercase letter.
private char randomLetter() {
    return (char) rgen.nextInt((int) 'A', (int) 'Z');
}
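The rgen object in the example is not defined on this slide. For readers who want something runnable without it, here is a sketch that swaps in java.util.Random from the standard library. This substitution is an assumption, not the method the slides use.

```java
import java.util.Random;

class RandomLetterDemo {
    private static final Random RANDOM = new Random();

    /** Returns a randomly chosen uppercase letter from 'A' to 'Z'. */
    static char randomLetter() {
        // nextInt(26) yields 0..25, so the sum spans the codes for 'A'..'Z'.
        return (char) ('A' + RANDOM.nextInt(26));
    }

    public static void main(String[] args) {
        System.out.println(randomLetter());
    }
}
```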
The operations that generally make sense:
• Adding an integer to a character (usually a digit).
• Subtracting one character from another.
'a' - 'A' gives the distance between a lowercase letter and its
corresponding uppercase letter.
'M' + ('a' - 'A') gives 'm'
This can be used to convert uppercase letters into lowercase
letters.
• Comparing two characters
(ch >= 'a') && (ch <= 'z') is true if ch is a lowercase letter
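The operations above can be combined into a small case converter. This is a sketch with hypothetical method names, and it assumes only the unaccented letters A through Z need handling.

```java
class CaseConversionDemo {
    /** True if ch is one of the lowercase letters 'a' through 'z'. */
    static boolean isLowerCaseLetter(char ch) {
        return (ch >= 'a') && (ch <= 'z');
    }

    /** Converts 'A'..'Z' to lowercase; any other character is returned unchanged. */
    static char toLower(char ch) {
        if ((ch >= 'A') && (ch <= 'Z')) {
            // 'a' - 'A' is the fixed distance between the two cases.
            return (char) (ch + ('a' - 'A'));
        }
        return ch;
    }

    public static void main(String[] args) {
        System.out.println(toLower('M'));            // prints m
        System.out.println(isLowerCaseLetter('M'));  // prints false
    }
}
```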
Useful methods in the Character class
static boolean isDigit(char ch)
static boolean isLetter(char ch)
static boolean isLetterOrDigit(char ch)
static boolean isLowerCase(char ch)
static boolean isUpperCase(char ch)
static boolean isWhitespace(char ch)
static char toLowerCase(char ch)
static char toUpperCase(char ch)
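A brief sketch of how these static methods are called. The countDigits helper is hypothetical and exists only to show one of them in a loop.

```java
class CharacterMethodsDemo {
    /** Counts the decimal digits in str using Character.isDigit. */
    static int countDigits(String str) {
        int count = 0;
        for (int i = 0; i < str.length(); i++) {
            if (Character.isDigit(str.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countDigits("Room 101, floor 3"));  // prints 4
        System.out.println(Character.isLetterOrDigit('_'));    // prints false
        System.out.println(Character.isWhitespace('\t'));      // prints true
        System.out.println(Character.toUpperCase('q'));        // prints Q
        System.out.println(Character.toLowerCase('7'));        // prints 7 (non-letters are unchanged)
    }
}
```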
Strings
• Java defines many useful methods that operate on the String
class.
• The String class uses the receiver syntax when you call a
method on a string.
• The String class is immutable: none of its methods ever changes
the internal state of a string. Classes that prohibit clients from
changing an object's state are said to be immutable.
• Instead, these methods return a new string on
which the desired changes have been performed.
• To change a string variable, assign the returned string back to it:
str = str.toLowerCase();
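The difference between calling a method and reassigning the variable can be seen in a short sketch. The two helper methods are hypothetical and exist only to make the contrast testable.

```java
class ImmutabilityDemo {
    /** Calls toLowerCase but discards the result, so the returned value is unchanged. */
    static String withoutAssignment(String str) {
        str.toLowerCase();        // creates a new string that is immediately dropped
        return str;
    }

    /** Assigns the result back, so the variable now refers to the new string. */
    static String withAssignment(String str) {
        str = str.toLowerCase();
        return str;
    }

    public static void main(String[] args) {
        System.out.println(withoutAssignment("Hello"));  // prints Hello
        System.out.println(withAssignment("Hello"));     // prints hello
    }
}
```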
Strings vs. characters
• Both the String and the Character classes export a
toUpperCase method.
• In the Character class, you call toUpperCase as a
static method
ch = Character.toUpperCase(ch);
• In the String class, you apply toUpperCase to an
existing string
str = str.toUpperCase();
Selecting characters from a string
• In Java, positions within a string are numbered
starting from 0.
str.charAt(1) gives the second character in str.
• A substring can be extracted from a larger string. If a
string variable str contains "hello, world", then
str.substring(1, 4)
returns "ell": the characters from position 1 up to, but not including, position 4.
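A sketch of zero-based selection in practice. The lastChar and middle helpers are hypothetical names, and both assume the string has at least two characters.

```java
class SelectingCharactersDemo {
    /** Returns the last character; positions run from 0 to length() - 1. */
    static char lastChar(String str) {
        return str.charAt(str.length() - 1);
    }

    /** Returns str without its first and last characters. */
    static String middle(String str) {
        // The end index is exclusive, so length() - 1 drops the final character.
        return str.substring(1, str.length() - 1);
    }

    public static void main(String[] args) {
        String str = "hello, world";
        System.out.println(str.charAt(1));        // prints e
        System.out.println(str.substring(1, 4));  // prints ell
        System.out.println(str.substring(7));     // prints world
        System.out.println(lastChar(str));        // prints d
        System.out.println(middle(str));          // prints ello, worl
    }
}
```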
Comparing strings
• Equality: Use s1.equals(s2) instead of s1 == s2,
since s1 == s2 compares the references s1 and s2 (whether they are
the same object), not the contents of the strings.
• Order: Use s1.compareTo(s2). It returns a negative number, zero, or a
positive number according to whether s1 comes before, equals, or comes
after s2 in the numeric ordering imposed by the underlying
character codes (lexicographic order). This differs from
conventional dictionary ordering; for example, every uppercase letter
precedes every lowercase letter.
• For characters, c1 < c2 compares the codes of c1 and c2.
Other methods in the String class, Figure 8-4, p. 266.
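The comparison rules can be checked with a short sketch. The comesBefore helper is a hypothetical wrapper around compareTo.

```java
class StringComparisonDemo {
    /** True if a precedes b in the ordering defined by the character codes. */
    static boolean comesBefore(String a, String b) {
        return a.compareTo(b) < 0;
    }

    public static void main(String[] args) {
        String s1 = "hello";
        String s2 = new String("hello");   // a distinct object with the same contents
        System.out.println(s1 == s2);      // prints false: different objects
        System.out.println(s1.equals(s2)); // prints true: same contents

        System.out.println(comesBefore("apple", "banana"));  // prints true
        // 'Z' (code 90) precedes 'a' (code 97), unlike in a dictionary.
        System.out.println(comesBefore("Zebra", "apple"));   // prints true
        // compareToIgnoreCase gives an ordering closer to a dictionary's.
        System.out.println("Zebra".compareToIgnoreCase("apple") > 0);  // prints true
    }
}
```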
Searching within a string
/** Given a string composed of separate words, this method returns its
* acronym.
* @param str Given string composed of separate words.
* @return The acronym of the given string.
*/
private String acronym(String str) {
    String result = str.substring(0, 1);            /* get the first character */
    int pos = str.indexOf(' ');                     /* position of the first space */
    while (pos != -1) {                             /* while another space is found */
        result += str.substring(pos + 1, pos + 2);  /* concatenate a letter */
        pos = str.indexOf(' ', pos + 1);            /* position of the next space */
    }
    return result;
}
Simple string idioms
• Iterating through the characters in a string
for (int i = 0; i < str.length(); i++) {
    char ch = str.charAt(i);
    code to process each character in turn . . .
}
• Growing a new string character by character
String result = "";
for (whatever limits) {
    code to determine next ch to be added . . .
    result += ch;
}
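Both idioms can be combined in one method. The sketch below uses a hypothetical lettersOnly helper that walks through a string and builds a new one containing only its letters.

```java
class StringIdiomsDemo {
    /** Returns a new string containing only the letters of str, in order. */
    static String lettersOnly(String str) {
        String result = "";                       // grow the result character by character
        for (int i = 0; i < str.length(); i++) {  // iterate through every position
            char ch = str.charAt(i);
            if (Character.isLetter(ch)) {
                result += ch;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(lettersOnly("a1b2, c!"));  // prints abc
    }
}
```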
A case study
/*
 * File: PigLatin.java
 * -----------------------
 * This file takes a line of text and converts each word into Pig Latin while
 * keeping punctuation marks.
 * The rules for forming Pig Latin words are as follows:
 * - If the word begins with a vowel, add "way" to the end of the word.
 * - If the word begins with a consonant, extract the set of consonants up
 *   to the first vowel, move that set of consonants to the end of the word
 *   and add "ay".
 * - If the word contains no vowel, the word is unchanged.
 */
• Top level English pseudo code
public void run() {
Tell the user what the program does.
Ask the user for a line of text.
Translate the line into Pig Latin and print it on the
console.
}
• Implementation at the current level
public void run() {
    println("This program translates a line into Pig Latin.");
    String line = readLine("Enter a line: ");
    Translate the line into Pig Latin and print it on the console.
}
• Define a method to replace the English pseudo code (interface design)
public void run() {
    println("This program translates a line into Pig Latin.");
    String line = readLine("Enter a line: ");
    println(translateLine(line));
}
/**
 * Translates a line into Pig Latin.
 * @param line A line of English text
 * @return The line translated into Pig Latin
 */
private String translateLine(String line)
• Next level English pseudo code
Apply a pattern, recalling the acronym pattern.
private String translateLine(String line) {
    String result = "";
    while not end {
        Get the next word;
        Translate that word into Pig Latin;
        Append the translated word to result;
    }
    return result;
}
• As a programmer, you will often trip over some detail
that the framers of the problem either overlooked or
considered too obvious to mention. In some cases, the
omission is serious enough that you have to discuss it
with the person who assigned you the programming
task. In many cases, however, you will have to choose
for yourself a policy that seems reasonable.
– In this case, the specification is unclear about spaces and
punctuation marks. A reasonable decision is: Keep spaces
and punctuation marks, translate words only.
Implementation guideline
• Identify reusable code.
• Use library classes whenever possible.
StringTokenizer class
import java.util.*;
A token is a sequence of characters that acts as a
coherent unit.
– In this case, take a word as a token, punctuation
marks as delimiters.
Define DELIMITERS: check Wikipedia or the keyboard for the punctuation marks to include.
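A minimal sketch of splitting a line with StringTokenizer. The contents of DELIMITERS here are an assumption (a short, incomplete set), and the tokenize helper is hypothetical. Passing true as the third constructor argument asks the tokenizer to return each delimiter as its own one-character token, which is what allows spaces and punctuation to be kept.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenizerDemo {
    /** Assumed delimiter set for illustration; a real program would list more marks. */
    static final String DELIMITERS = " ,.;:!?";

    /** Splits line into words and single-character delimiter tokens, in order. */
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line, DELIMITERS, true);
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Each token is printed in brackets so that the space token is visible.
        for (String token : tokenize("Hello, world!")) {
            System.out.println("[" + token + "]");
        }
    }
}
```

For the input "Hello, world!" the tokens are Hello, a comma, a space, world, and an exclamation mark, so a translator can rewrite the words and copy the other tokens through unchanged.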
Implementation guideline (cont.)
• Use the character methods, FIGURE 8-3, and string
methods, FIGURE 8-4.
• Use for instead of while whenever possible.
– Use for in findFirstVowel, since we can get word.length()
– Use for in isWord, since we can get token.length()
• Use a table to enumerate the cases.
– findFirstVowel, which is called by translateWord, returns
-1, 0, or a positive integer. Thus translateWord
must handle all three cases.
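The three cases can be sketched as follows. The method names findFirstVowel and translateWord come from the guideline above, but their bodies here are assumptions, as is the isEnglishVowel helper, which treats only a, e, i, o, and u as vowels.

```java
class PigLatinWordDemo {
    /** Assumed vowel test: a, e, i, o, u in either case. */
    static boolean isEnglishVowel(char ch) {
        switch (Character.toLowerCase(ch)) {
            case 'a': case 'e': case 'i': case 'o': case 'u':
                return true;
            default:
                return false;
        }
    }

    /** Returns the index of the first vowel in word, or -1 if there is none. */
    static int findFirstVowel(String word) {
        for (int i = 0; i < word.length(); i++) {
            if (isEnglishVowel(word.charAt(i))) {
                return i;
            }
        }
        return -1;
    }

    /** Translates one word, covering each possible result of findFirstVowel. */
    static String translateWord(String word) {
        int vp = findFirstVowel(word);
        if (vp == -1) {
            return word;                    // no vowel: word is unchanged
        } else if (vp == 0) {
            return word + "way";            // begins with a vowel
        } else {
            // move the leading consonants to the end and add "ay"
            return word.substring(vp) + word.substring(0, vp) + "ay";
        }
    }

    public static void main(String[] args) {
        System.out.println(translateWord("scram"));  // prints amscray
        System.out.println(translateWord("apple"));  // prints appleway
        System.out.println(translateWord("nth"));    // prints nth
    }
}
```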
Summary
For each level
• English pseudo code
• Straight implementations at the current level
• Design methods to replace English pseudo code
• Go to next level methods
Apply implementation guideline.
English pseudo code can be used as comments.
Testing
• Bottom-up testing (start with testing methods at the
lowest level and move up, test callees before the
caller)
• Test normal cases
• Test special or extreme (boundaries of input
variables) cases
• Black-box testing (verify input/output specifications)
• White-box testing (execute every part of the code,
conditions in if, switch)