R Graphics

Graphics and Visualization
This is an overview of some of the standard methods available in R for visualization of
data with statistical graphics. Examination of your data graphically is an important early
step of any data analysis. In general, you should start off with univariate methods,
histograms and such, examining each variable in isolation. Then look at pairs of
variables with scatterplots, and work your way up to high-dimensional methods.
Only a small number of examples of each method will be provided. Remember that you
can always use the help function to get more details and options on any of these
functions.
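As a brief sketch of that advice (the choice of barplot, hist, and stripchart here is only illustrative), you can query the help system in a few equivalent ways, and you can list a function's argument names without opening its help page:

```r
# Two equivalent ways to open a help page
help("barplot")
?hist

# List the argument names of the default hist method
# (getS3method finds the method whether or not it is exported)
hist_args <- names(formals(getS3method("hist", "default")))
print(hist_args)

# Run the examples from a help page; ask = FALSE avoids pausing between plots
example("stripchart", ask = FALSE)
```

Running the examples from a help page is often the quickest way to see what an unfamiliar plotting option does.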
1. Categorical Data
data(Titanic)
Titanic
, , Age = Child, Survived = No

      Sex
Class  Male Female
  1st     0      0
  2nd     0      0
  3rd    35     17
  Crew    0      0

, , Age = Adult, Survived = No

      Sex
Class  Male Female
  1st   118      4
  2nd   154     13
  3rd   387     89
  Crew  670      3

, , Age = Child, Survived = Yes

      Sex
Class  Male Female
  1st     5      1
  2nd    11     13
  3rd    13     14
  Crew    0      0

, , Age = Adult, Survived = Yes

      Sex
Class  Male Female
  1st    57    140
  2nd    14     80
  3rd    75     76
  Crew  192     20
ftable(Titanic)
                   Survived  No Yes
Class Sex    Age
1st   Male   Child            0   5
             Adult          118  57
      Female Child            0   1
             Adult            4 140
2nd   Male   Child            0  11
             Adult          154  14
      Female Child            0  13
             Adult           13  80
3rd   Male   Child           35  13
             Adult          387  75
      Female Child           17  14
             Adult           89  76
Crew  Male   Child            0   0
             Adult          670 192
      Female Child            0   0
             Adult            3  20
Titanic1<-margin.table(Titanic, 1)
Titanic1
Class
1st 2nd 3rd Crew
325 285 706 885
1a. Bar Chart
barplot(Titanic1)
barplot(Titanic1, main="Individuals on the Titanic")
1b. Pie Chart (usually, a bar chart is better!)
pie(Titanic1, main="Individuals on the Titanic")
1c. Stacked Bar Chart
Titanic2<-margin.table(Titanic, c(4,1))
Titanic2
        Class
Survived 1st 2nd 3rd Crew
     No  122 167 528  673
     Yes 203 118 178  212
barplot(Titanic2, legend.text=T, main="Survival on the Titanic, By Class")
The ylim argument will let us stretch the y-axis to better fit the legend.
barplot(Titanic2, ylim = c(0,1100), legend.text=T, main="Survival on the Titanic, By Class")
1d. Grouped Bar Chart
barplot(Titanic2, beside=T, ylim = c(0,800), legend.text=T, main="Survival on the Titanic, By Class")
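The bar charts above show counts, so classes of very different sizes are hard to compare. One possible variation (a sketch, not part of the original examples; the name Titanic2.prop is made up here) converts each class to proportions with prop.table before plotting:

```r
data(Titanic)

# Survived-by-Class counts, as in the section above
Titanic2 <- margin.table(Titanic, c(4, 1))

# Convert counts to proportions within each class (each column sums to 1)
Titanic2.prop <- prop.table(Titanic2, margin = 2)
print(round(Titanic2.prop, 3))

# Every stacked bar now has height 1, so survival rates compare directly
barplot(Titanic2.prop, ylim = c(0, 1.3), legend.text = TRUE,
        main = "Survival Proportion on the Titanic, By Class")
```

The extra room in ylim plays the same role as before: it leaves space for the legend above the bars.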
2. Univariate Data
data(iris)
iris
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1            5.1         3.5          1.4         0.2     setosa
2            4.9         3.0          1.4         0.2     setosa
:              :           :            :           :          :
150          5.9         3.0          5.1         1.8  virginica
plength<-iris[,3]
species<-iris[,5]
2a. Strip Chart (Univariate Scatterplot)
stripchart(plength)
stripchart(plength, "jitter")
stripchart(plength, "stack")
stripchart(plength~species, method="stack")
2b. Box Plot
boxplot(plength)
boxplot(plength~species)
2c. Histogram
hist(plength)
Using "freq=F" puts the histogram on a density scale (total area = 1).
hist(plength, breaks="Scott", freq=F)
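If you want to convince yourself of the "total area = 1" claim, here is a minimal check. It assumes only that hist returns its bar heights in the density component and its bin edges in breaks, which is the documented behavior:

```r
data(iris)
plength <- iris[, 3]

# plot = FALSE returns the histogram object without drawing it
h <- hist(plength, breaks = "Scott", plot = FALSE)

# Area of each bar = density (height) * bin width
bar.areas <- h$density * diff(h$breaks)
total.area <- sum(bar.areas)
print(total.area)
```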
You can select your bin edges explicitly. Looking at multiple bin widths and/or starting
points is a good idea.
t<-seq(0.5,7,by=0.5)
t
[1] 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0
hist(plength,breaks=t)
hist(plength,breaks=t+.25)
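To compare two choices of bin edges more directly, you can compute both histograms without drawing them, look at the counts, and then stack the plots. This is only a sketch; the names edges, h1, and h2 are introduced here for illustration:

```r
data(iris)
plength <- iris[, 3]
edges <- seq(0.5, 7, by = 0.5)   # same bin edges as t above

# Two histograms with the same bin width but shifted starting points
h1 <- hist(plength, breaks = edges, plot = FALSE)
h2 <- hist(plength, breaks = edges + 0.25, plot = FALSE)

# Counts per bin for each choice of edges
print(rbind(left.edge = h1$breaks[-length(h1$breaks)], count = h1$counts))
print(rbind(left.edge = h2$breaks[-length(h2$breaks)], count = h2$counts))

# Draw them one above the other for a visual comparison
op <- par(mfrow = c(2, 1))
plot(h1, main = "Bin edges at 0.5, 1.0, ...")
plot(h2, main = "Bin edges at 0.75, 1.25, ...")
par(op)
```

Features that appear under both sets of edges are more likely to be real structure in the data than artifacts of the binning.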
2d. Kernel Density Estimates
Kernel density estimates are similar to histograms, but they are smoother and usually more
accurate estimates of the underlying density. A "kernel" (by default, a normal curve) is
centered on each observation, scaled to have area 1/n and with a standard deviation equal
to a smoothing parameter called the bandwidth ("bw"). The (vertical) sum of all of the
kernels is the final estimate.
Here's an example, showing how a KDE is built up. Don't worry too much about the
code.
X <- rnorm(10)
X
 [1]  0.05395434 -0.86492411 -1.30461335  1.81755076  0.55755471 -1.14514124
 [7] -0.55122585  0.48825963 -1.72901353 -0.38337110
plot(density(X, bw=.4))
points(X, rep(.02,10), pch=16)
t <- seq(-3, 3, length=101)
for (i in 1:10) {lines(t, dnorm(t, X[i], .4)/10, col="red")}
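The construction above can also be checked numerically. The sketch below adds up the scaled normal kernels by hand and compares the result with density. The seed and the name kde.by.hand are arbitrary choices made here, and the two curves should agree closely rather than exactly, since density uses a fast approximation internally:

```r
set.seed(1)            # arbitrary seed, only so the sketch is repeatable
X  <- rnorm(10)
bw <- 0.4

# Built-in estimate
d <- density(X, bw = bw)

# Hand-built estimate: at each grid point, add up the n normal kernels,
# each scaled to have area 1/n
kde.by.hand <- sapply(d$x, function(g) sum(dnorm(g, mean = X, sd = bw)) / length(X))

# Largest disagreement between the two curves
max.diff <- max(abs(kde.by.hand - d$y))
print(max.diff)

plot(d, main = "density() vs. sum of kernels")
lines(d$x, kde.by.hand, col = "red", lty = 2)
```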
plot(density(plength))
The "adjust" argument is a multiplier on the default bandwidth. As with histograms,
looking at more than one is a good idea.
plot(density(plength, adjust=1/2))
plot(density(plength,adjust=1/4))
plot(density(plength,adjust=2))
Setting bw="SJ" chooses a good ("Sheather-Jones") data-driven bandwidth.
plot(density(plength, bw="SJ"), main="Estimated PDF of Petal Length")
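It can help to see the bandwidths as numbers rather than only as curves. A small sketch, assuming the default rule is the one computed by bw.nrd0 and that "SJ" corresponds to bw.SJ with its default method:

```r
data(iris)
plength <- iris[, 3]

# Default rule-of-thumb bandwidth and the Sheather-Jones bandwidth
bw.default <- bw.nrd0(plength)
bw.sj      <- bw.SJ(plength)
print(c(default = bw.default, SJ = bw.sj))

# The bandwidth actually used is stored in the object density() returns
bw.half   <- density(plength, adjust = 1/2)$bw
bw.double <- density(plength, adjust = 2)$bw
print(c(half = bw.half, double = bw.double))
```

Comparing bw.half and bw.double with bw.default shows directly that adjust acts as a multiplier.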
3. Bivariate Data
A) Bivariate Normal
library(mixtools)
library(mvtnorm)
set.seed(17)
p <- rmvnorm(1000, c(250000, 20000), matrix(c(100000^2, 22000^2, 22000^2, 6000^2),2,2))
plot(p, pch=20, xlim=c(0,500000), ylim=c(0,50000), xlab="Packets", ylab="Flows")
ellipse(mu=colMeans(p), sigma=cov(p), alpha = .05, npoints = 250, col="red")
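With alpha = .05, the red ellipse should enclose roughly 95% of the points. Here is a rough check of that idea, assuming the ellipse is the usual chi-square contour of the fitted normal. The sketch uses mvrnorm from MASS in place of rmvnorm only so that it needs no packages beyond those shipped with R, so the simulated points differ from those plotted above:

```r
library(MASS)   # for mvrnorm

set.seed(17)
mu    <- c(250000, 20000)
Sigma <- matrix(c(100000^2, 22000^2, 22000^2, 6000^2), 2, 2)
p     <- mvrnorm(1000, mu = mu, Sigma = Sigma)

# Squared Mahalanobis distance of every point from the sample mean
d2 <- mahalanobis(p, center = colMeans(p), cov = cov(p))

# A point lies inside the (1 - alpha) ellipse when d2 is below the
# chi-square quantile with 2 degrees of freedom
alpha    <- 0.05
inside   <- d2 <= qchisq(1 - alpha, df = 2)
coverage <- mean(inside)
print(coverage)
```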
The next example plots a kernel density estimate of a bivariate normal sample, first as a 3-D perspective surface and then as contours.
# let's first simulate a bivariate normal sample
library(MASS)
bivn <- mvrnorm(1000, mu = c(0, 0), Sigma = matrix(c(1, .5, .5, 1), 2))
# now we do a kernel density estimate
bivn.kde <- kde2d(bivn[,1], bivn[,2], n = 50)
# fancy perspective
persp(bivn.kde, phi = 45, theta = 30, shade = .1, border = NA)
# fancy contour with image
image(bivn.kde); contour(bivn.kde, add = T)
library(rgl)
col1 <- rainbow(length(bivn.kde$z))[rank(bivn.kde$z)]
persp3d(x=bivn.kde, col = col1)
B) Bivariate Data
data(state)
dimnames(state.x77)
[[1]]
 [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"
 [5] "California"     "Colorado"       "Connecticut"    "Delaware"
 [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"
[13] "Illinois"       "Indiana"        "Iowa"           "Kansas"
[17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"
[21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"
[25] "Missouri"       "Montana"        "Nebraska"       "Nevada"
[29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"
[33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"
[37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
[41] "South Dakota"   "Tennessee"      "Texas"          "Utah"
[45] "Vermont"        "Virginia"       "Washington"     "West Virginia"
[49] "Wisconsin"      "Wyoming"

[[2]]
[1] "Population" "Income"     "Illiteracy" "Life Exp"   "Murder"
[6] "HS Grad"    "Frost"      "Area"
illiteracy<-state.x77[,3]
murder<-state.x77[,5]
3a. Scatterplots
The simplest way to call a scatterplot is plot(x, y). Formula notation plot(y~x) is also
useful, as a number of R functions use models with this sort of formulation.
plot(illiteracy,murder)
plot(Murder~Illiteracy, data=as.data.frame(state.x77))
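To illustrate why the formula notation is worth learning, the sketch below reuses one formula for both a plot and a linear model. It assumes you convert state.x77, which is a matrix, to a data frame first, since formula interfaces expect a data frame. The names states and fit are made up for this example:

```r
data(state)

# state.x77 is a matrix; convert it so that data = ... works with a formula
states <- as.data.frame(state.x77)

# Same formula, used first for a plot and then for a linear model
plot(Murder ~ Illiteracy, data = states)
fit <- lm(Murder ~ Illiteracy, data = states)
abline(fit, col = "red")    # add the fitted line to the scatterplot
print(coef(fit))
```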
You can change the plotting symbol (glyph) with the "pch" argument. You can change
the color of the points with the "col" argument. Labels can be added to an existing plot
with the text function.
plot(illiteracy,murder,col="red", pch = 16, xlim=c(0.2,3))
text(illiteracy,murder,labels=state.name)
You can pass a vector of plotting characters or colors and they will be applied to the
individual points. This can be useful for separating points by category. The unclass
function can be helpful in turning a factor vector into a numeric one for this purpose.
state.region
 [1] South         West          West          South         West
 [6] West          Northeast     South         South         South
[11] West          West          North Central North Central North Central
[16] North Central South         South         Northeast     South
[21] Northeast     North Central North Central South         North Central
[26] West          North Central West          Northeast     Northeast
[31] West          Northeast     South         North Central North Central
[36] South         West          Northeast     Northeast     South
[41] North Central South         South         West          Northeast
[46] South         West          South         North Central West
Levels: Northeast South North Central West
unclass(state.region)
 [1] 2 4 4 2 4 4 1 2 2 2 4 4 3 3 3 3 2 2 1 2 1 3 3 2 3 4 3 4 1 1 4 1 2 3 3 2 4 1
[39] 1 2 3 2 2 4 1 2 4 2 3 4
attr(,"levels")
[1] "Northeast"     "South"         "North Central" "West"
plot(illiteracy,murder,pch=unclass(state.region),xlim=c(0.2,3))
plot(illiteracy,murder,col=unclass(state.region),xlim=c(0.2,3),pch=16)
plot(illiteracy,murder,col=unclass(state.region),pch=unclass(state.region),
xlim=c(0.2,3), main = "Murder vs. Illiteracy Rates - U.S. States")
legend("bottomright",levels(state.region),pch=1:4,col=1:4)
identify(illiteracy,murder,state.name)
[1] 1 10 11 18 28 31 34 41
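The numeric codes used above for pch and col are simply the positions of each state's region in levels(state.region). If you would rather pick your own colors than use colors 1 to 4, one option is to index a palette by those codes. In this sketch the palette and the names codes, region.cols, and point.cols are arbitrary illustrations:

```r
data(state)

# Integer codes behind the factor: 1 = first level, 2 = second level, ...
codes <- as.integer(state.region)
print(table(state.region, codes))

# Look up one color per point by indexing a palette with the codes
region.cols <- c("darkgreen", "red", "blue", "orange")
point.cols  <- region.cols[codes]

illiteracy <- state.x77[, 3]
murder     <- state.x77[, 5]
plot(illiteracy, murder, col = point.cols, pch = 16, xlim = c(0.2, 3))
legend("bottomright", levels(state.region), pch = 16, col = region.cols)
```

Passing the same palette to legend keeps the key in step with the points.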
3b. Time Series and Line Plots
You can draw line plots (as for time series) with the plot command as well. Set the "type"
argument to "l" (for lines), "b" (for both points and lines) or "o" (for points overplotted on the lines).
library(lattice)
data(melanoma)
melanoma
   year incidence
1  1936       0.9
2  1937       0.8
:     :         :
37 1972       4.8
plot(incidence~year,data=melanoma,type='l')
plot(incidence~year,data=melanoma,type='b')
plot(incidence~year,data=melanoma,type='o',main="Melanoma Incidence by Year",ylab="melanoma incidence")
4. Trivariate Data – 3-D Scatterplots
frost<-state.x77[,7]
A 3-D scatterplot function is available in the package scatterplot3d. Remember, you'll
need to set the proxy server to install this from campus.
library(scatterplot3d)
scatterplot3d(cbind(illiteracy, murder, frost))
scatterplot3d(cbind(illiteracy, murder, frost), type='h', highlight.3d=T)
[Figure: 3-D scatterplot with axes illiteracy, murder, and frost]
The rgl package produces nice 3d images that can be rotated with your mouse for
superior interpretability. You'll also need to install this package.
library(rgl)
plot3d(illiteracy,frost,murder)
rgl.snapshot("C:\\...\\triscat1.png")
plot3d(illiteracy,frost,murder, type="s")
plot3d(illiteracy,frost,murder, type="s", size=.25)
rgl.snapshot("C:\\...\\triscat2.png")
Note that the rgl window cannot be cut-and-pasted to Word as easily as regular R
graphics. You'll need to call rgl.snapshot as above, to create a .png (image) file. Then
use Insert -> Picture -> From File to import the image into Word.
plot3d(illiteracy,frost,murder, type="s", size=.25, col="red")
rgl.snapshot("C:\\...\\triscat3.png")
plot3d(illiteracy,frost,murder, type="s", size=.25,
col=c("red","yellow","blue","green")[unclass(state.region)])
text3d(illiteracy,frost,murder+.25,texts=state.name)
rgl.snapshot("C:\\...\\triscat4.png")
5. "Hypervariate" Data
William Cleveland noted that just as most of statistics changes fundamentally once we
increase the number of dimensions to 2 or more ("multivariate"), graphics must be done
in a fundamentally different way once we have 4 or more variables. He suggested the
term "hypervariate" for use in this context.
5a. Pairwise scatterplot matrices
pairs(iris[,1:4])
pairs(iris[,1:4],pch=16,col=unclass(species))
state<-state.x77[,2:7]
pairs(state)
pairs(state,pch=16,col=unclass(state.region))
5b. Star Plots
stars(state,key.loc=c(15,1.5))
stars(state,key.loc=c(15,1.5), col.stars=unclass(state.region)+1)
5c. Parallel Coordinate Plots
library(MASS)
parcoord(state,col=unclass(state.region))
legend("topleft",levels(state.region),lty=1,col=1:4)
parcoord(iris[,1:4],col=c(1,2,4)[unclass(species)])
legend("topleft",levels(species),lty=1,col=c(1,2,4))
6. Further (Built-In) Graphics Demos
R has a number of built-in demos of various features and packages. These not only
demonstrate abilities but show you the commands used to create them. You can see a list
with demo(). Two of the most useful are the following.
demo(graphics)
demo(plotmath)