PRESENTATION OF
MULTIVARIATE DATA
CATEGORICAL AND QUANTITATIVE
EXAMPLE
Old Versus Young
The effect of age and education on musical taste
can be investigated by breaking the observations
down into more homogeneous groups. The most
obvious split is by age. There are 1300 older
people and 1000 younger people.
This is almost certainly a result of the way in which
the sample was taken.
EXAMPLE (contd.)
Education Level
Within the old and young groups we can now find
the proportions falling into each of the high and
low education categories.
The young group is clearly more highly educated
than the old group.
EXAMPLE (contd.)
Summary
The result of our “analysis” is a series of tables.
From these tables we can see:
1. There are slightly more old people than young
people in the sampled group.
2. The younger people are more highly educated
than the older ones.
3. The likelihood of listening to classical music
depends on both age and education level.
MOSAIC PLOTS
• Mosaic plots give a graphical representation of
these successive decompositions.
• Counts are represented by rectangles.
• At each stage of plot creation, the rectangles
are split parallel to one of the two axes.
Creating Mosaic Plots
• In order to produce a mosaic plot it is
necessary to have:
– A contingency table containing the data.
– A preferred ordering of the variables, with the
“response” variable last.
Entering the Data
• To enter the data, we must settle on the order in which
the values are entered.
• The order of values in an R array is with the first
subscript varying most quickly, the second subscript
varying next most quickly, etc.
• In the case of the music data we can take the first
subscript to correspond to Age, the second to
Education and the third to Listening.
• The steps are then (i) entering the data, (ii) shaping it as
an array and (iii) labeling the extents.
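The three steps can be sketched in Python as follows. This is only an illustration of the ordering rule (first subscript varying most quickly), not the R code itself. The function name `shape_as_array`, the "Yes"/"No" listening labels, and the order of the levels are assumptions, and the counts 1 to 8 are placeholders, not the music survey data.

```python
from itertools import product

# Old/young and high/low come from the example; the listening labels and
# the order of levels within each variable are assumptions.
DIMNAMES = {
    "Age": ["Old", "Young"],
    "Education": ["High", "Low"],
    "Listening": ["Yes", "No"],
}


def shape_as_array(values, dimnames):
    """Label a flat list of counts, with the first subscript varying fastest."""
    levels = list(dimnames.values())
    expected = 1
    for level in levels:
        expected *= len(level)
    if len(values) != expected:
        raise ValueError(f"expected {expected} values, got {len(values)}")
    table = {}
    # product() varies its LAST argument fastest, so iterate over the
    # variables in reverse and flip each combination back afterwards.
    for value, combo in zip(values, product(*reversed(levels))):
        table[tuple(reversed(combo))] = value
    return table


# (i) enter the data: placeholder counts, NOT the real survey counts
flat = [1, 2, 3, 4, 5, 6, 7, 8]
# (ii) shape it and (iii) label the extents
counts = shape_as_array(flat, DIMNAMES)
for cell, n in counts.items():
    print(cell, n)
```

With this ordering the first two values entered are the Old and Young counts for the first Education and Listening levels, so Age changes on every step, Education on every second step, and Listening on every fourth.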
[Figure: Order for Data Entry]
• Example: Mortality rates aboard the Titanic
were strongly influenced by age, sex, and
passenger class. To compare the mortality
rates of men and women using a mosaic plot,
you would first divide the unit square
according to the overall proportions of males
and females.
TITANIC EXAMPLE
•Roughly 35% of the
passengers were
female, so the first
split of the mosaic
plot is 35/65.
•Next, split each bar
vertically according to
the proportion who
lived and died.
Among females,
67% survived
(coded as 1 on this
plot) and 33% died
(coded as 0). So the
female bar shows
a 67/33 split.
Among males, only
17% survived, so
this bar shows a
17/83 split.
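The two-stage split can be written out as rectangle coordinates on the unit square. The sketch below uses only the percentages quoted above; the function name `mosaic_rectangles` and the (x, y, width, height) layout are assumptions, and the small margins that plotting software usually adds between cells are left out.

```python
def mosaic_rectangles(first_split, second_splits):
    """Return {(group, outcome): (x, y, width, height)} on the unit square."""
    rects = {}
    x = 0.0
    for group, share in first_split.items():
        y = 0.0
        for outcome, proportion in second_splits[group].items():
            # Width is the group's overall share; height is the outcome
            # proportion within that group.
            rects[(group, outcome)] = (x, y, share, proportion)
            y += proportion
        x += share
    return rects


sex = {"Female": 0.35, "Male": 0.65}
survival = {
    "Female": {"Survived": 0.67, "Died": 0.33},
    "Male": {"Survived": 0.17, "Died": 0.83},
}

rects = mosaic_rectangles(sex, survival)
# The area of each rectangle is that cell's share of all passengers.
areas = {cell: w * h for cell, (x, y, w, h) in rects.items()}

deaths = areas[("Male", "Died")] + areas[("Female", "Died")]
survivors = areas[("Male", "Survived")] + areas[("Female", "Survived")]
male_share_of_deaths = areas[("Male", "Died")] / deaths
male_share_of_survivors = areas[("Male", "Survived")] / survivors

print(f"Males among deaths:    {male_share_of_deaths:.0%}")
print(f"Males among survivors: {male_share_of_survivors:.0%}")
```

Because area encodes the count in each cell, comparing the two "Died" rectangles (or the two "Survived" rectangles) shows directly which sex makes up most of each outcome.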
• Most implementations of the mosaic plot
offer as a default a small margin around each
cell to make the graph easier to read.
•This plot shows you that males
were the majority of the deaths
and the minority of the
survivors.
•As a general recommendation,
variables that represent an
exposure or treatment status
should usually represent the first
split and variables that
represent an outcome should
represent the second split.
• Here is a mosaic plot looking at the
relationship between passenger class and
mortality.
The survival rate is best
among first class
passengers and worst
among third class
passengers.
JITTERED SCATTERPLOT
• The jitter function adds a small random
quantity to the data coordinates thus serving
to separate the overplotted points.
Jittering
Provides a clearer view of overlapping points.
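A minimal sketch of the idea in Python follows. It is not R's `jitter` function: the explicit `amount` argument, the uniform noise, and the sample `ratings` values are assumptions made for illustration.

```python
import random


def jitter(values, amount, rng=None):
    """Add uniform noise in [-amount, +amount] to each value."""
    rng = rng or random.Random()
    return [v + rng.uniform(-amount, amount) for v in values]


# Hypothetical discrete scores: many points would overplot at 1, 2 and 3.
ratings = [1, 1, 1, 2, 2, 3, 3, 3, 3]
jittered = jitter(ratings, amount=0.2, rng=random.Random(0))

for original, moved in zip(ratings, jittered):
    print(original, round(moved, 3))
```

The amount should stay small relative to the spacing of the data values so that the points separate without appearing to belong to a neighbouring value.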
Simpson's Paradox
• Sometimes averages can be misleading.
Sometimes they don’t make any sense. Be
careful when averaging different variables.
• Let’s see the Centerville sign.
Entering Centerville
Established   1793
Population    7943
Elevation      710
Average       3482
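The arithmetic behind the sign's last line can be checked in a few lines. The dictionary name `sign` is just a label chosen for this sketch.

```python
# The three figures on the sign: a year, a head count and a height.
sign = {"Established": 1793, "Population": 7943, "Elevation": 710}

average = sum(sign.values()) / len(sign)
print(average)  # 3482.0
```

The calculation is correct, but the result has no meaning because the three quantities are measured in different units.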
Simpson's Paradox - When Big Data
Sets Go Bad
By Smita Skrivanek, Principal Statistician, MoreSteam.com LLC
• It's a well accepted rule of thumb that the larger
the data set, the more reliable the conclusions
drawn.
• Simpson's paradox, however, demonstrates that a
great deal of care has to be taken when
combining small data sets into a large one.
Sometimes conclusions from the large data set
are exactly the opposite of the conclusions from
the smaller sets. Unfortunately, the conclusions
from the large set are also usually wrong.
EXAMPLE
• You’re in charge of a study that compares how two
weight loss techniques – Diet and Exercise – affect the
weight loss of overweight patients.
• Overall, you had 240 patients participate in the study,
with 120 assigned to a weight‐loss diet and the
remaining 120 assigned to a supervised exercise
regimen.
• At the end of 30 days, you measured each group’s
weight loss. The data showed that 70 dieters and 57
exercisers lost significant weight, representing 58% in
the diet group and only 48% in the exercise group – a
significant difference. So, should you conclude that diet
is better than exercise?
• No, and this is why Simpson’s Paradox can be
so tricky! When the data are stratified instead
by the starting Body Mass Index (BMI) of the
participants, as shown below, a clearer picture
emerges:
• When examined by BMI group, you can clearly
see that the percentage of patients who lost
weight in each BMI group was smaller among the
dieters than among the exercisers. What differed
between the diet and exercise groups was the
number of obese and severely obese individuals.
Because those numbers were flipped, the
overall percentages of successful weight loss are
reversed (higher for the diet group).
• Simpson’s Paradox at Work: The percentage of
patients who lost weight was higher for exercisers
among both obese and severely obese patients,
but when you aggregate the two groups, the
dieters appear to do better.
• Why Did this Happen?
• Two factors are at play here. First, there is an
overlooked confounding variable (BMI), and
second, a disproportionate allocation of BMI
levels among the experimental (diet and
exercise) groups. We do not know the reason
for the disproportionate allocation, but we
might guess that the patients somehow
self‐selected into the two groups.
The proportions of dieters and exercisers in each BMI group
The proportions of weight loss and non‐weight loss patients
among the different subgroups
It is clear that a higher proportion of exercisers lost weight in each
BMI group but that in the aggregated sample the proportions are
reversed.
EXAMPLE
• Suppose there are two pilots, Moe and Jill.
Moe argues that he is the better pilot of the
two, since he managed to land 83% of his last
120 flights on time compared with Jill’s 78%.
Let’s look at the data more closely. Here are
the results of their last 120 flights, broken
down by the time of the day they flew:
                       Time of Day
Pilot   Day                   Night                 Overall
Moe     90 out of 100 (90%)   10 out of 20 (50%)    100 out of 120 (83%)
Jill    19 out of 20 (95%)    75 out of 100 (75%)   94 out of 120 (78%)
• Look at the day and night separately. For day flights, Jill had a
95% on-time rate, and Moe only a 90% rate. At night, Jill was on
time 75% of the time, and Moe only 50%. So, Moe is better
“overall”, but Jill is better both during the day and at night.
• The problem here is the unfair averaging over different groups. Jill
has mostly night flights, which are more difficult. So, her
average is heavily influenced by her night-time average. Moe,
on the other hand, benefits from flying mostly during the day,
when the on-time percentage is higher. With their different patterns
of flying conditions, taking an overall average is misleading.
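The reversal can be reproduced from the flight counts given for the two pilots. In this sketch the names `flights` and `on_time_rates` are illustrative choices, not part of the original example.

```python
# (on-time flights, total flights) for each pilot and time of day
flights = {
    "Moe": {"Day": (90, 100), "Night": (10, 20)},
    "Jill": {"Day": (19, 20), "Night": (75, 100)},
}


def on_time_rates(record):
    """Return the on-time rate per time of day plus the pooled overall rate."""
    rates = {time: on_time / total for time, (on_time, total) in record.items()}
    total_on_time = sum(on_time for on_time, _ in record.values())
    total_flights = sum(total for _, total in record.values())
    rates["Overall"] = total_on_time / total_flights
    return rates


rates = {pilot: on_time_rates(record) for pilot, record in flights.items()}
for pilot, pilot_rates in rates.items():
    print(pilot, {time: f"{rate:.0%}" for time, rate in pilot_rates.items()})
```

The overall rate is a weighted average of the day and night rates, with weights equal to each pilot's own share of day and night flights. Those weights differ sharply between the two pilots, which is what allows the overall comparison to point the opposite way from both subgroup comparisons.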
How to Avoid the Paradox
• To avoid spurious results, it is always good
practice to examine whether the relationship in
the aggregated dataset holds up in its subsets,
especially when some groups are less well
represented than others in the data.
• Another way may be to weight the samples
according to their sizes.
• Proper randomization also goes a long way in
minimizing the effects of a lurking variable that
might have been missed.
• Unfortunately, statistical analysis tools are just that –
tools to help you organize and analyze the observed
data. They cannot tell you anything about data that
were not observed or not included in the analysis.
• So it is very important to involve a cross‐functional
team and especially subject matter experts and
practitioners in the initial planning and selection of the
variables to be measured. After they collect the data,
the only way to try to avoid this pitfall is to visually and
otherwise examine meaningful subsets of the data.
• If you don’t have the option of planning the study but
are given the data from a database and asked to “find
what you can”, the lesson of Simpson's Paradox is to
always look at the data at several levels of aggregation,
as in the example above.
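One possible reading of the weighting suggestion is direct standardization: apply the same subgroup weights to every group being compared, so that a difference in the mix of subgroups cannot drive the result. The sketch below applies this to the pilot data from the earlier example. The technique, the choice of weights (each time of day's share of all flights by both pilots), and the function name `standardized_rate` are assumptions of this sketch, not something the source prescribes.

```python
# (on-time flights, total flights) for each pilot and time of day
flights = {
    "Moe": {"Day": (90, 100), "Night": (10, 20)},
    "Jill": {"Day": (19, 20), "Night": (75, 100)},
}


def standardized_rate(record, weights):
    """Weighted average of subgroup rates using a common set of weights."""
    return sum(weights[s] * on_time / total for s, (on_time, total) in record.items())


# Common weights: each subgroup's share of all flights, both pilots combined.
subgroup_totals = {}
for record in flights.values():
    for subgroup, (_, total) in record.items():
        subgroup_totals[subgroup] = subgroup_totals.get(subgroup, 0) + total
grand_total = sum(subgroup_totals.values())
weights = {s: n / grand_total for s, n in subgroup_totals.items()}

adjusted = {pilot: standardized_rate(rec, weights) for pilot, rec in flights.items()}
for pilot, rate in adjusted.items():
    print(pilot, f"{rate:.0%}")
```

With equal weight on day and night flights, the adjusted rates agree with the subgroup comparisons: Jill comes out ahead of Moe.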
MULTIVARIATE CONTINUOUS DATA
• Matrix scatter plots
• Comparative boxplots
• Comparative violin plots
• 3D graphics
MATRIX SCATTER PLOTS
• A scatterplot matrix extends the scatterplot to
multidimensional data: a scatterplot for each
pair of attributes is placed in a matrix, so that
the correlation information among all the
attributes can be seen at once.
• We can easily observe patterns in the
relationships between pairs of attributes from
the matrix.
MATRIX SCATTER PLOTS
A scatterplot matrix for 5-dimensional data of 400 automobiles
Automobiles are color-coded by the number of cylinders. Manufacturers can
analyze the performance of the cars based on the number of cylinders for
improvements, while customers can decide how many cylinders they need
in order to suit their needs.
LIMITATIONS OF SCATTER PLOTS
• There may be important patterns in higher
dimensions that are barely recognizable in a
scatterplot matrix.
• It becomes chaotic when the number of points,
that is, the number of data items, is too large.
• In that case brushing can be applied to address
this problem. Brushing aids interpretation by
highlighting a particular n-dimensional subspace
in the visualization, that is, the respective points
of interest are colored or highlighted in each
scatterplot in the matrix.
Mixed Binary-Continuous Plots
• We might be interested in knowing:
• How some binary variable Y covaries with
some continuous variable X.
• How some binary variable Y is different for
different values of some binary variable X.
• We could use a standard scatterplot but . . .
A better approach is to add a smoothed representation of the relationship,
one that reflects the “density” of the data at various points on the X-axis:
a locally weighted regression (lowess) line.
The lowess line represents something like the predicted value of the Y-axis
variable at and around (conditional on) that value of X.
The line gives an idea of the general “shape” of the data.
• When we have a continuous dependent variable and a binary
independent variable, we need to adopt another approach.
• Let's examine whether there are any observable differences in
adult HIV/AIDS prevalence rates between Saharan and sub-Saharan Africa.
We can see that sub-Saharan Africa:
• Has, on average, higher HIV/AIDS rates.
• Has greater variation in infection rates.
Contour Plots, Surface Plots, and other
3-D Plots
• Suppose we want to look at Muslim population percentage,
HIV rates, and literacy rates all at once.
• We could produce a contour plot (a representation of a
three-dimensional graph in two dimensions).
• The contours tell us the level of HIV in those “regions” defined by the contour lines.
• In general, we see the highest level of HIV rates in countries with high literacy rates
and low Muslim populations.
• We could also use a three-dimensional scatter
plot using R's scatterplot3d package.
LOTS OF VARIABLES
• Things become much harder when we have four
or more variables. You might decide to
dichotomize or discretize one or more of your
variables.
• Suppose we want to know whether the
relationship between the Muslim percentage of
the population and HIV rates is moderated by the
presence of civil wars and country size.
• We could divide the countries into big and small
and produce a four-way scatterplot.
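Dichotomizing a continuous variable and forming the panels can be sketched as below. Everything in the block is illustrative: the records are invented placeholders (not real country data), the field names are hypothetical, and splitting at the median is an assumption, since the source does not say where the big/small cutoff lies.

```python
from collections import defaultdict
from statistics import median

# Invented placeholder records for illustration only, NOT real data.
countries = [
    {"name": "A", "population": 2.1, "civil_war": False, "muslim_pct": 10, "hiv_rate": 12.0},
    {"name": "B", "population": 45.0, "civil_war": True, "muslim_pct": 60, "hiv_rate": 3.0},
    {"name": "C", "population": 8.3, "civil_war": True, "muslim_pct": 85, "hiv_rate": 1.0},
    {"name": "D", "population": 120.5, "civil_war": False, "muslim_pct": 50, "hiv_rate": 4.0},
    {"name": "E", "population": 0.9, "civil_war": False, "muslim_pct": 30, "hiv_rate": 9.0},
    {"name": "F", "population": 31.7, "civil_war": True, "muslim_pct": 5, "hiv_rate": 6.0},
]


def dichotomize(values, cutoff=None):
    """Code each value as 1 if above the cutoff (default: the median), else 0."""
    cutoff = median(values) if cutoff is None else cutoff
    return [int(v > cutoff) for v in values]


big = dichotomize([c["population"] for c in countries])

# One list of (x, y) points per panel, keyed by (civil war, big country).
panels = defaultdict(list)
for country, is_big in zip(countries, big):
    panels[(country["civil_war"], is_big)].append(
        (country["muslim_pct"], country["hiv_rate"])
    )

for key in sorted(panels):
    print(key, panels[key])
```

Each key identifies one panel of the four-way plot, and the points stored under it are what would be drawn in that panel.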
[Figure: four-way scatterplot of HIV/AIDS rate against Muslim % of the population, with panels by civil war (True/False) and country size]
• 0 and 1 denote “little” and “big” countries respectively.
• “TRUE” and “FALSE” denote values for civil war.
• The negative HIV-Muslim population relationship holds for
small countries but not large ones, and there are no appreciable
differences between countries with internal conflict and those
without.