Describing Numerical Data

1. Percentages
A percentage is simply a proportion multiplied by 100. Percentages make the
expression of the proportion easier and are used in three main ways:
i) To indicate the size of a subgroup.
e.g. 20 % of cars sold in 1996 were Fords.
ii) To express a change over time.
e.g. Salaries increased by 5 per cent between 2003 and 2004.
iii) To express a change between two percentages
e.g. Interest rates rose by 1 percentage point.
In order to ensure that percentages are useful, it is vital that the actual figures
behind the percentages are quoted. It is also important to appreciate the
difference between a percentage change and a percentage point change.
The following example shows the three measures given above.
Example 1: Weekly Shopping Bill and Salary
          Weekly Shopping Bill   Weekly Salary   % of salary spent on shopping
Week 1    £30                    £120            30/120*100 = 25%
Week 2    £50                    £120            50/120*100 = 41.7%
a) In week 1, the shopping bill took 25% of the salary and in week 2 it took
just under 42% of the salary.
b) The percentage increase, between weeks 1 and 2, of salary spent on
shopping was:
(50 - 30) / 30 * 100 = 66.7%
c) The percentage point change between weeks 1 and 2 was 16.7
percentage points:
41.7% - 25.0% = 16.7 percentage points
All three measures are equally valid and accurately represent the change in
spending between week 1 and week 2. The measure that is used depends on
the point being made and the statistic being used. If the statistic is itself a
percentage, such as the employment or unemployment rate (the number
unemployed as a percentage of all working age people) it is normally the
percentage point difference between two unemployment rates that is of
interest. If there is more interest in comparing the numbers of unemployed
people (the level of unemployment) then a simple percentage change is
appropriate.
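The three measures in Example 1 can be reproduced with a few lines of Python. This is only a sketch: the function names are chosen here for illustration and do not come from the original text.

```python
def percentage(part, whole):
    """A proportion multiplied by 100."""
    return part / whole * 100


def percentage_change(old, new):
    """Relative change from old to new, as a percentage of old."""
    return (new - old) / old * 100


def percentage_point_change(old_pct, new_pct):
    """Simple difference between two percentages."""
    return new_pct - old_pct


salary = 120                      # weekly salary in pounds (Example 1)
week1 = percentage(30, salary)    # share of salary spent in week 1
week2 = percentage(50, salary)    # share of salary spent in week 2

print(round(week1, 1), round(week2, 1))                  # 25.0 41.7
print(round(percentage_change(30, 50), 1))               # 66.7
print(round(percentage_point_change(week1, week2), 1))   # 16.7
```

Keeping the percentage change and the percentage point change in separate functions makes it harder to quote one when the other is meant.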
2. Probability Distributions
When presented with a new or revised data set it makes sense to plot the
data in order to check that it behaves as expected and to understand its
properties. This would involve examining three main aspects of the data
distribution:

• Location: check the location of the centre of the data on the x-axis.
• Shape: is there one or more peaks in the data, and are there approximately
  equal numbers of measurements on either side of the peak?
• Outliers: are there any measurements which are much larger/smaller than
  the rest? These may be anomalies within the data set and should be
  examined closely.
[Figure: three distribution shapes: (a) skewed to the right, (b) skewed to the left, (c) normal distribution (symmetrical), with μ and μ+σ marked on the x-axis]
(a) A distribution is skewed to the right when its longer tail extends to the
right, so that most of the measurements lie to the right hand side of the
peak value.
(b) A distribution is skewed to the left when its longer tail extends to the
left, so that most of the measurements lie to the left hand side of the
peak value.
(c) A distribution is symmetrical when there are an identical number of
measurements on either side of the peak value.
Many random variables in nature exhibit a bell-shaped curve as in (c) above.
This is known as a normal probability distribution and is widely used in
statistical analysis. The normal distribution is symmetrical about its mean, μ.
The total area under the normal probability distribution is 1, therefore the
symmetry means that the area to the right of the mean is equal to 0.5 and the
area to the left also equals 0.5. The shape of the distribution is determined
by the standard deviation, σ.


• Large values of σ lead to a wider distribution with a lower height.
• Small values of σ lead to narrow, tall distributions.
Many commonly used statistical tests assume that the data exhibits an
underlying normal probability distribution.
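The effect of σ on the shape, and the 0.5 area on each side of the mean, can be checked numerically. The sketch below uses `NormalDist` from Python's standard `statistics` module; the values μ = 0, σ = 1 and σ = 2 are arbitrary choices made for illustration.

```python
from statistics import NormalDist

narrow = NormalDist(mu=0, sigma=1)   # small standard deviation
wide = NormalDist(mu=0, sigma=2)     # large standard deviation

# The peak of a normal curve is at the mean, so the density there is its height.
peak_narrow = narrow.pdf(0)
peak_wide = wide.pdf(0)
print(round(peak_narrow, 4), round(peak_wide, 4))   # 0.3989 0.1995

# The area to the left of the mean is 0.5 regardless of sigma.
area_left = wide.cdf(0)
area_right = 1 - area_left
print(area_left, area_right)                         # 0.5 0.5
```

Doubling σ halves the height of the peak, because the total area under the curve must stay equal to 1.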
3. Measures of Central Value
There are three main ways to determine the central value in a data set – the
mean, median and mode. It is difficult to say which is the most useful as it
depends on the properties of the dataset in question. Regardless of which
measure is used it is important to clearly state which measure has been used
when reporting the results of analysis.
The mode is useful when data is recorded on a nominal scale as it simply
counts the number of responses in each category and reports the most
frequent. However, the mode is less efficient than the mean and median for
numeric data.
The mean and median are more difficult to distinguish between as they are
often very similar, especially if the data follows a normal distribution. The
median is generally the more useful of the two when dealing with small data
sets which contain extreme values.
Example 2
Dataset 1    Dataset 2
25,000       25,000
30,000       30,000
35,000       1,000,000
The median in each set of data is 30,000 but the mean values differ
considerably. For dataset 1 the mean is 30,000 whereas for dataset 2 the
mean is approximately 351,667. This shows that the mean is more sensitive
to the extreme value of 1,000,000 in dataset 2 than the median.
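The figures in Example 2 can be verified with Python's standard `statistics` module. This is a minimal sketch; the variable names are chosen here for illustration.

```python
from statistics import mean, median

dataset_1 = [25_000, 30_000, 35_000]
dataset_2 = [25_000, 30_000, 1_000_000]

for name, data in [("Dataset 1", dataset_1), ("Dataset 2", dataset_2)]:
    # The median depends only on the middle position; the mean uses every value.
    print(name, "median:", median(data), "mean:", round(mean(data)))
# Dataset 1 median: 30000 mean: 30000
# Dataset 2 median: 30000 mean: 351667
```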
However, for larger samples the mean and median tend to be more similar.
Here, the mean is usually the better choice as it makes use of the actual
values in the data set rather than just their relative positions. Furthermore,
the mean is easier to compute as it has a simple algebraic formula and the
values do not need to be ordered. The diagrams below provide a visual
representation of the position of each of these measures of central value for
normally distributed data and for skewed data.
[Figure: positions of the mean, median and mode for a normal distribution (mean = median = mode), a right skewed distribution and a left skewed distribution]
For data which follows an exactly normal distribution, the mean, median and
mode are equal. However, in real life this rarely occurs and there are slight
differences between the three measures.
For skewed data, the median is typically positioned between the mean and the
mode because it is the ‘halfway point’. The mode corresponds to the
peak of the distribution as it represents the most common value. The mean,
however, moves away from the median in the direction of the tail because of
its tendency to be affected by extreme values. Hence, with a right-skewed
distribution, the mean tends to be greater than the median because it is
pulled in the direction of the small number of large values. Similarly, with a
left-skewed distribution, the mean tends to be smaller than the median
because it is pulled towards the extremely small values.


Left skewed data:     Mean < Median < Mode
Right skewed data:    Mean > Median > Mode
A good illustration of this effect is income data. The mean is not always the
best measure of average income because it is overly-influenced by a small
number of very wealthy individuals and is not representative of the ‘typical’
level of income earned by the majority of individuals. In this case, the median
provides a better measure of ‘average’ income.
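The ordering of the three measures under skew can be seen on a small sample. The numbers below are made up purely for illustration (they are not real income data): most values are small and a few are large, giving a right-skewed sample, and its mirror image gives a left-skewed one.

```python
from statistics import mean, median, mode

# Hypothetical right-skewed sample: many small values, a few large ones.
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 19]
# Mirror image of the same sample, which is left-skewed.
left_skewed = [20 - x for x in right_skewed]


def summarise(data):
    """Return (mean, median, mode) for a list of numbers."""
    return mean(data), median(data), mode(data)


print(summarise(right_skewed))   # mean 5 > median 3 > mode 2
print(summarise(left_skewed))    # mean 15 < median 17 < mode 18
```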
4. Measures of Dispersion
Measures of dispersion indicate how much variation or spread there is across
the data values. This is a very important measure when used in conjunction
with the mean as the two measures combined give a good description of the
data.
Example 3 - Dispersion
Dataset A     Dataset B
99            50
100           100
100           100
101           150
mean = 100    mean = 100
It is clear in the above example that there is much greater variation in the
values of Dataset B than in those of Dataset A, even though both means are
100. Therefore some quantitative measure of variation amongst the data
values needs to be calculated.
The simplest measure is the range of the data values - the difference
between the largest and smallest values. However, this is a relatively crude
measure. Example 4 shows a possible problem.
Example 4: Number of books read per student
Student              A   B   C   D   E   F   G   H   I   J
No. History Books    1   5   5   6   6   6   6   7   7   11
No. Physics Books    1   2   2   2   3   9   9   10  11  11
The mean number of history books read by students is 6.0 and the range is
10.0. Similarly, the mean and range for the number of physics books read are
also 6.0 and 10.0. This would suggest that the two subjects have relatively
similar reading requirements. However, it would be wrong to describe the two
classes as being similar. In the History class the variation about the mean
value of 6.0 appears to be less pronounced than in the Physics class. We
therefore need a method for calculating the variation about the mean.
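The mean and range quoted for Example 4 can be checked as follows. This is a small sketch; `value_range` is a helper name introduced here, not something from the original text.

```python
from statistics import mean

history = [1, 5, 5, 6, 6, 6, 6, 7, 7, 11]     # books read, students A to J
physics = [1, 2, 2, 2, 3, 9, 9, 10, 11, 11]


def value_range(data):
    """Difference between the largest and smallest values."""
    return max(data) - min(data)


print(mean(history), value_range(history))   # 6 10
print(mean(physics), value_range(physics))   # 6 10
```

Both summaries are identical, even though the individual values are spread very differently around the mean.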
Mean Deviation and Variance
In order to measure the variation in the data, it is useful to measure the
difference between the mean and each of the data items. The greater the
variation, the larger the differences between the mean and the individual
items.
Example 5
Data    Deviation from Mean
1       1 - 4 = -3
5       5 - 4 = 1
6       6 - 4 = 2
MEAN = 4
The average of these deviations is (-3 + 1 + 2)/3 = 0, because the positive
and negative deviations cancel out. Therefore a method must be used which
measures the deviation but ensures that the average of the deviations is not
zero. The common method involves squaring the deviations and taking an
average of these squared values. This quantity is called the variance and
can be written as:
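Variance = Sum of (x - mean)^2 / No. of data values
or
\[ \sigma^2 = \frac{\sum (x - \mu)^2}{n} \]
where x are the data values, μ is the mean and n is the number of data values.
For Example 5:
\[ \frac{(-3)^2 + (1)^2 + (2)^2}{3} = \frac{14}{3} \approx 4.67 \]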
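The calculation for Example 5 can be followed step by step in Python. This is a sketch written out by hand so that each stage is visible; the variable names are illustrative only.

```python
data = [1, 5, 6]
n = len(data)
mu = sum(data) / n                          # mean = 4.0

deviations = [x - mu for x in data]         # [-3.0, 1.0, 2.0]
mean_deviation = sum(deviations) / n        # positive and negative cancel: 0.0

squared = [d ** 2 for d in deviations]      # [9.0, 1.0, 4.0]
variance = sum(squared) / n                 # 14 / 3

print(deviations, mean_deviation)           # [-3.0, 1.0, 2.0] 0.0
print(round(variance, 2))                   # 4.67
```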
The variance is useful when comparing data sets. For example if the salaries
of two groups of people (groups A and B) are compared and the variance of
group A is larger than B then we know that the salaries in group A deviate
more from the mean than those of group B. This indicates how well the mean
value represents the dataset.
A common practice is to take the square root of the variance. This value is
known as the standard deviation. In example 5 above, the standard
deviation is 2.16. Example 6, below, shows the variance and standard
deviations for students’ reading habits.
Example 6
Student              A   B   C   D   E   F   G   H   I   J
No. History Books    1   5   5   6   6   6   6   7   7   11
Mean = 6.0 Range = 10.0 Variance = 5.4 Std. Deviation = 2.3
No. Physics Books    1   2   2   2   3   9   9   10  11  11
Mean = 6.0 Range = 10.0 Variance = 16.6 Std. Deviation = 4.07
The variance and standard deviation statistics clearly show that the
reading levels in Physics are a lot more varied than those in History.
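The variance and standard deviation figures for the two classes can be reproduced with the standard `statistics` module. Note one assumption made explicit here: the text divides by the number of data values n, which corresponds to `pvariance` and `pstdev` (the population versions). The functions `variance` and `stdev` divide by n - 1 and would give slightly larger results.

```python
from statistics import pvariance, pstdev

history = [1, 5, 5, 6, 6, 6, 6, 7, 7, 11]
physics = [1, 2, 2, 2, 3, 9, 9, 10, 11, 11]

for name, books in [("History", history), ("Physics", physics)]:
    # Population variance: mean of squared deviations from the mean.
    # Population standard deviation: square root of that variance.
    print(name, round(pvariance(books), 1), round(pstdev(books), 2))
# History 5.4 2.32
# Physics 16.6 4.07
```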
5. Measure of Position
Percentiles are a useful measure for describing where a data value lies
relative to the other values in the data set. Percentiles split the data into
100 groups. If a candidate scores 25 marks in an exam, this alone does not
tell you anything about how others performed in the test or whether 25 is a
high or low score. However, if that score is at the 99th percentile, this
means that 99% of the people taking the test scored lower than 25 marks.
Quartiles are also used to describe data sets. Quartiles split the data set
into 4 groups, and the upper boundaries of these groups are the 25th
percentile, 50th percentile, 75th percentile and the 100th percentile. The
50th percentile is also known as the median.
Student   Score   Percentile
A         23      5
B         25      10
C         30      15
D         30      20
E         34      25     25th Percentile (1st Quartile)
F         37      30
G         46      35
H         50      40
I         53      45
J         58      50     Median (2nd Quartile)
K         62      55
L         68      60
M         72      65
N         74      70
O         78      75     75th Percentile (3rd Quartile)
P         81      80
Q         86      85
R         90      90
S         92      95
T         98      100    100th Percentile (4th Quartile)
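The percentile column in the table can be reproduced with a short sketch. It assumes the convention the table appears to use, namely that the k-th lowest of n scores sits at the (k / n × 100)th percentile; the function names are introduced here for illustration only.

```python
import math

# Exam scores for students A to T, already in ascending order.
scores = [23, 25, 30, 30, 34, 37, 46, 50, 53, 58,
          62, 68, 72, 74, 78, 81, 86, 90, 92, 98]


def percentile_of_position(k, n):
    """Percentile of the k-th lowest value (k starts at 1) among n values."""
    return k / n * 100


def score_at_percentile(sorted_scores, p):
    """Score sitting at the p-th percentile under the same convention."""
    k = math.ceil(p * len(sorted_scores) / 100)
    return sorted_scores[k - 1]


n = len(scores)
print([percentile_of_position(k, n) for k in (1, 5, 10, 20)])
# [5.0, 25.0, 50.0, 100.0]
print([score_at_percentile(scores, p) for p in (25, 50, 75, 100)])
# [34, 58, 78, 98]
```

Other conventions interpolate between neighbouring scores, so library routines such as `statistics.quantiles` may return slightly different quartile values for the same data.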
[Figure: Location of Quartiles. The lower quartile (Q1), the median (m) and the upper quartile (Q3) divide the distribution into four parts, each containing 25% of the data.]