Lecture Notes 4
Convergence (Chapter 5)
1 Random Samples
Let X1 , . . . , Xn ∼ F . A statistic is any function Tn = g(X1 , . . . , Xn ). Recall that the sample
mean is
    X̄n = (1/n) ∑_{i=1}^n Xi
and sample variance is
    Sn² = (1/(n − 1)) ∑_{i=1}^n (Xi − X̄n)².
Let µ = E(Xi) and σ² = Var(Xi). Recall that

    E(X̄n) = µ,    Var(X̄n) = σ²/n,    E(Sn²) = σ².
Theorem 1 If X1, . . . , Xn ∼ N(µ, σ²) then X̄n ∼ N(µ, σ²/n).
Proof. We know that MXi(s) = exp(µs + σ²s²/2). So,

    MX̄n(t) = E(e^{tX̄n}) = E(e^{(t/n) ∑_{i=1}^n Xi}) = (E e^{tXi/n})^n = (MXi(t/n))^n
            = (exp(µt/n + σ²t²/(2n²)))^n = exp(µt + σ²t²/(2n)),

which is the mgf of a N(µ, σ²/n).

Example 2 Let X(1), . . . , X(n) denote the ordered values:
X(1) ≤ X(2) ≤ · · · ≤ X(n) .
Then X(1) , . . . , X(n) are called the order statistics. And Tn = (X(1) , . . . , X(n) ) is a statistic.
2 Convergence
Let X1 , X2 , . . . be a sequence of random variables and let X be another random variable.
Let Fn denote the cdf of Xn and let F denote the cdf of X. We are going to study different
types of convergence.
Example: A good example to keep in mind is the following. Let Y1, Y2, . . . be a sequence of iid random variables. Let

    Xn = (1/n) ∑_{i=1}^n Yi

be the average of the first n of the Yi's. This defines a new sequence X1, X2, . . . . In other words, the sequence of interest X1, X2, . . . might be a sequence of statistics based on some other sequence of iid random variables. Note that the original sequence Y1, Y2, . . . is iid but the sequence X1, X2, . . . is not iid.
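To make this construction concrete, here is a minimal Python sketch (assuming numpy is available) that builds the sequence of running averages from an iid sequence; the N(0, 1) choice for the Yi and the length n = 1000 are arbitrary illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    n = 1000                                 # illustrative length (assumption)
    Y = rng.normal(0.0, 1.0, size=n)         # Y1, ..., Yn iid N(0, 1)

    # X_k = (Y1 + ... + Yk) / k : the running sample means.
    X = np.cumsum(Y) / np.arange(1, n + 1)

    # The X_k share the same underlying Y's, so they are neither independent
    # nor identically distributed (e.g., Var(X_k) = 1/k shrinks with k).
    print(X[:5])     # first few terms of the new sequence
    print(X[-1])     # X_n should be near E(Y1) = 0 for large n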
1. Xn converges almost surely to X, written Xn →a.s. X, if, for every ε > 0,

       P( lim_{n→∞} |Xn − X| < ε ) = 1.    (1)

   Xn converges almost surely to a constant c, written Xn →a.s. c, if

       P( lim_{n→∞} Xn = c ) = 1.    (2)
2. Xn converges to X in probability, written Xn →P X, if, for every ε > 0,

       P(|Xn − X| > ε) → 0    (3)

   as n → ∞. In other words, Xn − X = oP(1).

   Xn converges to c in probability, written Xn →P c, if, for every ε > 0,

       P(|Xn − c| > ε) → 0    (4)

   as n → ∞. In other words, Xn − c = oP(1).
3. Xn converges to X in quadratic mean (also called convergence in L2), written Xn →qm X, if

       E(Xn − X)² → 0    (5)

   as n → ∞.

   Xn converges to c in quadratic mean, written Xn →qm c, if

       E(Xn − c)² → 0    (6)

   as n → ∞.
4. Xn converges to X in distribution, written Xn ⇝ X, if

       lim_{n→∞} Fn(t) = F(t)    (7)

   at all t for which F is continuous.

   Xn converges to c in distribution, written Xn ⇝ c, if

       lim_{n→∞} Fn(t) = δc(t)    (8)

   at all t ≠ c, where δc(t) = 0 if t < c and δc(t) = 1 if t ≥ c.
Theorem 3 Convergence in probability does not imply almost sure convergence.
Proof. Let Ω = [0, 1]. Let P be uniform on [0, 1]. We draw S ∼ P. Let X(s) = s and let

    X1 = s + I[0,1](s),
    X2 = s + I[0,1/2](s),    X3 = s + I[1/2,1](s),
    X4 = s + I[0,1/3](s),    X5 = s + I[1/3,2/3](s),    X6 = s + I[2/3,1](s),

etc. Then Xn →P X. But, for each s, Xn(s) does not converge to X(s). Hence, Xn does not converge almost surely to X. In fact, P({s ∈ Ω : limn Xn(s) = X(s)}) = 0.
Example 4 Let Xn ∼ N(0, 1/n). Intuitively, Xn is concentrating at 0 so we would like to say that Xn converges to 0. Let's see if this is true. Let F be the distribution function for a point mass at 0. Note that √n Xn ∼ N(0, 1). Let Z denote a standard normal random variable. For t < 0,

    Fn(t) = P(Xn < t) = P(√n Xn < √n t) = P(Z < √n t) → 0

since √n t → −∞. For t > 0,

    Fn(t) = P(Xn < t) = P(√n Xn < √n t) = P(Z < √n t) → 1

since √n t → ∞. Hence, Fn(t) → F(t) for all t ≠ 0 and so Xn ⇝ 0. Notice that Fn(0) = 1/2 ≠ F(0) = 1, so convergence fails at t = 0. That doesn't matter because t = 0 is not a continuity point of F and the definition of convergence in distribution only requires convergence at continuity points.
Now consider convergence in probability. For any ε > 0, using Markov's inequality,

    P(|Xn| > ε) = P(|Xn|² > ε²) ≤ E(Xn²)/ε² = 1/(nε²) → 0

as n → ∞. Hence, Xn →P 0.
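A quick numerical check of this example, as a minimal Python sketch assuming numpy; ε = 0.1 and the grid of sample sizes are arbitrary illustrative choices.

    import numpy as np

    rng = np.random.default_rng(1)
    eps = 0.1                                   # arbitrary epsilon

    for n in [10, 100, 1000, 10000]:
        x = rng.normal(0.0, np.sqrt(1.0 / n), size=100_000)   # draws of Xn ~ N(0, 1/n)
        print(n, np.mean(np.abs(x) > eps))      # Monte Carlo estimate of P(|Xn| > eps)
    # The estimates shrink toward 0, consistent with Xn ->P 0 (and Xn ⇝ 0).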
The next theorem gives the relationship between the types of convergence.
Theorem 5 The following relationships hold:
(a) Xn →qm X implies that Xn →P X.

(b) Xn →P X implies that Xn ⇝ X.

(c) If Xn ⇝ X and if P(X = c) = 1 for some real number c, then Xn →P X.

(d) Xn →a.s. X implies Xn →P X.

In general, none of the reverse implications hold except the special case in (c).
Proof. We start by proving (a). Suppose that Xn →qm X. Fix ε > 0. Then, using Markov's inequality,

    P(|Xn − X| > ε) = P(|Xn − X|² > ε²) ≤ E|Xn − X|²/ε² → 0.
Proof of (b). Fix ε > 0 and let x be a continuity point of F. Then

    Fn(x) = P(Xn ≤ x) = P(Xn ≤ x, X ≤ x + ε) + P(Xn ≤ x, X > x + ε)
          ≤ P(X ≤ x + ε) + P(|Xn − X| > ε)
          = F(x + ε) + P(|Xn − X| > ε).

Also,

    F(x − ε) = P(X ≤ x − ε) = P(X ≤ x − ε, Xn ≤ x) + P(X ≤ x − ε, Xn > x)
             ≤ Fn(x) + P(|Xn − X| > ε).

Hence,

    F(x − ε) − P(|Xn − X| > ε) ≤ Fn(x) ≤ F(x + ε) + P(|Xn − X| > ε).

Take the limit as n → ∞ to conclude that

    F(x − ε) ≤ lim inf_{n→∞} Fn(x) ≤ lim sup_{n→∞} Fn(x) ≤ F(x + ε).

This holds for all ε > 0. Take the limit as ε → 0 and use the fact that F is continuous at x to conclude that limn Fn(x) = F(x).
Proof of (c). Fix ε > 0. Then,

    P(|Xn − c| > ε) = P(Xn < c − ε) + P(Xn > c + ε)
                    ≤ P(Xn ≤ c − ε) + P(Xn > c + ε)
                    = Fn(c − ε) + 1 − Fn(c + ε)
                    → F(c − ε) + 1 − F(c + ε)
                    = 0 + 1 − 1 = 0.
Proof of (d). Omitted.
Let us now show that the reverse implications do not hold.
Convergence in probability does not imply convergence in quadratic mean. Let U ∼ Unif(0, 1) and let Xn = √n I(0,1/n)(U). Then P(|Xn| > ε) = P(√n I(0,1/n)(U) > ε) = P(0 ≤ U < 1/n) = 1/n → 0. Hence, Xn →P 0. But E(Xn²) = n ∫₀^{1/n} du = 1 for all n, so Xn does not converge in quadratic mean.
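The same counterexample can be checked numerically; the following is a minimal Python sketch assuming numpy, with ε = 0.1 and the sample sizes chosen only for illustration.

    import numpy as np

    rng = np.random.default_rng(2)
    eps = 0.1                                   # arbitrary epsilon

    for n in [10, 100, 1000]:
        U = rng.uniform(size=200_000)
        Xn = np.sqrt(n) * (U < 1.0 / n)         # Xn = sqrt(n) * I_(0,1/n)(U)
        print(n, np.mean(np.abs(Xn) > eps), np.mean(Xn ** 2))
    # P(|Xn| > eps) is about 1/n -> 0, while E(Xn^2) stays near 1 for every n.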
Convergence in distribution does not imply convergence in probability. Let X ∼ N(0, 1). Let Xn = −X for n = 1, 2, 3, . . .; hence Xn ∼ N(0, 1). Xn has the same distribution function as X for all n so, trivially, limn Fn(x) = F(x) for all x. Therefore, Xn ⇝ X. But P(|Xn − X| > ε) = P(|2X| > ε) = P(|X| > ε/2) ≠ 0. So Xn does not converge to X in probability.

The relationships between the types of convergence can be summarized as follows:
        q.m.
         ↓
a.s. → prob → distribution
Example 6 One might conjecture that if Xn →P b, then E(Xn) → b. This is not true. Let Xn be a random variable defined by P(Xn = n²) = 1/n and P(Xn = 0) = 1 − (1/n). Now, P(|Xn| < ε) = P(Xn = 0) = 1 − (1/n) → 1. Hence, Xn →P 0. However, E(Xn) = [n² × (1/n)] + [0 × (1 − (1/n))] = n. Thus, E(Xn) → ∞.
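A small simulation makes the point vivid; this is a minimal Python sketch assuming numpy, with the sample sizes chosen arbitrarily.

    import numpy as np

    rng = np.random.default_rng(3)

    for n in [10, 100, 1000]:
        u = rng.uniform(size=500_000)
        Xn = np.where(u < 1.0 / n, float(n) ** 2, 0.0)   # P(Xn = n^2) = 1/n, else Xn = 0
        print(n, np.mean(Xn == 0.0), np.mean(Xn))
    # The mass at 0 tends to 1 (so Xn ->P 0), yet the sample average of Xn
    # tracks E(Xn) = n, which diverges.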
Example 7 Let X1, . . . , Xn ∼ Uniform(0, 1). Let X(n) = maxi Xi. First we claim that X(n) →P 1. This follows since

    P(|X(n) − 1| > ε) = P(X(n) ≤ 1 − ε) = ∏i P(Xi ≤ 1 − ε) = (1 − ε)^n → 0.

Also,

    P(n(1 − X(n)) ≤ t) = P(X(n) ≥ 1 − (t/n)) = 1 − (1 − t/n)^n → 1 − e^{−t}.

So n(1 − X(n)) ⇝ Exp(1).
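The Exp(1) limit can be checked by simulation; here is a minimal Python sketch assuming numpy, with n = 50 and the number of replications chosen only for illustration.

    import numpy as np

    rng = np.random.default_rng(4)
    n, reps = 50, 100_000                       # illustrative sizes (assumptions)

    X = rng.uniform(size=(reps, n))
    Xmax = X.max(axis=1)                        # X_(n) for each replication
    T = n * (1.0 - Xmax)                        # n(1 - X_(n))

    # Compare tail probabilities with the Exp(1) limit P(T > t) = exp(-t).
    for t in [0.5, 1.0, 2.0]:
        print(t, np.mean(T > t), np.exp(-t))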
Some convergence properties are preserved under transformations.
Theorem 8 Let Xn , X, Yn , Y be random variables.
(a) If Xn →P X and Yn →P Y, then Xn + Yn →P X + Y.

(b) If Xn →qm X and Yn →qm Y, then Xn + Yn →qm X + Y.

(c) If Xn →P X and Yn →P Y, then Xn Yn →P XY.
In general, Xn ⇝ X and Yn ⇝ Y does not imply that Xn + Yn ⇝ X + Y. But there are cases when it does:

Theorem 9 (Slutzky's Theorem) If Xn ⇝ X and Yn ⇝ c, then Xn + Yn ⇝ X + c. Also, if Xn ⇝ X and Yn ⇝ c, then Xn Yn ⇝ cX.
Theorem 10 (The Continuous Mapping Theorem) Let Xn , X, Yn , Y be random variables. Let g be a continuous function.
(a) If Xn →P X, then g(Xn) →P g(X).

(b) If Xn ⇝ X, then g(Xn) ⇝ g(X).
Exercise: Prove the continuous mapping theorem.
3 The Law of Large Numbers
The law of large numbers (LLN) says that the mean of a large sample is close to the mean
of the distribution. For example, the proportion of heads of a large number of tosses of a
fair coin is expected to be close to 1/2. We now make this more precise.
Let X1, X2, . . . be an iid sample, let µ = E(X1) and σ² = Var(X1). Recall that the sample mean is defined as X̄n = n⁻¹ ∑_{i=1}^n Xi and that E(X̄n) = µ and Var(X̄n) = σ²/n.
Theorem 11 (The Weak Law of Large Numbers (WLLN)) If X1, . . . , Xn are iid, then X̄n →P µ. Thus, X̄n − µ = oP(1).
Interpretation of the WLLN: The distribution of X n becomes more concentrated
around µ as n gets large.
Proof. Assume that σ < ∞. This is not necessary but it simplifies the proof. Using Chebyshev's inequality,

    P(|X̄n − µ| > ε) ≤ Var(X̄n)/ε² = σ²/(nε²),

which tends to 0 as n → ∞.

Theorem 12 (The Strong Law of Large Numbers) Let X1, . . . , Xn be iid with mean µ. Then X̄n →a.s. µ.
The proof is beyond the scope of this course.
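The concentration described by the WLLN is easy to see numerically; the following minimal Python sketch (assuming numpy) uses Uniform(0, 1) data, so µ = 1/2, with ε = 0.02 and the replication count chosen arbitrarily.

    import numpy as np

    rng = np.random.default_rng(5)
    mu, eps = 0.5, 0.02                         # Uniform(0,1) mean; arbitrary epsilon

    for n in [100, 1000, 10000]:
        xbar = rng.uniform(size=(1_000, n)).mean(axis=1)   # 1000 sample means of size n
        print(n, np.mean(np.abs(xbar - mu) > eps))
    # The estimated P(|Xbar_n - mu| > eps) decreases toward 0 as n grows.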
4 The Central Limit Theorem
The law of large numbers says that the distribution of X n piles up near µ. This isn’t enough
to help us approximate probability statements about X n . For this we need the central limit
theorem.
Suppose that X1, . . . , Xn are iid with mean µ and variance σ². The central limit theorem (CLT) says that X̄n = n⁻¹ ∑_i Xi has a distribution which is approximately Normal with mean µ and variance σ²/n. This is remarkable since nothing is assumed about the distribution of Xi, except the existence of the mean and variance.
Theorem 13 (The Central Limit Theorem (CLT)) Let X1, . . . , Xn be iid with mean µ and variance σ². Let X̄n = n⁻¹ ∑_{i=1}^n Xi. Then

    Zn ≡ (X̄n − µ)/√Var(X̄n) = √n (X̄n − µ)/σ ⇝ Z

where Z ∼ N(0, 1). In other words,

    lim_{n→∞} P(Zn ≤ z) = Φ(z) = ∫_{−∞}^z (2π)^{−1/2} e^{−x²/2} dx.
Interpretation: Probability statements about X n can be approximated using a Normal distribution. It’s the probability statements that we are approximating, not the
random variable itself.
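For instance, a probability such as P(X̄n ≤ x) can be approximated by Φ(√n (x − µ)/σ). Here is a minimal Python sketch of that comparison, assuming numpy; the Exponential(1) data (so µ = σ = 1), n = 30, and x = 1.2 are arbitrary illustrative choices.

    import math
    import numpy as np

    def Phi(z):
        # Standard normal cdf via the error function.
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

    rng = np.random.default_rng(6)
    n, mu, sigma, x = 30, 1.0, 1.0, 1.2         # illustrative choices (assumptions)

    xbar = rng.exponential(scale=1.0, size=(200_000, n)).mean(axis=1)
    print("simulated  P(Xbar_n <= x):", np.mean(xbar <= x))
    print("CLT approx P(Xbar_n <= x):", Phi(math.sqrt(n) * (x - mu) / sigma))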
Remark: We often write

    X̄n ≈ N(µ, σ²/n)

as short form for

    √n (X̄n − µ)/σ ⇝ N(0, 1).
Recall that if X is a random variable, its moment generating function (mgf) is ψX(t) = E(e^{tX}). Assume in what follows that the mgf is finite in a neighborhood around t = 0.
Lemma 14 Let Z1, Z2, . . . be a sequence of random variables. Let ψn be the mgf of Zn. Let Z be another random variable and denote its mgf by ψ. If ψn(t) → ψ(t) for all t in some open interval around 0, then Zn ⇝ Z.
Proof of the central limit theorem. Let Yi = (Xi − µ)/σ. Then Zn = n^{−1/2} ∑_i Yi. Let ψ(t) be the mgf of Yi. The mgf of ∑_i Yi is (ψ(t))^n and the mgf of Zn is [ψ(t/√n)]^n ≡ ξn(t). Now ψ′(0) = E(Y1) = 0 and ψ″(0) = E(Y1²) = Var(Y1) = 1. So,

    ψ(t) = ψ(0) + t ψ′(0) + (t²/2!) ψ″(0) + (t³/3!) ψ‴(0) + · · ·
         = 1 + 0 + t²/2 + (t³/3!) ψ‴(0) + · · ·
         = 1 + t²/2 + (t³/3!) ψ‴(0) + · · ·
Now,

    ξn(t) = (ψ(t/√n))^n
          = (1 + t²/(2n) + (t³/(3! n^{3/2})) ψ‴(0) + · · ·)^n
          = (1 + [t²/2 + (t³/(3! n^{1/2})) ψ‴(0) + · · ·]/n)^n
          → e^{t²/2},

which is the mgf of a N(0, 1). The result follows from Lemma 14. In the last step we used the fact that if an → a then

    (1 + an/n)^n → e^a.
The central limit theorem tells us that Zn = √n (X̄n − µ)/σ is approximately N(0, 1). However, we rarely know σ. We can estimate σ² from X1, . . . , Xn by

    Sn² = (1/(n − 1)) ∑_{i=1}^n (Xi − X̄n)².
This raises the following question: if we replace σ with Sn , is the central limit theorem still
true? The answer is yes.
Theorem 15 Assume the same conditions as the CLT. Then,

    Tn = √n (X̄n − µ)/Sn ⇝ N(0, 1).
Proof. Here is a brief proof. We have that

    Tn = Zn Wn

where

    Zn = √n (X̄n − µ)/σ    and    Wn = σ/Sn.

Now Zn ⇝ N(0, 1) and Wn →P 1. The result follows from Slutzky's theorem.
Here is an extended proof.
Step 1. We first show that Rn² →P σ² where

    Rn² = (1/n) ∑_{i=1}^n (Xi − X̄n)².
Note that

    Rn² = (1/n) ∑_{i=1}^n Xi² − ((1/n) ∑_{i=1}^n Xi)².
Define Yi = Xi². Then, using the LLN (law of large numbers),

    (1/n) ∑_{i=1}^n Xi² = (1/n) ∑_{i=1}^n Yi →P E(Yi) = E(Xi²) = µ² + σ².
Next, by the LLN,

    (1/n) ∑_{i=1}^n Xi →P µ.
Since g(t) = t² is continuous, the continuous mapping theorem implies that

    ((1/n) ∑_{i=1}^n Xi)² →P µ².
Thus

    Rn² →P (µ² + σ²) − µ² = σ².
Step 2. Note that

    Sn² = (n/(n − 1)) Rn².

Since Rn² →P σ² and n/(n − 1) → 1, we have that Sn² →P σ².
Step 3. Since g(t) = √t is continuous (for t ≥ 0), the continuous mapping theorem implies that Sn →P σ.
Step 4. Since g(t) = t/σ is continuous, the continuous mapping theorem implies that Sn/σ →P 1.
Step 5. Since g(t) = 1/t is continuous (for t > 0), the continuous mapping theorem implies that σ/Sn →P 1. Since convergence in probability implies convergence in distribution, σ/Sn ⇝ 1.
Step 6. Note that

    Tn = [√n (X̄n − µ)/σ] (σ/Sn) ≡ Vn Wn.

Now Vn ⇝ Z where Z ∼ N(0, 1) by the CLT, and we showed that Wn ⇝ 1. By Slutzky's theorem, Tn = Vn Wn ⇝ Z × 1 = Z.
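Theorem 15 can be checked by simulation; below is a minimal Python sketch assuming numpy, with Exponential(1) data (µ = 1), n = 40, and the replication count chosen only for illustration.

    import numpy as np

    rng = np.random.default_rng(7)
    n, mu = 40, 1.0                             # Exponential(1) has mean 1 (illustrative)

    samples = rng.exponential(scale=1.0, size=(100_000, n))
    xbar = samples.mean(axis=1)
    s = samples.std(axis=1, ddof=1)             # Sn, with the 1/(n-1) convention
    Tn = np.sqrt(n) * (xbar - mu) / s

    print(Tn.mean(), Tn.std())                  # roughly 0 and 1
    print(np.mean(Tn <= 1.645))                 # compare with Phi(1.645), about 0.95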
The next result is very important. It tells us how close the distribution of X̄n is to the Normal distribution.
Theorem 16 (Berry-Esseen Theorem) Let X1, . . . , Xn ∼ P. Let µ = E[Xi] and σ² = Var[Xi]. Assume that µ3 = E[|Xi − µ|³] < ∞. Let

    Fn(z) = P( √n (X̄n − µ)/σ ≤ z ).

Then

    sup_z |Fn(z) − Φ(z)| ≤ (33/4) µ3/(σ³ √n).
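To get a feel for the size of the bound, here is a minimal Python sketch for Bernoulli(p) data, for which µ3 = p(1 − p)[p² + (1 − p)²] and σ = √(p(1 − p)); the value p = 0.1 is an arbitrary illustrative choice.

    import math

    p = 0.1                                      # illustrative choice (assumption)
    sigma = math.sqrt(p * (1 - p))
    mu3 = p * (1 - p) * (p ** 2 + (1 - p) ** 2)  # E|Xi - mu|^3 for Bernoulli(p)

    for n in [100, 1000, 10000]:
        bound = 33.0 / 4.0 * mu3 / (sigma ** 3 * math.sqrt(n))
        print(n, bound)
    # The uniform bound shrinks like 1/sqrt(n); for skewed data (small p)
    # it is only informative once n is fairly large.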
There is also a multivariate version of the central limit theorem. Recall that X = (X1, . . . , Xk)ᵀ has a multivariate Normal distribution with mean vector µ and covariance matrix Σ if

    f(x) = (2π)^{−k/2} |Σ|^{−1/2} exp( −(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ) ).

In this case we write X ∼ N(µ, Σ).
Theorem 17 (Multivariate central limit theorem) Let X1, . . . , Xn be iid random vectors where Xi = (X1i, . . . , Xki)ᵀ with mean µ = (µ1, . . . , µk)ᵀ and covariance matrix Σ. Let X̄ = (X̄1, . . . , X̄k)ᵀ where X̄j = n⁻¹ ∑_{i=1}^n Xji. Then

    √n (X̄ − µ) ⇝ N(0, Σ).
Remark: There is also a multivariate version of the Berry-Esseen theorem but it is more
complicated than the one-dimensional version.
5 The Delta Method
If Yn has a limiting Normal distribution then the delta method allows us to find the limiting
distribution of g(Yn ) where g is any smooth function.
Theorem 18 (The Delta Method) Suppose that

    √n (Yn − µ)/σ ⇝ N(0, 1)

and that g is a differentiable function such that g′(µ) ≠ 0. Then

    √n (g(Yn) − g(µ))/(|g′(µ)| σ) ⇝ N(0, 1).

In other words,

    Yn ≈ N(µ, σ²/n)    implies that    g(Yn) ≈ N(g(µ), (g′(µ))² σ²/n).
Example 19 Let X1, . . . , Xn be iid with finite mean µ and finite variance σ². By the central limit theorem, √n (X̄n − µ)/σ ⇝ N(0, 1). Let Wn = e^{X̄n}. Thus, Wn = g(X̄n) where g(s) = e^s. Since g′(s) = e^s, the delta method implies that Wn ≈ N(e^µ, e^{2µ} σ²/n).
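Example 19 can be checked numerically; the following minimal Python sketch (assuming numpy) uses N(0, 1) data with n = 200, both arbitrary illustrative choices.

    import numpy as np

    rng = np.random.default_rng(8)
    n, mu, sigma = 200, 0.0, 1.0                # illustrative choices (assumptions)

    xbar = rng.normal(mu, sigma, size=(100_000, n)).mean(axis=1)
    Wn = np.exp(xbar)                           # Wn = g(Xbar_n) with g(s) = e^s

    print("simulated mean, var :", Wn.mean(), Wn.var())
    print("delta method approx :", np.exp(mu), np.exp(2 * mu) * sigma ** 2 / n)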
There is also a multivariate version of the delta method.
Theorem 20 (The Multivariate Delta Method) Suppose that Yn = (Yn1, . . . , Ynk) is a sequence of random vectors such that

    √n (Yn − µ) ⇝ N(0, Σ).

Let g : R^k → R and let

    ∇g(y) = (∂g/∂y1, . . . , ∂g/∂yk)ᵀ.

Let ∇µ denote ∇g(y) evaluated at y = µ and assume that the elements of ∇µ are nonzero. Then

    √n (g(Yn) − g(µ)) ⇝ N(0, ∇µᵀ Σ ∇µ).
Example 21 Let

    (X11, X21)ᵀ, (X12, X22)ᵀ, . . . , (X1n, X2n)ᵀ

be iid random vectors with mean µ = (µ1, µ2)ᵀ and variance Σ. Let

    X̄1 = (1/n) ∑_{i=1}^n X1i,    X̄2 = (1/n) ∑_{i=1}^n X2i

and define Yn = X̄1 X̄2. Thus, Yn = g(X̄1, X̄2) where g(s1, s2) = s1 s2. By the central limit theorem,

    √n (X̄1 − µ1, X̄2 − µ2)ᵀ ⇝ N(0, Σ).
Now

    ∇g(s) = (∂g/∂s1, ∂g/∂s2)ᵀ = (s2, s1)ᵀ

and so

    ∇µᵀ Σ ∇µ = (µ2, µ1) (σ11 σ12; σ12 σ22) (µ2, µ1)ᵀ = µ2² σ11 + 2 µ1 µ2 σ12 + µ1² σ22.

Therefore,

    √n (X̄1 X̄2 − µ1 µ2) ⇝ N(0, µ2² σ11 + 2 µ1 µ2 σ12 + µ1² σ22).
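A numerical check of Example 21, as a minimal Python sketch assuming numpy; the means, covariance matrix, n, and the number of replications are all arbitrary illustrative choices.

    import numpy as np

    rng = np.random.default_rng(9)
    n = 200                                     # illustrative sample size (assumption)
    mu = np.array([1.0, 2.0])                   # illustrative means (assumption)
    Sigma = np.array([[1.0, 0.3],
                      [0.3, 2.0]])              # illustrative covariance (assumption)

    X = rng.multivariate_normal(mu, Sigma, size=(10_000, n))   # shape (reps, n, 2)
    xbar = X.mean(axis=1)                       # each row is (Xbar_1, Xbar_2)
    Yn = xbar[:, 0] * xbar[:, 1]                # Yn = Xbar_1 * Xbar_2

    grad = np.array([mu[1], mu[0]])             # gradient of g(s1, s2) = s1*s2 at mu
    avar = grad @ Sigma @ grad                  # mu2^2*s11 + 2*mu1*mu2*s12 + mu1^2*s22

    print("simulated n * Var(Yn):", n * Yn.var())
    print("delta method value   :", avar)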