Maximum likelihood estimation of regression
parameters in statistical matching: a
comparison between different approaches
Marcello D’Orazio1 , Marco Di Zio2 , and Mauro Scanu3
1 Istituto Nazionale di Statistica, Roma, Italy madorazi@istat.it
2 Istituto Nazionale di Statistica, Roma, Italy dizio@istat.it
3 Istituto Nazionale di Statistica, Roma, Italy scanu@istat.it
Summary. This paper studies the pros and cons of using maximum likelihood estimation of parametric regression models for the statistical matching of continuous variables. The maximum likelihood approach is mainly compared with two methods proposed in the statistical matching literature. A set of simulations is performed in order to compare these methods.
Key words: data integration, data fusion, imputation
1 Introduction
Statistical matching plays a very important role when integrating two files, see e.g. [DDS01, DDS02]. The term statistical matching covers the set of methods that aim to study the relationship between variables observed in different files and never jointly observed. These techniques can be classified in two ways. The first distinguishes between methods aiming at the creation of a synthetic complete data set (micro objective) and methods aiming at directly estimating parameters (macro objective). The second takes into consideration the nature of the variables to be integrated: regression or mixed methods (i.e. methods that first use parametric regression and then nonparametric hot deck methods [SMK93]) are mainly adopted for continuous variables, while methods for categorical variables are based on loglinear models.
This paper studies the pros and cons of using maximum likelihood (ML) estimation of parametric regression models for the statistical matching of continuous variables. We mainly compare ML with two methods proposed in the statistical matching literature, [MS01] (henceforth the MS method) and [Räs02] (henceforth RegRieps). A set of simulations is performed in order to compare these three methods and, additionally, two mixed methods differing only in the first step (estimation of the regression parameters by MS or ML). The following experiments deal only with the trivariate case; the extension to the general case can be found in [DDS06].
2 Methods
The statistical matching framework is the following: A is a data set of size $n_A$ where (X, Y) has been observed and B is a data set of size $n_B$ where (X, Z) has been observed. We aim to integrate A and B in order to gain joint information on (Y, Z) (or analogously on (Y, Z | X)). This can be done either by first creating a synthetic complete data set where (X, Y, Z) are jointly available, or by directly estimating parameters of the relation between (Y, Z), e.g. the correlation coefficient $\rho_{YZ}$.
Let (X, Y, Z) be normally distributed with mean vector $(\mu_X, \mu_Y, \mu_Z)'$ and covariance matrix

$$
\Sigma = \begin{pmatrix}
\sigma_X^2 & \sigma_{XY} & \sigma_{XZ} \\
\sigma_{YX} & \sigma_Y^2 & \sigma_{YZ} \\
\sigma_{ZX} & \sigma_{ZY} & \sigma_Z^2
\end{pmatrix}.
$$
Given the available data sets A and B, $\sigma_{YZ}$ (as well as $\sigma_{YZ|X}$) is not estimable unless strong assumptions are imposed. Commonly, the conditional independence assumption $Y \perp Z \mid X$ (henceforth CIA) is adopted, i.e. $\sigma_{YZ} = \sigma_{XY}\sigma_{XZ}/\sigma_X^2$, or analogously $\sigma_{YZ|X} = 0$. In fact, in general, the regression equation of Y on X and Z (analogous results hold for the regression equation of Z on X and Y) should be:

$$
Y = \mu_{Y|XZ} + \epsilon_{Y|XZ} = \mu_Y + \beta_{YX.Z}(X - \mu_X) + \beta_{YZ.X}(Z - \mu_Z) + \epsilon_{Y|XZ},
$$
where

$$
\beta_{YX.Z} = \frac{\sigma_{XY|Z}}{\sigma^2_{X|Z}}, \qquad
\beta_{YZ.X} = \frac{\sigma_{YZ|X}}{\sigma^2_{Z|X}}
$$

are the partial regression coefficients of Y on X and Z, and $\epsilon_{Y|XZ}$ is normally distributed with zero mean and variance

$$
\sigma^2_{Y|XZ} = \sigma_Y^2 -
\begin{pmatrix} \sigma_{XY} & \sigma_{YZ} \end{pmatrix}
\begin{pmatrix} \sigma_X^2 & \sigma_{XZ} \\ \sigma_{XZ} & \sigma_Z^2 \end{pmatrix}^{-1}
\begin{pmatrix} \sigma_{XY} \\ \sigma_{YZ} \end{pmatrix}.
$$
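To make the previous formulas concrete, here is a small illustration in R (the environment used for the simulations of Section 3): it computes the partial regression coefficients and the residual variance of Y given (X, Z) from a known covariance matrix. The matrix used is the $\rho^{(1)}$ correlation matrix of Section 3; the variable names are ours.

```r
# Partial regression coefficients and residual variance of Y given (X, Z)
# from an assumed covariance matrix Sigma, variables ordered (X, Y, Z).
Sigma <- matrix(c(1,   0.7, 0.7,
                  0.7, 1,   0.7,
                  0.7, 0.7, 1), 3, 3)
b <- c(Sigma[1, 2], Sigma[2, 3])              # (sigma_XY, sigma_YZ)
S <- Sigma[c(1, 3), c(1, 3)]                  # covariance block of (X, Z)
beta <- solve(S, b)                           # (beta_YX.Z, beta_YZ.X)
res.var <- Sigma[2, 2] - drop(t(b) %*% solve(S) %*% b)
beta; res.var
```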
In order to make inference on the previous parameters, three methods are illustrated:
ML, RegRieps and MS. A final mixed method is also shown.
2.1 Maximum likelihood
The likelihood function given the sample $A \cup B$ is:

$$
L(\theta \mid A \cup B) =
\prod_{a=1}^{n_A} f_{Y|X}\left(y_a \mid x_a; \theta_{Y|X}\right)
\prod_{a=1}^{n_A} f_X\left(x_a; \theta_X\right)
\prod_{b=1}^{n_B} f_{Z|X}\left(z_b \mid x_b; \theta_{Z|X}\right)
\prod_{b=1}^{n_B} f_X\left(x_b; \theta_X\right),
$$

with $\theta_X = (\mu_X, \sigma_X^2)$ and $\theta_{T|X} = (\alpha_{T|X}, \beta_{T|X}, \sigma^2_{T|X})$, $T = Y, Z$. Hence, the maximum likelihood estimates of the parameters are the following [LOR01].
(a) The X parameters are estimated on the entire $A \cup B$:

$$
\hat\mu_X = \bar x_{A \cup B}, \qquad \hat\sigma_X^2 = s^2_{X; A \cup B}.
$$
(b) The parameters of Y given X are estimated in two distinct phases. The first one estimates the parameters of the regression equation of Y on X:

$$
Y = \mu_{Y|X} + \epsilon_{Y|X} = \alpha_Y + \beta_{YX} X + \epsilon_{Y|X}, \tag{1}
$$

that is:

$$
\hat\beta_{YX} = \frac{s_{XY;A}}{s^2_{X;A}}, \qquad
\hat\alpha_Y = \bar y_A - \hat\beta_{YX}\,\bar x_A, \qquad
\hat\sigma^2_{Y|X} = s^2_{Y;A} - \hat\beta^2_{YX}\, s^2_{X;A},
$$

where $\bar x_A$ and $\bar y_A$ are the sample means of X and Y in A, while s denotes the sample variance or covariance, according to the subscripts. The second step consists in pooling the previous estimates with those in (a):

$$
\hat\mu_Y = \hat\alpha_Y + \hat\beta_{YX}\,\hat\mu_X, \qquad
\hat\sigma_Y^2 = \hat\sigma^2_{Y|X} + \hat\beta^2_{YX}\,\hat\sigma_X^2, \qquad
\hat\sigma_{XY} = \hat\beta_{YX}\,\hat\sigma_X^2.
$$
Similar results hold for Z.
(c) Under the CIA, $\hat\sigma_{YZ}$ can be estimated by pooling together the results of (a) and (b). When the CIA cannot be assumed, auxiliary information is necessary. For instance, if it is possible to assume $\rho_{YZ|X} = \rho^*_{YZ|X}$ (from an ad hoc survey, or historical data), $\hat\sigma_{YZ|X}$ is estimated by:

$$
\hat\sigma_{YZ|X} = \rho^*_{YZ|X} \sqrt{\hat\sigma^2_{Y|X}\,\hat\sigma^2_{Z|X}}.
$$
(d) The maximum likelihood estimate of $\sigma_{YZ}$, pooling the results in (a), (b), and (c), is:

$$
\hat\sigma_{YZ} = \hat\sigma_{YZ|X} + \frac{\hat\sigma_{YX}\,\hat\sigma_{XZ}}{\hat\sigma_X^2}.
$$
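Steps (a)-(d) translate directly into a few lines of R. The following is a minimal sketch; the function and argument names (ml_match, rho.star, etc.) are ours, not from the paper, and the n − 1 divisor of R's var() and cov() is kept since, as noted for RegRieps below, the difference from the ML divisor n is negligible for large samples.

```r
# Sketch of the ML steps (a)-(d). xA, yA come from file A; xB, zB from
# file B; rho.star is the postulated rho*_{YZ|X} (0 under the CIA).
ml_match <- function(xA, yA, xB, zB, rho.star = 0) {
  # (a) X parameters on the pooled file A union B
  x <- c(xA, xB)
  mu.X <- mean(x); var.X <- var(x)
  # (b) regression of Y on X estimated on A, then pooled with (a)
  beta.YX <- cov(xA, yA) / var(xA)
  alpha.Y <- mean(yA) - beta.YX * mean(xA)
  resvar.Y <- var(yA) - beta.YX^2 * var(xA)
  mu.Y  <- alpha.Y + beta.YX * mu.X
  var.Y <- resvar.Y + beta.YX^2 * var.X
  cov.XY <- beta.YX * var.X
  # ... and the same for Z on X, estimated on B
  beta.ZX <- cov(xB, zB) / var(xB)
  alpha.Z <- mean(zB) - beta.ZX * mean(xB)
  resvar.Z <- var(zB) - beta.ZX^2 * var(xB)
  mu.Z  <- alpha.Z + beta.ZX * mu.X
  var.Z <- resvar.Z + beta.ZX^2 * var.X
  cov.XZ <- beta.ZX * var.X
  # (c) postulated rho*_{YZ|X} gives the conditional covariance ...
  cov.YZ.X <- rho.star * sqrt(resvar.Y * resvar.Z)
  # (d) ... which is pooled into the unconditional sigma_{YZ}
  cov.YZ <- cov.YZ.X + cov.XY * cov.XZ / var.X
  list(mu  = c(X = mu.X, Y = mu.Y, Z = mu.Z),
       var = c(X = var.X, Y = var.Y, Z = var.Z),
       cov = c(XY = cov.XY, XZ = cov.XZ, YZ = cov.YZ))
}
```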
2.2 RegRieps
Another method for the estimation of the parameters of Section 2 is that in [Räs02]. The parameters are estimated through the least squares method, and thus differ from the maximum likelihood ones in the denominator of the estimated variances, i.e. by the difference between the sample size and the degrees of freedom. This difference is negligible for large samples.
The main difference concerns the estimation of the residual variance. Rässler's estimate is obtained as the sum of squared differences between the observed values and the values predicted by the estimated regression function. For the residual variance of the regression of Y on X, and of Z on X, this estimate coincides with the one obtained by the maximum likelihood method from the estimates in (a) and (b) of Section 2.1. The difference lies in the residual variance of Z on X and Y, and of Y on X and Z. For these components, some additional operations are needed in order to compute the residuals between predicted and observed values. For the residual variance of Z on X and Y, Rässler imputes $\tilde Z$ in the data set B according to the following steps: first the missing variable $\tilde Y$ is imputed through the estimated regression of Y on X and Z, and then the residuals between the values $\tilde Z$ and Z are computed, where $\tilde Z$ is obtained by means of the regression of Z on X and $\tilde Y$. This estimate is affected by the fact that the predicted $\tilde Z$ values lack the variability of the unobserved and imputed Y values. The estimate of the residual variance of Y on X and Z is obtained analogously.
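A sketch of this two-step computation in R may clarify why the resulting estimate is too small. The interface is our own invention: coef.Y is assumed to hold the intercept and slopes of the estimated regression of Y on (X, Z), and coef.Z those of Z on (X, Y).

```r
# RegRieps-style residual variance of Z on (X, Y), computed on file B
# where Y is not observed: Y is first replaced by its regression
# prediction, so the final residuals understate the true variance.
regrieps_resvar_Z <- function(xB, zB, coef.Y, coef.Z) {
  y.tilde <- coef.Y[1] + coef.Y[2] * xB + coef.Y[3] * zB   # impute Y in B
  z.tilde <- coef.Z[1] + coef.Z[2] * xB + coef.Z[3] * y.tilde
  mean((zB - z.tilde)^2)  # y.tilde carries none of Y's residual variability
}
```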
2.3 Consistent methods
Another technique for the estimation of the regression parameters is that used in [MS01]. The parameters are estimated by means of their observed counterparts: means by the averages of the observed values, variances by the observed variances, and so on. For the residual variance they suggest the following estimate:

$$
\tilde\sigma^2_{Z|XY} = s^2_{Z;B} -
\begin{pmatrix} s_{XZ;B} & \sigma^*_{YZ} \end{pmatrix}
\begin{pmatrix} s^2_{X;A} & s_{XY;A} \\ s_{XY;A} & s^2_{Y;A} \end{pmatrix}^{-1}
\begin{pmatrix} s_{XZ;B} \\ \sigma^*_{YZ} \end{pmatrix},
$$

where $\sigma^*_{YZ}$ is a postulated value. Since the residual variance is built by using different subsets, it may turn out to be negative (or, when X, Y and Z are multivariate, not positive definite).
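A direct transcription of this estimate in R (function and argument names are ours) also shows where the negativity problem comes from: the quadratic form mixes moments from two different files with a postulated value, so nothing guarantees that the difference stays non-negative.

```r
# [MS01]-style residual variance of Z given (X, Y). s2Z.B and sXZ.B come
# from file B; s2X.A, sXY.A, s2Y.A from file A; sigma.star is the
# postulated sigma*_{YZ}.
ms_resvar_Z <- function(s2Z.B, sXZ.B, s2X.A, sXY.A, s2Y.A, sigma.star) {
  b <- c(sXZ.B, sigma.star)
  S <- matrix(c(s2X.A, sXY.A, sXY.A, s2Y.A), 2, 2)
  v <- s2Z.B - drop(t(b) %*% solve(S) %*% b)
  if (v < 0) warning("negative residual variance: incoherent inputs")
  v
}
```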
2.4 Mixed methods
The regression estimates described in Section 2.3 are used by [MS01] within a mixed method. They propose the following two steps:

(i) Regression step. Intermediate values for the units in A are computed through:

$$
\tilde z_a = \hat\mu_Z + \frac{\hat\sigma_{ZX|Y}}{\hat\sigma^2_{X|Y}}(x_a - \hat\mu_X) + \frac{\hat\sigma_{ZY|X}}{\hat\sigma^2_{Y|X}}(y_a - \hat\mu_Y) + e_{Z|XY}
$$

for each $a = 1, \dots, n_A$, where $e_{Z|XY}$ is randomly drawn from a Gaussian distribution with zero mean and residual variance $\hat\sigma^2_{Z|XY}$, and analogously for the units in B:

$$
\tilde y_b = \hat\mu_Y + \frac{\hat\sigma_{YX|Z}}{\hat\sigma^2_{X|Z}}(x_b - \hat\mu_X) + \frac{\hat\sigma_{YZ|X}}{\hat\sigma^2_{Z|X}}(z_b - \hat\mu_Z) + e_{Y|XZ}
$$

for each $b = 1, \dots, n_B$, with $e_{Y|XZ}$ randomly drawn from a Gaussian distribution with zero mean and residual variance $\hat\sigma^2_{Y|XZ}$.
(ii) Matching step. For each $a = 1, \dots, n_A$, the value $z_{b^*}$ of the nearest record $b^*$ in B is imputed. The distance is computed between the couples, $d((y_a, \tilde z_a), (\tilde y_b, z_b))$, and the suggested metric is the Mahalanobis distance. The matching is constrained, i.e. the number of times a record has already been chosen as donor is also taken into account in the distance computation [ROD01] (see the R sketch after this list).
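As a sketch, an unconstrained version of the matching step can be written in R as follows; the constrained variant would additionally bound the number of times each donor is selected. Function and variable names are ours.

```r
# Matching step: for each record of A (observed Y, regressed Z-tilde)
# find the nearest donor in B (regressed Y-tilde, observed Z) with the
# Mahalanobis distance, then impute the donor's observed Z.
match_step <- function(yA, zA.tilde, yB.tilde, zB) {
  rec <- cbind(yA, zA.tilde)
  don <- cbind(yB.tilde, zB)
  S <- cov(rbind(rec, don))    # pooled covariance used as the metric
  donor <- apply(rec, 1, function(r) which.min(mahalanobis(don, r, S)))
  zB[donor]                    # imputed Z values for the units in A
}
```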
The maximum likelihood approach is expected to give better results than those obtained by the techniques illustrated in Sections 2.2 and 2.3. Since [WIL01], the question has been raised of the efficiency gained by using ML estimates instead of the sample observed counterparts of the parameters to be estimated, as in MS. [LOR01] shows that the gain in efficiency from using $\hat\mu_Y$ instead of $\bar y_A$ for the parameter $\mu_Y$ is:

$$
\frac{Var(\hat\mu_Y)}{Var(\bar y_A)} = 1 - \frac{n_B}{n_A + n_B}\,\rho^2_{XY}.
$$

Thus, when both $\rho^2_{XY}$ and the fraction of missing data in B are sufficiently large, $\hat\mu_Y$ is more efficient.
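As a quick numerical check of this ratio, using sample sizes like those in the simulations of Section 3:

```r
# Efficiency ratio Var(mu.hat_Y) / Var(ybar_A) for nA = nB = 500 and
# rho_XY = 0.7 (the value appearing in the rho(1) matrix of Section 3)
nA <- 500; nB <- 500; rho.XY <- 0.7
1 - nB / (nA + nB) * rho.XY^2  # = 0.755, about a 24% variance reduction
```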
The use of ML estimators is also expected to improve the estimates of the residual variances with respect to RegRieps and MS. For RegRieps, the improvement concerns the underestimation of the residual variance; for MS it is mainly expected in terms of coherence of the estimates [MS04].
The experiments are devoted to the comparison of these three methods of estimating regression parameters. Moreover, a comparison between the mixed method in Section 2.4 and a mixed method where the regression step is performed according to the ML estimates is carried out.
3 Simulation results
In [DDS05] an extensive simulation is carried out on trivariate normal distributions with zero mean and unit variance. In this paper the attention is limited to the following correlation matrices:

$$
\rho^{(1)} = \begin{pmatrix} 1 & 0.7 & 0.7 \\ 0.7 & 1 & 0.7 \\ 0.7 & 0.7 & 1 \end{pmatrix}, \qquad
\rho^{(2)} = \begin{pmatrix} 1 & 0.5 & 0.95 \\ 0.5 & 1 & 0.7 \\ 0.95 & 0.7 & 1 \end{pmatrix},
$$

where moderate ($\rho^{(1)}$) and high ($\rho^{(2)}$) correlation between X and Z is assumed. In this way, it is shown how relevant this relationship is for the statistical matching problem. Other simulations for different $\rho_{YZ}$ are investigated in [DDS05].
For each normal distribution a random sample of 1000 units is drawn and randomly split into two subsamples of 500 units each. Then, in A the variable Z is deleted, while the variable Y is deleted in B. The two files A and B are used as input of the three previously introduced procedures (MS, RegRieps, ML) in order to estimate the parameters of the normal distribution and, in particular, $\rho_{YZ}$. Obviously, given that $\rho_{YZ}$ cannot be estimated directly from the available data files, all the procedures start from a preassigned value $\rho^*_{YZ}$ for it (or, equivalently, a value $\rho^*_{YZ|X}$ is postulated for $\rho_{YZ|X}$). This procedure is repeated 1 000 times for each combination of the initial correlation matrix with the postulated values for $\rho_{YZ}$ (or $\rho_{YZ|X}$). The 1 000 estimates provided by each method for the parameters of the trivariate normal distribution are compared with the true parameters and the following synthetic measures are computed: (i) the simulation bias ('Bias' in the tables), i.e. the average of the 1 000 estimates obtained for each parameter minus the true parameter value; (ii) the simulation MSE ('MSE' in the tables), obtained as the average of the 1 000 squared differences between the estimated and true parameter values.
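The simulation design just described can be sketched in R roughly as follows, reusing the ml_match sketch of Section 2.1 and MASS::mvrnorm as one standard multivariate normal sampler; here only $\hat\mu_Y$ is tracked, whose true value is 0.

```r
library(MASS)  # mvrnorm: multivariate normal sampler

rho1 <- matrix(c(1,   0.7, 0.7,
                 0.7, 1,   0.7,
                 0.7, 0.7, 1), 3, 3)
est <- replicate(1000, {
  xyz <- mvrnorm(1000, mu = rep(0, 3), Sigma = rho1)
  idx <- sample(1000, 500)
  A <- xyz[idx, ];  B <- xyz[-idx, ]
  # Z is deleted in A and Y in B; estimation under the CIA (rho.star = 0)
  fit <- ml_match(A[, 1], A[, 2], B[, 1], B[, 3], rho.star = 0)
  fit$mu["Y"]
})
c(Bias = mean(est), MSE = mean(est^2))  # true mu_Y is 0
```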
In [DDS05] several different values for $\rho^*_{YZ}$ are considered, including the parameter value corresponding to the CIA ($\rho_{YZ|X} = 0$). The results presented below are limited to the cases $\rho^*_{YZ} = 0.49$ ($\rho^*_{YZ|X} = 0.0$, i.e. the CIA) and $\rho^*_{YZ} = 0.694$ for $\rho^{(1)}$. As far as $\rho^{(2)}$ is concerned, we considered $\rho^*_{YZ} = 0.475$ ($\rho^*_{YZ|X} = 0.0$, i.e. the CIA) and $\rho^*_{YZ} = 0.69$.
The results in Tables 1 and 2 show that ML performs slightly better than the other methods in terms of both (simulation) Bias and MSE. This is more evident when a high correlation between Y and Z is assumed. Note that the performance of RegRieps is close to that of ML in the estimation of means and correlations, while some problems arise, as expected, in the estimation of the residual variances $\sigma^2_{Y|XZ}$ and $\sigma^2_{Z|XY}$. The tables provide no results for the correlation coefficient $\rho_{YZ}$ given that, as expected, all the methods yield exactly the starting value, i.e. $\rho^*_{YZ}$ in MS and $\rho^*_{YZ|X}$ in RegRieps and ML.
Table 1. Simulation results for the estimation of some parameters of the trivariate normal distribution with correlation matrix ρ(1)

                   ρ*YZ = 0.49 (CIA)             ρ*YZ = 0.694
                   MS       RegRieps  ML         MS       RegRieps  ML
|Bias(µ̂Y)|        0.00114  0.00084   0.00084    0.00210  0.00162   0.00162
MSE(µ̂Y)           0.00198  0.00151   0.00151    0.00190  0.00145   0.00145
|Bias(µ̂Z)|        0.00094  0.00063   0.00063    0.00091  0.00050   0.00050
MSE(µ̂Z)           0.00207  0.00154   0.00154    0.00187  0.00145   0.00145
|Bias(ρ̂XY)|       0.00108  0.00111   0.00040    0.00059  0.00027   0.00045
MSE(ρ̂XY)          0.00106  0.00045   0.00045    0.00109  0.00044   0.00044
|Bias(ρ̂XZ)|       0.00033  0.00034   0.00038    0.00092  0.00074   0.00002
MSE(ρ̂XZ)          0.00097  0.00045   0.00045    0.00100  0.00044   0.00044
|Bias(σ̂²Y|XZ)|    0.07916  0.09074   0.08348    0.00257  0.05559   0.00269
MSE(σ̂²Y|XZ)       0.00932  0.00923   0.00794    0.00149  0.00418   0.00070
|Bias(σ̂²Z|XY)|    0.07946  0.09184   0.08456    0.00026  0.05367   0.00400
MSE(σ̂²Z|XY)       0.00937  0.00952   0.00822    0.00151  0.00406   0.00078
Table 2. Simulation results for the estimation of some parameters of the trivariate normal distribution with correlation matrix ρ(2)

                   ρ*YZ = 0.475 (CIA)            ρ*YZ = 0.69
                   MS       RegRieps  ML         MS       RegRieps  ML
|Bias(µ̂Y)|        0.00134  0.00089   0.00089    0.00196  0.00179   0.00179
MSE(µ̂Y)           0.00200  0.00174   0.00174    0.00186  0.00169   0.00169
|Bias(µ̂Z)|        0.00060  0.00042   0.00042    0.00103  0.00134   0.00134
MSE(µ̂Z)           0.00200  0.00115   0.00115    0.00211  0.00110   0.00110
|Bias(ρ̂XY)|       0.00070  0.00080   0.00005    0.00058  0.00041   0.00034
MSE(ρ̂XY)          0.00147  0.00110   0.00110    0.00148  0.00111   0.00111
|Bias(ρ̂XZ)|       0.00035  0.00002   0.00021    0.00059  0.00001   0.00019
MSE(ρ̂XZ)          0.00054  0.00001   0.00001    0.00056  0.00001   0.00001
|Bias(σ̂²Y|XZ)|    0.49994  0.73044   0.51846    0.03970  0.74542   0.03826
MSE(σ̂²Y|XZ)       0.25348  0.53744   0.27112    0.02518  0.56181   0.00175
|Bias(σ̂²Z|XY)|    0.06468  0.27534   0.06700    0.01611  3.65431   0.00493
MSE(σ̂²Z|XY)       0.00874  0.07749   0.00453    0.00269  13.69306  0.00003
Additional simulations have been carried out in order to compare the mixed method (MS-mix in the tables) suggested by [MS01] (see Section 2.4) with a similar method which uses the ML parameter estimates in the regression step (ML-mix). The simulation steps are essentially the same as those introduced above, with the further addition of a final donor step: at first A is considered as recipient and Z is imputed; then the role of recipient is assigned to B and Y is imputed.
Due to the heavy computational effort required by the application of constrained donor imputation techniques, the entire simulation procedure has been iterated 500 times for each combination of the starting correlation matrix with the assumed value for $\rho^*_{YZ}$ ($\rho^*_{YZ} = 0.694$ and $\rho^*_{YZ} = 0.898$ are considered for $\rho^{(1)}$, while $\rho^*_{YZ} = 0.502$ and $\rho^*_{YZ} = 0.6913$ are considered for $\rho^{(2)}$). The results of the simulation are summarized in Tables 3 and 4.
All the simulations have been carried out in the R environment [R].
Table 3. Simulation results for the estimation of the correlation coefficients of ρ(1)

                         ρ*YZ = 0.694         ρ*YZ = 0.898
par   file               MS-mix   ML-mix      MS-mix   ML-mix
ρXY   B     |Bias|       0.01021  0.00920     0.00946  0.00936
            MSE          0.00269  0.00099     0.00218  0.00078
ρXZ   A     |Bias|       0.00912  0.01095     0.00872  0.01191
            MSE          0.00249  0.00101     0.00223  0.00089
ρYZ   A,B   |Bias|       0.01510  0.01600     0.18640  0.18408
            MSE          0.00068  0.00089     0.03494  0.03412
Table 4. Simulation results for the estimation of the correlation coefficients of ρ(2)

                         ρ*YZ = 0.502         ρ*YZ = 0.6913
par   file               MS-mix   ML-mix      MS-mix   ML-mix
ρXY   B     |Bias|       0.00352  0.00792     0.00409  0.00734
            MSE          0.00233  0.00165     0.00232  0.00150
ρXZ   A     |Bias|       0.01762  0.01297     0.02283  0.01294
            MSE          0.00185  0.00020     0.00157  0.00021
ρYZ   A,B   |Bias|       0.20474  0.20621     0.02723  0.01863
            MSE          0.04241  0.04390     0.00461  0.00137
The results in Tables 3 and 4 show that the two methods perform similarly as far as bias is concerned. The ML-mix method seems slightly more stable, given that its simulation MSE is almost always smaller than the MS-mix MSE.

These results show that the ML approach turns out to be preferable, because of its better performance in terms of estimation. This finding is also confirmed by the simulation results reported in [DDS05]. Furthermore, the problem of the coherence of estimates from different files is naturally resolved.
References
[DDS01] D'Orazio, M., Di Zio, M., Scanu, M.: Statistical Matching: a tool for integrating data in National Statistical Institutes. Proceedings NTTS-ETK 2001, Hersonissos (Greece), 18-22 June 2001, volume I, 433–441 (2001)
[DDS02] D'Orazio, M., Di Zio, M., Scanu, M.: Statistical Matching and Official Statistics. Rivista di Statistica Ufficiale 1, 5–24 (2002)
[DDS05] D'Orazio, M., Di Zio, M., Scanu, M.: A comparison among different estimators of regression parameters on statistically matched files through an extensive simulation study. Istituto Nazionale di Statistica, Roma, Contributi 2005/10 (2005)
[DDS06] D'Orazio, M., Di Zio, M., Scanu, M.: Statistical Matching: Theory and Practice. Wiley, Chichester (2006)
[LOR01] Lord, F. M.: Estimation of parameters from incomplete data. Journal of the American Statistical Association 50, 870–876 (1955)
[MS01] Moriarity, C., Scheuren, F.: Statistical Matching: a Paradigm for Assessing the Uncertainty in the Procedure. Journal of Official Statistics 17, 407–422 (2001)
[MS04] Moriarity, C., Scheuren, F.: Regression-based statistical matching: recent developments. Proceedings of the Section on Survey Research Methods, American Statistical Association (2004)
[Räs02] Rässler, S.: Statistical Matching: a Frequentist Theory, Practical Applications and Alternative Bayesian Approaches. Springer, New York (2002)
[R] R Development Core Team: R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria (2005)
[ROD01] Rodgers, W. L.: An Evaluation of Statistical Matching. Journal of Business and Economic Statistics 2, 91–102 (1984)
[SMK93] Singh, A., Mantel, H., Kinack, M., Rowe, G.: Statistical matching: use of auxiliary information as an alternative to the conditional independence assumption. Survey Methodology 19, 59–79 (1993)
[WIL01] Wilks, S. S.: Moments and distributions of estimates of population parameters from fragmentary samples. Annals of Mathematical Statistics 3, 163–194 (1932)