Maximum likelihood estimation of regression parameters in statistical matching: a comparison between different approaches

Marcello D'Orazio, Marco Di Zio, and Mauro Scanu

Istituto Nazionale di Statistica, Roma, Italy
madorazi@istat.it, dizio@istat.it, scanu@istat.it

Summary. This paper aims to study the pros and cons of using maximum likelihood estimation of parametric regression models for the statistical matching of continuous variables. We mainly compare the maximum likelihood approach with two methods proposed in the statistical matching literature. A set of simulations is performed in order to compare these methods.

Key words: data integration, data fusion, imputation

1 Introduction

Statistical matching plays a very important role in the integration of two files, see e.g. [DDS01] and [DDS02]. The term statistical matching covers the set of methods that aim to study the relationship between variables observed in different files and never jointly observed. These techniques can be classified in two alternative ways. The first distinguishes between methods aiming at the creation of a synthetic complete data set (micro objective) and methods aiming at directly estimating parameters (macro objective). The second takes into consideration the nature of the variables to be integrated: regression or mixed methods (i.e. methods that first use parametric regression and then nonparametric hot deck methods, [SMK93]) are mainly adopted for continuous variables, while methods for categorical variables are based on loglinear models.

This paper aims to study the pros and cons of using maximum likelihood (ML) estimation of parametric regression models for the statistical matching of continuous variables. We mainly compare ML with two methods proposed in the statistical matching literature, respectively [MS01] (henceforth the MS method) and [Räs02] (henceforth RegRieps). A set of simulations is performed in order to compare these three methods and, additionally, two mixed methods differing only in their first step (estimation of the regression parameters by MS or by ML). The following experiments deal only with the trivariate case; the extension to the general case can be found in [DDS06].

2 Methods

The statistical matching framework is the following: A is a data set of size n_A where (X, Y) has been observed, and B is a data set of size n_B where (X, Z) has been observed. We aim to integrate A and B in order to gain joint information on (Y, Z) (or, analogously, on (Y, Z|X)). This can be done either by first creating a synthetic complete data set where (X, Y, Z) are jointly available, or by directly estimating parameters of the relationship between Y and Z, e.g. the correlation coefficient ρ_YZ.

Let (X, Y, Z) be normally distributed with mean vector (µ_X, µ_Y, µ_Z)′ and covariance matrix

$$\Sigma = \begin{pmatrix} \sigma_X^2 & \sigma_{XY} & \sigma_{XZ} \\ \sigma_{YX} & \sigma_Y^2 & \sigma_{YZ} \\ \sigma_{ZX} & \sigma_{ZY} & \sigma_Z^2 \end{pmatrix}.$$

Given the available data sets A and B, σ_YZ (as well as σ_YZ|X) is not estimable unless strong assumptions are imposed. Commonly, the conditional independence assumption Y ⊥ Z | X (henceforth CIA) is made, i.e. σ_YZ = σ_XY σ_XZ / σ²_X, or analogously σ_YZ|X = 0.
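For concreteness, the following minimal sketch (in Python; the variable names are ours, and the parameter values are those of the ρ^(1) setting used later in Section 3) evaluates the covariance value that the CIA imposes on the inestimable pair (Y, Z).

```python
# A minimal sketch of the CIA identity sigma_YZ = sigma_XY * sigma_XZ / sigma_X^2,
# with unit variances and rho_XY = rho_XZ = 0.7 as in the rho(1) setting of Section 3.
sigma2_X = 1.0   # variance of the common variable X
sigma_XY = 0.7   # covariance of X and Y (estimable from file A)
sigma_XZ = 0.7   # covariance of X and Z (estimable from file B)

# Under the CIA (Y independent of Z given X), the inestimable covariance is fixed at:
sigma_YZ_cia = sigma_XY * sigma_XZ / sigma2_X
print(sigma_YZ_cia)  # 0.49, i.e. the rho*_YZ value used for rho(1) in Section 3
```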
In fact, in general, the regression equation of Y on X and Z (analogous results hold for the regression equation of Z on X and Y) is:

$$Y = \mu_{Y|XZ} + \epsilon_{Y|XZ} = \mu_Y + \beta_{YX.Z}(X - \mu_X) + \beta_{YZ.X}(Z - \mu_Z) + \epsilon_{Y|XZ},$$

where

$$\beta_{YX.Z} = \frac{\sigma_{XY|Z}}{\sigma^2_{X|Z}}, \qquad \beta_{YZ.X} = \frac{\sigma_{YZ|X}}{\sigma^2_{Z|X}}$$

are the partial regression coefficients of Y on X and Z, and ε_Y|XZ is normally distributed with zero mean and variance

$$\sigma^2_{Y|XZ} = \sigma^2_Y - \begin{pmatrix} \sigma_{XY} & \sigma_{YZ} \end{pmatrix} \begin{pmatrix} \sigma^2_X & \sigma_{XZ} \\ \sigma_{XZ} & \sigma^2_Z \end{pmatrix}^{-1} \begin{pmatrix} \sigma_{XY} \\ \sigma_{YZ} \end{pmatrix}.$$

In order to make inference on the previous parameters, three methods are illustrated: ML, RegRieps and MS. A mixed method is also described.

2.1 Maximum likelihood

The likelihood function given the sample A ∪ B is:

$$L(\theta \mid A \cup B) = \prod_{a=1}^{n_A} f_{Y|X}(y_a \mid x_a; \theta_{Y|X}) \prod_{a=1}^{n_A} f_X(x_a; \theta_X) \prod_{b=1}^{n_B} f_{Z|X}(z_b \mid x_b; \theta_{Z|X}) \prod_{b=1}^{n_B} f_X(x_b; \theta_X),$$

with θ_X = (µ_X, σ²_X) and θ_T|X = (α_T|X, β_T|X, σ²_T|X), T = Y, Z. Hence, the maximum likelihood estimates of the parameters are the following [LOR01].

(a) The X parameters are estimated on the entire A ∪ B:

$$\hat\mu_X = \bar x_{A \cup B}, \qquad \hat\sigma^2_X = s^2_{X; A \cup B}.$$

(b) The parameters of Y given X are estimated in two distinct phases. The first estimates the parameters of the regression equation of Y on X,

$$Y = \mu_{Y|X} + \epsilon_{Y|X} = \alpha_Y + \beta_{YX} X + \epsilon_{Y|X}, \qquad (1)$$

that is:

$$\hat\beta_{YX} = \frac{s_{XY;A}}{s^2_{X;A}}, \qquad \hat\alpha_Y = \bar y_A - \hat\beta_{YX} \bar x_A, \qquad \hat\sigma^2_{Y|X} = s^2_{Y;A} - \hat\beta^2_{YX} s^2_{X;A},$$

where x̄_A and ȳ_A are the sample means of X and Y in A, while s denotes a sample variance or covariance, according to the subscripts. The second phase pools the previous estimates with those in (a):

$$\hat\mu_Y = \hat\alpha_Y + \hat\beta_{YX} \hat\mu_X, \qquad \hat\sigma^2_Y = \hat\sigma^2_{Y|X} + \hat\beta^2_{YX} \hat\sigma^2_X, \qquad \hat\sigma_{XY} = \hat\beta_{YX} \hat\sigma^2_X.$$

Similar results hold for Z.

(c) Under the CIA, σ̂_YZ can be estimated by pooling the results of (a) and (b). When the CIA cannot be assumed, auxiliary information is necessary. For instance, if it is possible to assume ρ_YZ|X = ρ*_YZ|X (from an ad hoc survey, or from historical data), σ_YZ|X is estimated by:

$$\hat\sigma_{YZ|X} = \rho^*_{YZ|X} \sqrt{\hat\sigma^2_{Y|X} \hat\sigma^2_{Z|X}}.$$

(d) The maximum likelihood estimate of σ_YZ, pooling the results in (a), (b) and (c), is:

$$\hat\sigma_{YZ} = \hat\sigma_{YZ|X} + \frac{\hat\sigma_{YX} \hat\sigma_{XZ}}{\hat\sigma^2_X}.$$
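Since all of the estimates (a)-(d) are available in closed form, no iterative optimization is needed. The following sketch (Python with numpy; the function and variable names are ours, not from the paper) implements the four steps, taking the two files as arrays together with a postulated value ρ*_YZ|X (zero under the CIA).

```python
# A sketch of the closed-form ML estimates (a)-(d), assuming numpy arrays
# xA, yA (file A) and xB, zB (file B); rho_star is the postulated rho*_{YZ|X}.
import numpy as np

def ml_matching_estimates(xA, yA, xB, zB, rho_star=0.0):
    # (a) X parameters on the pooled file A U B (ML variances, denominator n)
    x_all = np.concatenate([xA, xB])
    mu_X, var_X = x_all.mean(), x_all.var()
    # (b) regression of Y on X estimated on A, then pooled with (a)
    beta_YX = np.cov(xA, yA, bias=True)[0, 1] / xA.var()
    alpha_Y = yA.mean() - beta_YX * xA.mean()
    var_Y_X = yA.var() - beta_YX**2 * xA.var()   # residual variance of Y | X
    mu_Y = alpha_Y + beta_YX * mu_X
    var_Y = var_Y_X + beta_YX**2 * var_X
    cov_XY = beta_YX * var_X
    # the analogous regression of Z on X, estimated on B
    beta_ZX = np.cov(xB, zB, bias=True)[0, 1] / xB.var()
    alpha_Z = zB.mean() - beta_ZX * xB.mean()
    var_Z_X = zB.var() - beta_ZX**2 * xB.var()   # residual variance of Z | X
    mu_Z = alpha_Z + beta_ZX * mu_X
    var_Z = var_Z_X + beta_ZX**2 * var_X
    cov_XZ = beta_ZX * var_X
    # (c) the postulated partial correlation fixes the inestimable sigma_{YZ|X}
    cov_YZ_X = rho_star * np.sqrt(var_Y_X * var_Z_X)
    # (d) pooling (a)-(c) gives the unconditional covariance sigma_YZ
    cov_YZ = cov_YZ_X + cov_XY * cov_XZ / var_X
    return (mu_X, mu_Y, mu_Z, var_X, var_Y, var_Z, cov_XY, cov_XZ, cov_YZ)
```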
2.2 RegRieps

Another method for the estimation of the parameters of Section 1 is that of [Räs02]. The parameters are estimated by least squares, and thus differ from the maximum likelihood ones in the denominator of the estimated variances, i.e. by the difference between the sample size and the degrees of freedom; this difference is negligible for large samples. The main difference concerns the estimation of the residual variances. Rässler's estimate is obtained as the sum of the squared differences between the observed values and the values predicted by the estimated regression function. For the residual variances of the regressions of Y on X and of Z on X, this estimate coincides with the maximum likelihood one obtained from the estimates in (a) and (b) of Section 2.1. The difference lies in the residual variances of Z on X and Y, and of Y on X and Z. For these components, some preliminary operations are needed in order to compute the residuals between predicted and observed values. For the residual variance of Z on X and Y, Rässler imputes Z̃ in the data set B according to the following steps: first the missing variable Ỹ is imputed through the estimated regression of Y on X and Z; then Z̃ is obtained by means of the regression of Z on X and Ỹ, and the residuals between the values Z̃ and Z are computed. This estimate is affected by the fact that the predicted values Z̃ lack all the variability of the unobserved, imputed Y values. The estimate of the residual variance of Y on X and Z is obtained analogously.

2.3 Consistent methods

Another technique for the estimation of the regression parameters is that used in [MS01]. The parameters are estimated by means of their observed counterparts: the means by the averages of the observed values, the variances by the observed variances, and so on. For the residual variance they suggest the estimate

$$\tilde\sigma^2_{Z|XY} = s^2_{Z;B} - \begin{pmatrix} s_{XZ;B} & \sigma^*_{YZ} \end{pmatrix} \begin{pmatrix} s^2_{X;A} & s_{XY;A} \\ s_{XY;A} & s^2_{Y;A} \end{pmatrix}^{-1} \begin{pmatrix} s_{XZ;B} \\ \sigma^*_{YZ} \end{pmatrix},$$

where σ*_YZ is a postulated value. Since this residual variance is built from estimates computed on different subsets, it may turn out to be negative (or, when X, Y and Z are multivariate, not positive definite).

2.4 Mixed methods

The regression of Section 2.3 is used by [MS01] within a mixed method. They propose the following two steps:

(i) Regression step. Intermediate values for the units in A are computed through

$$\tilde z_a = \hat\mu_Z + \frac{\hat\sigma_{ZX|Y}}{\hat\sigma^2_{X|Y}}(x_a - \hat\mu_X) + \frac{\hat\sigma_{ZY|X}}{\hat\sigma^2_{Y|X}}(y_a - \hat\mu_Y) + e_{Z|XY}$$

for each a = 1, ..., n_A, where e_Z|XY is randomly drawn from a Gaussian distribution with zero mean and residual variance σ̂²_Z|XY; analogously, for the units in B,

$$\tilde y_b = \hat\mu_Y + \frac{\hat\sigma_{YX|Z}}{\hat\sigma^2_{X|Z}}(x_b - \hat\mu_X) + \frac{\hat\sigma_{YZ|X}}{\hat\sigma^2_{Z|X}}(z_b - \hat\mu_Z) + e_{Y|XZ}$$

for each b = 1, ..., n_B, with e_Y|XZ randomly drawn from a Gaussian distribution with zero mean and residual variance σ̂²_Y|XZ.

(ii) Matching step. For each a = 1, ..., n_A, the value z_b* of the nearest record b* in B is imputed. The distance is computed between the couples, d((y_a, z̃_a), (ỹ_b, z_b)), and the suggested metric is the Mahalanobis one. The matching is constrained, i.e. the distance computation also takes into account the number of times a record is chosen as donor [ROD01].

The maximum likelihood approach is expected to give better results than those obtained by the techniques illustrated in Sections 2.2 and 2.3. Since [WIL01], the question has been raised of the gain in efficiency obtained by using ML estimates instead of the observed sample counterparts of the parameters to be estimated, as in MS. [LOR01] shows that the gain in efficiency from using µ̂_Y instead of ȳ_A for the parameter µ_Y is:

$$\frac{Var(\hat\mu_Y)}{Var(\bar y_A)} = 1 - \frac{n_B}{n_A + n_B}\, \rho^2_{XY}.$$

Thus, when both ρ²_XY and the fraction of missing data in B are sufficiently large, µ̂_Y is noticeably more efficient; for instance, with n_A = n_B = 500 and ρ_XY = 0.7, as in the simulations of Section 3, the variance ratio is 1 − 0.5 × 0.49 = 0.755.

The use of ML estimators is also expected to improve the estimates of the residual variances with respect to RegRieps and MS: for the former, the improvement concerns the underestimation of the residual variance by RegRieps, while for the latter it is mainly expected in terms of coherence of the estimates [MS04]. The experiments will be devoted to the comparison of these three methods of estimating the regression parameters. Moreover, the mixed method of Section 2.4 is compared with a mixed method where the regression step is performed according to the ML estimates; a sketch of the regression step is given below.
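Before turning to the simulations, the following sketch (Python with numpy; the function and variable names are ours) illustrates the regression step (i) of Section 2.4 for file A, assuming point estimates of the mean vector and covariance matrix of (X, Y, Z) have already been obtained (e.g. by MS or ML). The partial regression coefficients and the residual variance are computed in matrix form, which is equivalent to the ratio form of the formulas above; the step producing ỹ_b for file B is symmetric.

```python
# A sketch of the regression step (i) of the mixed method, assuming mu and
# Sigma collect the estimated mean vector and covariance matrix of (X, Y, Z).
import numpy as np

def regression_step_z(xA, yA, mu, Sigma, rng):
    """Draw intermediate values z~_a for file A from the regression of Z on (X, Y)."""
    S_zw = Sigma[2, :2]                    # (sigma_ZX, sigma_ZY)
    S_ww = Sigma[:2, :2]                   # covariance matrix of (X, Y)
    beta = np.linalg.solve(S_ww, S_zw)     # partial regression coefficients
    var_resid = Sigma[2, 2] - S_zw @ beta  # residual variance sigma^2_{Z|XY};
                                           # may be negative if Sigma is not
                                           # positive definite (see Section 2.3)
    mean = mu[2] + beta[0] * (xA - mu[0]) + beta[1] * (yA - mu[1])
    return mean + rng.normal(0.0, np.sqrt(var_resid), size=len(xA))
```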
3 Simulation results

In [DDS05] an extensive simulation is carried out on trivariate normal distributions with zero mean and unit variances. In this paper the attention is limited to the following correlation matrices:

$$\rho^{(1)} = \begin{pmatrix} 1 & 0.7 & 0.7 \\ 0.7 & 1 & 0.7 \\ 0.7 & 0.7 & 1 \end{pmatrix}, \qquad \rho^{(2)} = \begin{pmatrix} 1 & 0.5 & 0.95 \\ 0.5 & 1 & 0.7 \\ 0.95 & 0.7 & 1 \end{pmatrix},$$

where moderate (ρ^(1)) and high (ρ^(2)) correlation between X and Z is assumed. In this way, it is shown how important this relationship is for the statistical matching problem. Other simulations, for different values of ρ_YZ, are investigated in [DDS05].

For each normal distribution a random sample of 1 000 units is drawn and randomly split into two subsamples of 500 units each. Then the variable Z is deleted in A, while the variable Y is deleted in B. The two files A and B are used as input of the three previously introduced procedures (MS, RegRieps, ML) in order to estimate the parameters of the normal distribution and, in particular, ρ_YZ. Obviously, given that ρ_YZ cannot be estimated directly from the available data files, all the procedures start from a preassigned value ρ*_YZ for it (or, equivalently, a value ρ*_YZ|X postulated for ρ_YZ|X).

This procedure is repeated 1 000 times for each combination of starting correlation matrix and postulated value of ρ_YZ (or ρ_YZ|X). The 1 000 estimates provided by each method for the parameters of the trivariate normal distribution are compared with the true parameters, and the following synthetic measures are computed: (i) the simulation bias ('Bias' in the tables), i.e. the average of the 1 000 estimates of each parameter minus the true parameter value; (ii) the simulation MSE ('MSE' in the tables), obtained as the average of the 1 000 squared differences between the estimated and true parameter values.

In [DDS05] several different values of ρ*_YZ are considered, including the value corresponding to the CIA (ρ_YZ|X = 0). The results presented below are limited, for ρ^(1), to ρ*_YZ equal to 0.49 (ρ*_YZ|X = 0.0, i.e. the CIA) and to 0.694. As far as ρ^(2) is concerned, we considered ρ*_YZ = 0.475 (ρ*_YZ|X = 0.0, i.e. the CIA) and ρ*_YZ = 0.69.
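As an illustration, one cell of this simulation design could be coded as follows (Python; this reuses the ml_matching_estimates sketch of Section 2.1 and, for brevity, tracks only the estimate of σ_YZ under the CIA for ρ^(1); the seed and structure are our choices, and the paper's simulations were actually run in R).

```python
# A sketch of one cell of the Section 3 simulation design: draw, split,
# estimate, and summarize simulation Bias and MSE over the repetitions.
import numpy as np

rng = np.random.default_rng(0)
rho1 = np.array([[1.0, 0.7, 0.7],
                 [0.7, 1.0, 0.7],
                 [0.7, 0.7, 1.0]])      # correlation (= covariance) matrix rho(1)
true_sigma_YZ = rho1[1, 2]
estimates = []
for _ in range(1000):
    sample = rng.multivariate_normal(np.zeros(3), rho1, size=1000)
    A, B = sample[:500], sample[500:]   # random split into two files
    xA, yA = A[:, 0], A[:, 1]           # Z is deleted in A
    xB, zB = B[:, 0], B[:, 2]           # Y is deleted in B
    est = ml_matching_estimates(xA, yA, xB, zB, rho_star=0.0)  # CIA postulate
    estimates.append(est[-1])           # ML estimate of sigma_YZ
estimates = np.asarray(estimates)
bias = estimates.mean() - true_sigma_YZ        # 'Bias' of the tables
mse = ((estimates - true_sigma_YZ)**2).mean()  # 'MSE' of the tables
# Under the CIA postulate the estimates concentrate around the starting value
# 0.49 rather than the true 0.7, as noted in the comments to Tables 1 and 2.
print(bias, mse)
```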
The results in Tables 1 and 2 show that ML performs slightly better than the other methods in terms of both (simulation) Bias and MSE. This is more evident when a high correlation between X and Z is assumed. Note that the performance of RegRieps is close to that of the ML method in the estimation of means and correlations, while some problems arise, as expected, in the estimation of the residual variances σ²_Y|XZ and σ²_Z|XY. The tables provide no results for the correlation coefficient ρ_YZ given that, as expected, all the methods yield exactly the starting value, i.e. ρ*_YZ in MS and ρ*_YZ|X in RegRieps and ML.

Table 1. Simulation results for the estimation of some parameters of the trivariate normal distribution with correlation matrix ρ^(1)

                     ρ*_YZ = 0.49 (CIA)              ρ*_YZ = 0.694
                     MS       RegRieps  ML           MS       RegRieps  ML
|Bias(µ̂_Y)|         0.00114  0.00084   0.00084      0.00210  0.00162   0.00162
MSE(µ̂_Y)            0.00198  0.00151   0.00151      0.00190  0.00145   0.00145
|Bias(µ̂_Z)|         0.00094  0.00063   0.00063      0.00091  0.00050   0.00050
MSE(µ̂_Z)            0.00207  0.00154   0.00154      0.00187  0.00145   0.00145
|Bias(ρ̂_XY)|        0.00108  0.00111   0.00040      0.00059  0.00027   0.00045
MSE(ρ̂_XY)           0.00106  0.00045   0.00045      0.00109  0.00044   0.00044
|Bias(ρ̂_XZ)|        0.00033  0.00034   0.00038      0.00092  0.00074   0.00002
MSE(ρ̂_XZ)           0.00097  0.00045   0.00045      0.00100  0.00044   0.00044
|Bias(σ̂²_Y|XZ)|     0.07916  0.09074   0.08348      0.00257  0.05559   0.00269
MSE(σ̂²_Y|XZ)        0.00932  0.00923   0.00794      0.00149  0.00418   0.00070
|Bias(σ̂²_Z|XY)|     0.07946  0.09184   0.08456      0.00026  0.05367   0.00400
MSE(σ̂²_Z|XY)        0.00937  0.00952   0.00822      0.00151  0.00406   0.00078

Table 2. Simulation results for the estimation of some parameters of the trivariate normal distribution with correlation matrix ρ^(2)

                     ρ*_YZ = 0.475 (CIA)             ρ*_YZ = 0.69
                     MS       RegRieps  ML           MS       RegRieps   ML
|Bias(µ̂_Y)|         0.00134  0.00089   0.00089      0.00196  0.00179    0.00179
MSE(µ̂_Y)            0.00200  0.00174   0.00174      0.00186  0.00169    0.00169
|Bias(µ̂_Z)|         0.00060  0.00042   0.00042      0.00103  0.00134    0.00134
MSE(µ̂_Z)            0.00200  0.00115   0.00115      0.00211  0.00110    0.00110
|Bias(ρ̂_XY)|        0.00070  0.00080   0.00005      0.00058  0.00041    0.00034
MSE(ρ̂_XY)           0.00147  0.00110   0.00110      0.00148  0.00111    0.00111
|Bias(ρ̂_XZ)|        0.00035  0.00002   0.00021      0.00059  0.00001    0.00019
MSE(ρ̂_XZ)           0.00054  0.00001   0.00001      0.00056  0.00001    0.00001
|Bias(σ̂²_Y|XZ)|     0.49994  0.73044   0.51846      0.03970  0.74542    0.03826
MSE(σ̂²_Y|XZ)        0.25348  0.53744   0.27112      0.02518  0.56181    0.00175
|Bias(σ̂²_Z|XY)|     0.06468  0.27534   0.06700      0.01611  3.65431    0.00493
MSE(σ̂²_Z|XY)        0.00874  0.07749   0.00453      0.00269  13.69306   0.00003

Additional simulations have been carried out in order to compare the mixed method (MS-mix in the tables) suggested by [MS01] (see Section 2.4) with a similar method which uses the ML parameter estimates in the regression step (ML-mix). The simulation steps are essentially the same as those introduced before, with the further addition of a final donor step: at first A is considered as recipient and Z is imputed; then the role of recipient is assigned to B and Y is imputed. One possible implementation of the constrained donor step is sketched below.
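The paper does not prescribe a specific algorithm for the constrained matching; since n_A = n_B here, one natural option is an optimal one-to-one assignment on the matrix of Mahalanobis distances, which prevents any donor from being used more than once. The sketch below (Python with numpy and scipy; the function name and the use of linear_sum_assignment are our choices, not taken from [MS01] or [ROD01]) illustrates the imputation of Z in file A.

```python
# A sketch of a constrained matching step: each record in B serves as donor
# exactly once, and the total Mahalanobis distance between couples is minimized.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def constrained_match_z(yA, zA_tilde, yB_tilde, zB):
    """Impute the live z values of B into A under a one-to-one donor constraint."""
    recipients = np.column_stack([yA, zA_tilde])   # couples (y_a, z~_a)
    donors = np.column_stack([yB_tilde, zB])       # couples (y~_b, z_b)
    VI = np.linalg.inv(np.cov(np.vstack([recipients, donors]).T))
    cost = cdist(recipients, donors, metric='mahalanobis', VI=VI)
    rows, cols = linear_sum_assignment(cost)       # optimal constrained matching
    z_imputed = np.empty(len(yA))
    z_imputed[rows] = zB[cols]                     # z value of the matched donor b*
    return z_imputed
```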
Due to the heavy computational effort required by the application of constrained donor imputation techniques, the entire simulation procedure has been iterated 500 times for each combination of the starting correlation matrix with the postulated value of ρ*_YZ (ρ*_YZ = 0.694 and ρ*_YZ = 0.898 are considered for ρ^(1), while ρ*_YZ = 0.502 and ρ*_YZ = 0.6913 are considered for ρ^(2)). The results of the simulations are summarized in Tables 3 and 4. All the simulations have been carried out in the R environment [R].

Table 3. Simulation results for the estimation of the correlation coefficients of ρ^(1)

                            ρ*_YZ = 0.694          ρ*_YZ = 0.898
par     file                MS-mix    ML-mix       MS-mix    ML-mix
ρ_XY    B       |Bias|      0.01021   0.00920      0.00946   0.00936
                MSE         0.00269   0.00099      0.00218   0.00078
ρ_XZ    A       |Bias|      0.00912   0.01095      0.00872   0.01191
                MSE         0.00249   0.00101      0.00223   0.00089
ρ_YZ    A,B     |Bias|      0.01510   0.01600      0.18640   0.18408
                MSE         0.00068   0.00089      0.03494   0.03412

Table 4. Simulation results for the estimation of the correlation coefficients of ρ^(2)

                            ρ*_YZ = 0.502          ρ*_YZ = 0.6913
par     file                MS-mix    ML-mix       MS-mix    ML-mix
ρ_XY    B       |Bias|      0.00352   0.00792      0.00409   0.00734
                MSE         0.00233   0.00165      0.00232   0.00150
ρ_XZ    A       |Bias|      0.01762   0.01297      0.02283   0.01294
                MSE         0.00185   0.00020      0.00157   0.00021
ρ_YZ    A,B     |Bias|      0.20474   0.20621      0.02723   0.01863
                MSE         0.04241   0.04390      0.00461   0.00137

The results in Tables 3 and 4 show that the two methods perform similarly as far as bias is concerned. The ML-mix method seems slightly more stable, given that its simulation MSE is almost always smaller than the MS-mix MSE. These results show that the ML approach turns out to be preferable, because of its better performance in terms of estimation. This finding is also confirmed by the simulation results reported in [DDS05]. Furthermore, the problem of the coherence of estimates from different files is naturally resolved.

References

[DDS01] D'Orazio, M., Di Zio, M., Scanu, M.: Statistical Matching: a tool for integrating data in National Statistical Institutes. Proceedings of NTTS-ETK 2001, Hersonissos (Greece), 18–22 June 2001, vol. I, 433–441 (2001)
[DDS02] D'Orazio, M., Di Zio, M., Scanu, M.: Statistical Matching and Official Statistics. Rivista di Statistica Ufficiale 1, 5–24 (2002)
[DDS05] D'Orazio, M., Di Zio, M., Scanu, M.: A comparison among different estimators of regression parameters on statistically matched files through an extensive simulation study. Istituto Nazionale di Statistica, Roma, Contributi 2005/10 (2005)
[DDS06] D'Orazio, M., Di Zio, M., Scanu, M.: Statistical Matching: Theory and Practice. Wiley, Chichester (2006)
[LOR01] Lord, F. M.: Estimation of parameters from incomplete data. Journal of the American Statistical Association 50, 870–876 (1955)
[MS01] Moriarity, C., Scheuren, F.: Statistical Matching: a Paradigm for Assessing the Uncertainty in the Procedure. Journal of Official Statistics 17, 407–422 (2001)
[MS04] Moriarity, C., Scheuren, F.: Regression-based statistical matching: recent developments. Proceedings of the Section on Survey Research Methods, American Statistical Association (2004)
[Räs02] Rässler, S.: Statistical Matching: a Frequentist Theory, Practical Applications and Alternative Bayesian Approaches. Springer, New York (2002)
[R] R Development Core Team: R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria (2005)
[ROD01] Rodgers, W. L.: An Evaluation of Statistical Matching. Journal of Business and Economic Statistics 2, 91–102 (1984)
[SMK93] Singh, A., Mantel, H., Kinack, M., Rowe, G.: Statistical matching: use of auxiliary information as an alternative to the conditional independence assumption. Survey Methodology 19, 59–79 (1993)
[WIL01] Wilks, S. S.: Moments and distributions of estimates of population parameters from fragmentary samples. Annals of Mathematical Statistics 3, 163–194 (1932)