S1 Appendix.

Module-based Association Analysis for
Omics Data with Network Structure
Zhi Wang1, Arnab Maity2, Chuhsing Kate Hsiao3, Deepak Voora4,
Rima Kaddurah-Daouk5, Jung-Ying Tzeng1,2,6
1: Bioinformatics Research Center, North Carolina State University, Raleigh NC, 27695, USA
2: Department of Statistics, North Carolina State University, Raleigh NC, 27695, USA
3: Institute of Epidemiology and Preventive Medicine, College of Public Health, National
Taiwan University, Taipei, Taiwan
4: Institute for Genome Sciences and Policy, Duke University, Durham, NC, USA
5: Department of Psychiatry and Behavioral Sciences, Duke University, Durham, NC, USA
6: Department of Statistics, National Cheng-Kung University, Taiwan, R.O.C.
RUNNING TITLE: Module-based analysis for structured omics data
ADDRESS FOR CORRESPONDENCE:
Jung-Ying Tzeng, Department of Statistics and Bioinformatics Research Center,
North Carolina State University, Campus Box 7566, Raleigh NC, 27695, USA.
Tel: 919-513-2723. Fax: 919-515-7315. E-mail:jytzeng@stat.ncsu.edu.
KEY WORDS: module structure; network structure; association analysis; metabolomics
APPENDIX
Derivation of the score test statistics and their distributions
Consider the linear mixed model representation given in model (3). As our primary
interest is to test the variance components $\tau_1$, $\tau_2$ and $\tau_{12}$, we use the
restricted maximum likelihood (REML) function to estimate the variance
components $(\tau_1, \tau_2, \tau_{12}, \sigma)$. The REML log-likelihood under model (3) is
$$\ell_{REML}(\tau_1, \tau_2, \tau_{12}, \sigma; Y) = -\{\log|V| + \log|Z^T V^{-1} Z| + Y^T P Y\}/2,$$
where $V = \tau_1 K_1 + \tau_2 K_2 + \tau_{12} K_{12} + \sigma I$ is the marginal variance of $Y$ and
$P = V^{-1} - V^{-1} Z (Z^T V^{-1} Z)^{-1} Z^T V^{-1}$ is a projection matrix. The score functions based on the
REML can be obtained as below (Harville 1977):
Under $H_0^{X_1*X_2}: \tau_{12} = 0$,
$$U_{\tau_{12}}(\hat\tau_1, \hat\tau_2, 0, \hat\sigma) = \left.\frac{\partial \ell_{REML}(\tau_1, \tau_2, \tau_{12}, \sigma)}{\partial \tau_{12}}\right|_{\tau_{12}=0,\ \tau_1=\hat\tau_1,\ \tau_2=\hat\tau_2,\ \sigma=\hat\sigma_{X_1*X_2}} = \frac{1}{2}\{Y^T P_{12} K_{12} P_{12} Y - tr(P_{12} K_{12})\}.$$
Under $H_0^{X_1|X_2}: \tau_1 = 0$ with the constraint $\tau_{12} = 0$,
$$U_{\tau_1}(0, \tilde\tau_2, 0, \tilde\sigma) = \left.\frac{\partial \ell_{REML}(\tau_1, \tau_2, \tau_{12}, \sigma)}{\partial \tau_1}\right|_{\tau_{12}=0,\ \tau_1=0,\ \tau_2=\tilde\tau_2,\ \sigma=\tilde\sigma_{X_1|X_2}} = \frac{1}{2}\{Y^T P_1 K_1 P_1 Y - tr(P_1 K_1)\}.$$
Under $H_0^{X_2|X_1}: \tau_2 = 0$ with the constraint $\tau_{12} = 0$,
$$U_{\tau_2}(\tilde\tau_1, 0, 0, \tilde\sigma) = \left.\frac{\partial \ell_{REML}(\tau_1, \tau_2, \tau_{12}, \sigma)}{\partial \tau_2}\right|_{\tau_{12}=0,\ \tau_1=\tilde\tau_1,\ \tau_2=0,\ \sigma=\tilde\sigma_{X_2|X_1}} = \frac{1}{2}\{Y^T P_2 K_2 P_2 Y - tr(P_2 K_2)\},$$
where $P_d = V_d^{-1} - V_d^{-1} Z (Z^T V_d^{-1} Z)^{-1} Z^T V_d^{-1}$ for $d \in \{12, 1, 2\}$, with $V_{12} = \tau_1 K_1 + \tau_2 K_2 + \sigma I$,
$V_1 = \tau_2 K_2 + \sigma I$ and $V_2 = \tau_1 K_1 + \sigma I$.
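The score expressions above can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation. The toy data, the kernels built as scaled cross-products, the element-wise product used for $K_{12}$, and the fixed values standing in for the null REML estimates are all assumptions made only for illustration; the function names are hypothetical.

```python
import numpy as np

def projection(V, Z):
    """P = V^{-1} - V^{-1} Z (Z^T V^{-1} Z)^{-1} Z^T V^{-1}."""
    Vinv = np.linalg.inv(V)
    Vinv_Z = Vinv @ Z
    return Vinv - Vinv_Z @ np.linalg.solve(Z.T @ Vinv_Z, Vinv_Z.T)

def reml_loglik(Y, Z, V):
    """REML log-likelihood -{log|V| + log|Z^T V^{-1} Z| + Y^T P Y}/2."""
    logdet_V = np.linalg.slogdet(V)[1]
    logdet_ZVZ = np.linalg.slogdet(Z.T @ np.linalg.solve(V, Z))[1]
    return -0.5 * (logdet_V + logdet_ZVZ + Y @ projection(V, Z) @ Y)

def reml_score(Y, Z, K, V_null):
    """Score {Y^T P K P Y - tr(P K)}/2 for the component with kernel K,
    with P built from the marginal variance fitted under the null."""
    P = projection(V_null, Z)
    PY = P @ Y
    return 0.5 * (PY @ K @ PY - np.trace(P @ K))

# Toy data (assumed for illustration only).
rng = np.random.default_rng(0)
n, p = 40, 60
Z = np.column_stack([np.ones(n), rng.normal(size=n)])
X1, X2 = rng.normal(size=(n, p)), rng.normal(size=(n, p))
K1, K2 = X1 @ X1.T / p, X2 @ X2.T / p
K12 = K1 * K2  # assumed interaction kernel (element-wise product)
Y = Z @ np.array([1.0, 0.5]) + rng.normal(size=n)

# Made-up values standing in for the REML estimates under tau_12 = 0.
V12 = 0.5 * K1 + 0.3 * K2 + 1.0 * np.eye(n)
U_tau12 = reml_score(Y, Z, K12, V12)
```

Under these assumptions, `U_tau12` agrees with a finite-difference derivative of `reml_loglik` in the direction of $K_{12}$, which is one way to confirm that the closed form matches the log-likelihood it was derived from.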
NULL DISTRIBUTION OF THE SCORE STATISTICS FOR GE TEST
Because the score statistics are not asymptotically normal (Tzeng and Zhang 2007), we
use the first term of each score statistic as the test statistic. For the interaction test,
the test statistic is $T_{X_1*X_2} = \frac{1}{2} Y^T P_{12} K_{12} P_{12} Y$. Define $\mu = Z\beta$; then
$T_{X_1*X_2} = \frac{1}{2} (Y - \mu)^T P_{12} K_{12} P_{12} (Y - \mu)$ because $\mu^T P_{12} = 0$. Further, we can rewrite
$$T_{X_1*X_2} = \frac{1}{2} C^T (V^{1/2} P_{12} K_{12} P_{12} V^{1/2}) C,$$
where $C = V^{-1/2}(Y - \mu)$ follows a standard multivariate normal distribution. Let $e_i$ and $\eta_i$
denote the eigenvectors and eigenvalues of the matrix $V^{1/2} P_{12} K_{12} P_{12} V^{1/2}/2$, respectively. Then
$$T_{X_1*X_2} = \sum_{i=1}^{c} \eta_i (e_i^T C)^2 \equiv \sum_{i=1}^{c} \eta_i \tilde C_i^2,$$
where each $\tilde C_i^2$ follows a chi-square distribution with 1 df. Therefore the distribution of $T_{X_1*X_2}$
can be approximated by the distribution of $\sum_{i=1}^{c} \hat\eta_i \chi_{i1}^2$, where the $\hat\eta_i$'s are the non-zero
eigenvalues of $V^{1/2} P_{12} K_{12} P_{12} V^{1/2}/2$ evaluated at $\tau_{12}=0$, $\tau_1=\hat\tau_1$, $\tau_2=\hat\tau_2$, $\sigma=\hat\sigma_{X_1*X_2}$.
Hence, we can use a moment matching approach to obtain p-values (Duchesne and Lafaye De Micheaux
2010).
Above we use the interaction test as an example and derive the test statistic
and its null distribution. By a similar argument, we can approximate the null
distributions of $T_{X_1|X_2}$ and $T_{X_2|X_1}$ using the distribution of $\sum_{i=1}^{c} \hat\eta_i \chi_{i1}^2$, where the $\hat\eta_i$'s are
the non-zero eigenvalues of $V^{1/2} P_1 K_1 P_1 V^{1/2}/2$ evaluated at $\tau_{12}=0$, $\tau_1=0$, $\tau_2=\tilde\tau_2$, $\sigma=\tilde\sigma_{X_1|X_2}$
and of $V^{1/2} P_2 K_2 P_2 V^{1/2}/2$ evaluated at $\tau_{12}=0$, $\tau_1=\tilde\tau_1$, $\tau_2=0$, $\sigma=\tilde\sigma_{X_2|X_1}$, respectively.
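The weighted chi-square representation above can be sketched in NumPy as follows. This is an illustrative sketch only: the toy data, the kernels and the fixed variance values are assumptions, and the p-value is obtained by plain Monte Carlo simulation of $\sum_i \hat\eta_i \chi_{i1}^2$, which is a simple stand-in and not the moment matching method cited in the text. The function names are hypothetical.

```python
import numpy as np

def projection(V, Z):
    """P = V^{-1} - V^{-1} Z (Z^T V^{-1} Z)^{-1} Z^T V^{-1}."""
    Vinv = np.linalg.inv(V)
    Vinv_Z = Vinv @ Z
    return Vinv - Vinv_Z @ np.linalg.solve(Z.T @ Vinv_Z, Vinv_Z.T)

def sym_sqrt(V):
    """Symmetric square root V^{1/2} of a positive definite matrix."""
    w, U = np.linalg.eigh(V)
    return (U * np.sqrt(w)) @ U.T

def mixture_weights(V, P, K, rel_tol=1e-10):
    """Non-zero eigenvalues of V^{1/2} P K P V^{1/2} / 2."""
    Vh = sym_sqrt(V)
    M = Vh @ P @ K @ P @ Vh / 2.0
    eta = np.linalg.eigvalsh((M + M.T) / 2.0)  # symmetrize against round-off
    return eta[np.abs(eta) > rel_tol * np.abs(eta).max()]

def mc_pvalue(T_obs, eta, n_sim=20_000, seed=1):
    """Monte Carlo tail probability of sum_i eta_i * chi2_1 (assumed stand-in)."""
    rng = np.random.default_rng(seed)
    draws = rng.chisquare(1, size=(n_sim, eta.size)) @ eta
    return float(np.mean(draws >= T_obs))

# Toy data (assumed for illustration only).
rng = np.random.default_rng(0)
n, p = 40, 60
Z = np.column_stack([np.ones(n), rng.normal(size=n)])
X1, X2 = rng.normal(size=(n, p)), rng.normal(size=(n, p))
K1, K2 = X1 @ X1.T / p, X2 @ X2.T / p
K12 = K1 * K2  # assumed interaction kernel
Y = Z @ np.array([1.0, 0.5]) + rng.normal(size=n)

V = 0.5 * K1 + 0.3 * K2 + np.eye(n)  # made-up null variance estimates
P = projection(V, Z)
T_obs = 0.5 * Y @ P @ K12 @ P @ Y
eta = mixture_weights(V, P, K12)
p_value = mc_pvalue(T_obs, eta)
```

A useful sanity check on the weights is that they sum to $tr(P_{12} K_{12})/2$, the second term of the score statistic, so the mixture has the same null mean as the test statistic.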
EM ALGORITHM FOR THE REML ESTIMATES OF $\tau_1$ AND $\tau_2$ WHEN TESTING $H_0^{X_1*X_2}: \tau_{12} = 0$
Using the interaction test ($T_{X_1*X_2}$) as an example, we derive the EM algorithm for
estimating the nuisance variance components (VC), $\tau_1$, $\tau_2$ and $\sigma$, under $H_0^{X_1*X_2}$. The
EM algorithms for estimating nuisance VCs for the $X_1|X_2$ test and the $X_2|X_1$ test can
be obtained by zeroing out the corresponding variance components. In short, the
derivation of the EM algorithm is similar to the one derived in Tzeng et al. (2011).
Let $u = A^T Y$ with $A^T A = I_{(n-d)\times(n-d)}$ and $A A^T = I - Z(Z^T Z)^{-1} Z^T$. Then $f(u|h_1, h_2)$ is a
normal density with mean $A^T h_1 + A^T h_2$ and variance $\sigma I$, and does not depend
on the fixed effect $\beta$. Therefore, the REML estimators of $\tau_1$ and $\tau_2$ can be based on
the marginal distribution of $u$, $f(u) = \int\int f(u|h_1, h_2) f(h_1, h_2)\, dh_1\, dh_2$. This motivates
the EM algorithm based on observed data $u$ and missing data $h_1$ and $h_2$.
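One way to obtain a matrix $A$ with the two stated properties is to take an orthonormal basis of the orthogonal complement of the column space of $Z$. The sketch below does this with a full singular value decomposition; the construction, the toy design matrix and the function name are assumptions for illustration, and $Z$ is assumed to have full column rank.

```python
import numpy as np

def error_contrast_basis(Z):
    """Return A (n x (n-d)) whose columns are an orthonormal basis of the
    orthogonal complement of col(Z); assumes Z has full column rank d."""
    n, d = Z.shape
    U, _, _ = np.linalg.svd(Z, full_matrices=True)  # U is n x n orthogonal
    return U[:, d:]  # the last n-d left singular vectors are orthogonal to col(Z)

# Toy design matrix and response (assumed for illustration only).
rng = np.random.default_rng(0)
n = 30
Z = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([2.0, -1.0])
noise = rng.normal(size=n)
Y = Z @ beta + noise

A = error_contrast_basis(Z)
u = A.T @ Y  # transformed data, free of the fixed effect beta
```

Because $A^T Z = 0$, replacing `beta` by any other value leaves `u` unchanged, which is the sense in which the distribution of $u$ does not depend on the fixed effect.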
The complete data log likelihood is given by
$$\begin{aligned}
\log f(u, h_1, h_2; \tau_1, \tau_2, \sigma) &= \log f(u|h_1, h_2; \tau_1, \tau_2, \sigma) + \log f(h_2; \tau_2, \sigma) + \log f(h_1; \tau_1, \sigma) \\
&= -\frac{n-d}{2}\log\sigma - \frac{1}{2\sigma}(u - A^T h_1 - A^T h_2)^T (u - A^T h_1 - A^T h_2) \\
&\quad - \frac{n}{2}\log\tau_1 - \frac{1}{2}\log|K_1| - \frac{1}{2\tau_1} h_1^T K_1^{-1} h_1 \\
&\quad - \frac{n}{2}\log\tau_2 - \frac{1}{2}\log|K_2| - \frac{1}{2\tau_2} h_2^T K_2^{-1} h_2.
\end{aligned}$$
In the expectation step, we calculate $Q(\tau_1, \tau_2, \sigma|\hat\tau_1^{(t)}, \hat\tau_2^{(t)}, \hat\sigma^{(t)})$, the expected value of the
complete data log likelihood conditional on the observed data $u$ under the
current (the $t$-th iteration) estimates of the parameters, $\hat\tau_1^{(t)}$, $\hat\tau_2^{(t)}$ and $\hat\sigma^{(t)}$:
$$\begin{aligned}
Q(\tau_1, \tau_2, \sigma|\hat\tau_1^{(t)}, \hat\tau_2^{(t)}, \hat\sigma^{(t)}) &= E[\log f(u, h_1, h_2; \tau_1, \tau_2, \sigma)\,|\,u; \hat\tau_1^{(t)}, \hat\tau_2^{(t)}, \hat\sigma^{(t)}] \\
&= -\frac{n-d}{2}\log\sigma - \frac{1}{2\sigma} E\{(u - A^T h_1 - A^T h_2)^T (u - A^T h_1 - A^T h_2)\,|\,u; \hat\tau_1^{(t)}, \hat\tau_2^{(t)}, \hat\sigma^{(t)}\} \\
&\quad - \frac{n}{2}\log\tau_1 - \frac{1}{2}\log|K_1| - \frac{1}{2\tau_1} E\{h_1^T K_1^{-1} h_1\,|\,u; \hat\tau_1^{(t)}, \hat\tau_2^{(t)}, \hat\sigma^{(t)}\} \\
&\quad - \frac{n}{2}\log\tau_2 - \frac{1}{2}\log|K_2| - \frac{1}{2\tau_2} E\{h_2^T K_2^{-1} h_2\,|\,u; \hat\tau_1^{(t)}, \hat\tau_2^{(t)}, \hat\sigma^{(t)}\}.
\end{aligned}$$
(𝑑)
(𝑑)
πœ•π‘„
In the maximization step, we maximize 𝑄(𝜏1 , 𝜏2 , 𝜎|πœΜ‚1 , πœΜ‚ 2 , πœŽΜ‚ (𝑑) ) by solving πœ•πœ =
πœ•π‘„
πœ•π‘„
2
πœ•πœŽ
0, πœ•πœ = 0 and
(𝑑+1)
πœΜ‚1
(𝑑+1)
πœΜ‚2
πœŽΜ‚ (𝑑+1)
1
= 0 and obtain the following estimates
1
(𝑑) (𝑑)
𝐸{β„Ž1𝑇 𝐾1−1 β„Ž1 |𝑒; πœΜ‚1 , πœΜ‚2 , πœŽΜ‚ (𝑑) }
𝑛
1
= {πœΜ‚1 π‘Œ 𝑇 𝑃12 𝐾1 𝑃12 π‘Œ + π‘‘π‘Ÿ(𝜏1 𝐼 − 𝜏12 𝑃12 𝐾1 )};
𝑛
1
(𝑑) (𝑑)
= 𝐸{β„Ž2𝑇 𝐾2−1 β„Ž2 |𝑒; πœΜ‚1 , πœΜ‚ 2 , πœŽΜ‚ (𝑑) }
𝑛
1
= {πœΜ‚ 2 π‘Œ 𝑇 𝑃12 𝐾2 𝑃12 π‘Œ + π‘‘π‘Ÿ(𝜏2 𝐼 − 𝜏22 𝑃12 𝐾2 )};
𝑛
1
(𝑑) (𝑑)
=
𝐸{(𝑒 − 𝐴𝑇 β„Ž1 − 𝐴𝑇 β„Ž2 )𝑇 (𝑒 − 𝐴𝑇 β„Ž1 − 𝐴𝑇 β„Ž2 )|𝑒; πœΜ‚1 , πœΜ‚2 , πœŽΜ‚ (𝑑) }
𝑛−𝑑
Μƒ )𝑇 𝐴𝐴𝑇 (π‘Œ − 𝑀
Μƒ ) + π‘‘π‘Ÿ(𝐴𝑇 𝑉̃ 𝐴),
= (π‘Œ − 𝑀
=
4
(𝑑)
(𝑑)
Μƒ = 𝐸(β„Ž1 + β„Ž2 |𝑒; πœΜ‚1 , πœΜ‚ 2 , πœŽΜ‚ (𝑑) ) = (𝜏1 𝐾1 +
where 𝐴𝐴𝑇 = 𝐼 − 𝑍(𝑍 𝑇 𝑍)−1 𝑍 𝑇 , 𝑀
(𝑑) (𝑑)
𝜏2 𝐾2 )𝑃12 , ̃𝑉 = π‘£π‘Žπ‘Ÿ(β„Ž1 + β„Ž2 |𝑒; πœΜ‚1 , πœΜ‚ 2 , πœŽΜ‚ (𝑑) ) = 𝜏1 𝐾1 − 𝜏12 𝐾1 𝑃12 𝐾1 + 𝜏2 𝐾2 −
Μƒ and 𝑉̃ are obtained from the joint distribution of
𝜏22 𝐾2 𝑃12 𝐾2 − 2𝜏1 𝜏2 𝐾2 𝑃12 𝐾1 , and 𝑀
(𝑒, β„Ž1 , β„Ž2 ).
REFERENCES
Duchesne P, Lafaye De Micheaux P. 2010. Computing the distribution of quadratic
forms: Further comparisons between the Liu–Tang–Zhang approximation
and exact methods. Comput Stat Data Anal 54: 858-862.
Harville D. 1977. Maximum likelihood approaches to variance component
estimation and related problems. J Am Stat Assoc 72:322–340.
Tzeng JY, Zhang D. 2007. Haplotype-based association analysis via variance-components
score test. Am J Hum Genet 81: 927-38.
Tzeng JY, Zhang D, Pongpanich M, Smith C, McCarthy MI, Sale MM, Worrall BB, Hsu
FC, Thomas DC, Sullivan PF. 2011. Studying gene and gene-environment
effects of uncommon and common variants on continuous traits: a marker-set
approach using gene-trait similarity regression. Am J Hum Genet 12: 277-88.
Zhang B, Horvath S. 2005. A general framework for weighted gene co-expression
network analysis. Stat Appl Genet Molec Biol 4: 1128.