Human Molecular Genetics Advance Access originally published online on July 31, 2007
Human Molecular Genetics 2007 16(20):2494-2505; doi:10.1093/hmg/ddm205
Evaluation of genome-wide power of genetic association studies based on empirical data from the HapMap project
1 Department of Hematology/Oncology, 2 Department of Cell Therapy and Transplantation Medicine, Graduate School of Medicine and 3 Department of Information and Communication Engineering, Graduate School of Information Science, University of Tokyo, Tokyo 113-8655, Japan and 4 Core Research for Evolutional Science and Technology, Japan Science and Technology Agency, Saitama 332-0012, Japan
* To whom correspondence should be addressed at: Department of Cell Therapy and Transplantation Medicine, The 21st Century COE Program, Graduate School of Medicine, University of Tokyo, 7-3-1, Hongo, Bunkyo-ku, Tokyo 113-8655, Japan. Tel: +81 358008741; Fax: +81 358046261; Email: sogawa-tky{at}umin.ac.jp
Received April 7, 2007; Accepted July 22, 2007
ABSTRACT
With recent advances in high-throughput single nucleotide polymorphism (SNP) typing technologies, genome-wide association studies have become a realistic approach to identifying the causative genes responsible for common diseases with complex genetic traits. In this strategy, the trade-off between increased genome coverage and the chance of finding SNPs that incidentally show a large statistic becomes serious because of extreme multiple-hypothesis testing. We investigated the extent to which this trade-off limits the genome-wide power of this approach by simulating a large number of case-control panels based on empirical data from the HapMap Project. In our simulations, the statistical costs of multiple-hypothesis testing were evaluated by empirically calculating distributions of the maximum value of the χ2 statistics for a series of marker sets with increasing numbers of SNPs; these distributions were used to determine genome-wide thresholds in the subsequent power simulations. With a practical study size, the cost of multiple testing largely offsets the potential benefits of increased genome coverage given modest genetic effects and/or low frequencies of causal alleles. In most realistic scenarios, increasing genome coverage becomes less influential on power, while sample size is the predominant determinant of the feasibility of genome-wide association tests. Increasing genome coverage without a corresponding increase in sample size will only consume resources with little gain in power. For common causal alleles with relatively large effect sizes [genotype relative risk ≥1.7], we can expect satisfactory power with currently available large-scale genotyping platforms using a realistic sample size (approximately 1000 per arm).
INTRODUCTION
Genome-wide association studies have been proposed as a strategy to identify genetic factors with small to moderate genetic effects in the development of human diseases, as typically assumed for a common disease common variant (CDCV) model (1). In this strategy, a disease-associated locus is identified through single nucleotide polymorphisms (SNPs) that show significantly different allele frequencies between affected (cases) and unaffected (controls) individuals, and a large number of SNPs are tested for association in an attempt to realistically identify such SNPs (2,3). Although only a theoretical perspective a decade ago (1), with the unprecedented advance in large-scale genotyping technologies (4–6), it has now become a realistic approach to exploring the genetic basis of human disease (7,8). In addition, recent efforts in the International HapMap Project to understand the genetic diversity among human populations (9,10) have greatly contributed to clarifying the extent to which the number of marker SNPs could be reduced to achieve given genome coverage, or how much genome coverage can be obtained with a given marker SNP set by optimally tagging untyped SNPs based on the linkage disequilibrium (LD) observed in the human genome (11–16).
Meanwhile, the major interest of most researchers who plan genetic association studies would be the practical success rates of such attempts and efficient study designs, rather than mere genome coverage (17,18), because an increase in genome coverage might not translate linearly into a gain in power (19,20). In addition, the more SNPs are genotyped to achieve better genome coverage, the higher the hurdle a target allele must clear to be detected.
This dilemma, known as the trade-off between increased genome coverage and the consequent inflation of null statistics due to extreme multiple testing, is a unique feature of genetic association studies, and is best described by considering the distributions of test statistics for markers truly associated with a causative allele (causal distribution) and for all other markers (null distribution) (21). Regardless of the properties of the causative SNP and whether one or more tagging strategies are used, the null distribution for a given marker set depends on its genome coverage in the study population. In particular, the null distribution with complete genome coverage is related to the overall diversity of the human genome and should substantially shift to the right (7,8,22). On the other hand, for a given disease model, the size of the test statistic expected for the causative SNPs is limited by the number of samples to be analyzed, once they are directly captured by one or more marker SNPs. After all, the feasibility of genome-wide association studies, or the required sample size to obtain realistic power, is determined by the overall diversity of the human genome, or given restricted study resources, the diversity of the human genome determines the property of disease-associated SNPs that can be detected with this approach.
Our questions are, therefore: how diverse is the human genome in view of conducting genome-wide association studies, how much power could be obtained to identify causative SNPs given that diversity, and how do the typical study parameters affect the power in that situation? To answer these questions, we need to evaluate both null and causal distributions in a quantitative manner. Because both distributions intrinsically depend on the LD structure within N (typically >10^5 to 10^6) interrelated marker SNPs and on the particular location of causative SNPs within the genome, they cannot be calculated algebraically, but need to be estimated from the observed data on human genome variation (10,21). We therefore approach these issues by extensively simulating a large number of case-control panels under both null and alternative scenarios based on the data from the International HapMap Consortium (9,10), and assess the feasibility and efficient designs of whole genome association studies by estimating the genome-wide power that would be obtained with this genetic approach under varying study conditions.
RESULTS
Estimation of null distributions of the maximum χ2 statistics
In considering the issue of multiple testing in genetic association studies, it is convenient to evaluate the maximum value of the χ2 statistic [max(χ2)] among all the marker SNPs that are truly unrelated to the causative SNP (21). Different statistics can be used (23–26), but the power calculated for this statistic, i.e. the probability that max(χ2) indicates a true association, provides a reasonable bottom line for discussing the feasibility of typical genetic association studies (21). When all N marker SNPs are independent, the null distribution for max(χ2) is given as
$$F_N(\chi^2) = \left[F(\chi^2)\right]^N,$$
where F(χ2) is the cumulative distribution function of the χ2 distribution (d.f. = 1). However, since SNPs in real marker sets are variably degenerated due to the presence of LD between adjacent SNPs, we empirically estimated the distribution of max(χ2) for a series of marker sets by simulating 10 000 null case-control panels, where each panel was generated by randomly resampling phased chromosomes from the HapMap data sets, and max(χ2) was calculated for each case-control panel. Although the number of resampled chromosomes for each case-control panel (i.e. the sample size) does not significantly affect the distributions (data not shown), there is some concern that the null distributions could be underestimated because of resampling from very limited numbers of chromosomes, since this procedure could restrict the freedom of allelic segregation within the same chromosome. To address this issue, we progressively divided the whole genome into larger numbers of sub-blocks consisting of 10 000 to 10 SNPs in the HapMap Phase II set, and resampled these sub-blocks to simulate distributions of max(χ2). These divisions, which reduce the mean block size down to 7.1 kb, allow for greater freedom of allelic segregation but do not significantly affect the max(χ2) distributions until the resampled block size becomes smaller than the mean LD length (27), indicating that our simulations are not likely to substantially underestimate the null distributions (Supplementary Material, Figure S1).
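As a rough illustration of the independent-SNP case described above, the following Python sketch computes the χ2 value at which the maximum of N independent 1-d.f. statistics reaches a given genome-wide P-value. The function names are hypothetical and chosen for this sketch. It uses only the fact that the cumulative distribution of the maximum of N independent statistics is the single-test distribution raised to the Nth power, and it does not reproduce the HapMap resampling used for the empirical distributions.

```python
import math

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 d.f."""
    return math.erfc(math.sqrt(x / 2.0))

def nominal_p(alpha, n_independent):
    """Per-test P-value at which the maximum of n independent tests
    exceeds the threshold with probability alpha."""
    return -math.expm1(math.log1p(-alpha) / n_independent)

def max_chi2_threshold(alpha, n_independent):
    """Chi-square value x solving 1 - F(x)^N = alpha, by bisection."""
    target = nominal_p(alpha, n_independent)
    lo, hi = 0.0, 500.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_sf_1df(mid) > target:
            lo = mid   # tail probability still too large: move right
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    # Thresholds for a genome-wide P of 0.05 at increasing numbers
    # of hypothetical independent SNPs.
    for n in (10_000, 100_000, 1_000_000):
        print(n, round(max_chi2_threshold(0.05, n), 2))
```

Because real marker sets are correlated through LD, thresholds obtained this way for the raw number of markers would be expected to overstate the empirical ones, which is the point made in the text.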
Figure 1A shows the simulated null distributions in the CEU panel for varying numbers of randomly selected SNPs (correlated SNP sets). The number of segregating or polymorphic markers contained in each random set is designated as Ns. The theoretical distribution for the same number (Ns) of independent SNPs, F_Ns(χ2), is also provided (Fig. 1B). The null distribution shifts toward larger values as the number of randomly selected SNP markers increases, and in a random 1000K set containing 681K segregating SNPs, the threshold χ2 value that provides a genome-wide P-value of 0.05 or 0.01 becomes as large as 27.6 or 30.5, respectively. On the other hand, reflecting the growing inter-marker LD intensity, the empirical distributions gradually deviate from the theoretical ones, F_Ns(χ2), for increasing Ns within the corresponding marker sets, underscoring the importance of considering inter-marker LD to avoid overestimation of the statistical threshold for multiple testing, especially at higher marker density.
Evaluation of the inter-marker LD
The intensity of the inter-marker LD in a given marker set is more simply evaluated by fitting the simulated distribution to the theoretical distribution for Nc independent markers, F_Nc(χ2) (see Methods). Irrespective of the marker set, the fit is good except in the vicinity of the maximal points (Supplementary Material, Figure S2). In particular, the distribution at extreme χ2 values is approximated well enough to provide a rough estimate of the nominal P-value for given genome-wide thresholds, as confirmed by the concordance of the upper p point in the simulated distribution with the upper p/Nc point in the χ2 distribution (d.f. = 1) (Bonferroni) (Table 1). In this formulation, it is reasonable to regard Nc as the number of hypothetical independent SNPs equivalent to the corresponding marker set: the null distribution for a large number of mutually degenerated SNPs is described by a single integer, and the mean intensity of the inter-marker LD is measured through the Nc/Ns ratio.
Nc values were calculated for a variety of randomly selected SNP marker sets and plotted against the number of segregating SNP markers therein (Fig. 1C). As the Phase II data contain most of the SNPs in commercially available platforms, including Affymetrix® GeneChip® and Illumina® HumanHap® arrays (28–30), Nc values were also evaluated for these platforms (Supplementary Material, Table S1). Note that the number of segregating SNP markers varies among different HapMap panels, even though the same number of SNPs is randomly selected for each panel (Supplementary Material, Figure S3). Figure 1C illustrates how the degree of degeneration within marker SNPs increases in different HapMap panels as more marker SNPs are selected. For example, 681K segregating SNPs within a random 1000K set in the CEU panel are equivalent to 290K independent SNPs, indicating that in this panel these SNPs are degenerated 2.3-fold. On the other hand, the degeneration in 1000K random markers is reduced to 1.8-fold for the YRI panel, as expected from the lower inter-marker LD for this panel compared to that of CEU.
The SNPs on the Affymetrix® GeneChip® mapping array sets are degenerated to the same degree as random SNP sets, reflecting the fact that the SNPs on GeneChip® platforms are virtually randomly selected. In contrast, the SNPs on the Illumina® HumanHap300 are selected by efficiently tagging the HapMap Phase I SNPs in CEU, in which redundant SNPs are effectively eliminated (28). As a result, degeneration in the HumanHap300 is substantially reduced compared to the corresponding random marker sets. In CEU, Nc for this 305.1K segregating SNP set (215K Nc) exceeds that for 417.8K segregating SNPs on GeneChip® 500K set (196K), as predicted by the higher genome coverage of the former set (see Table 1 and Supplementary Material, Figure S4). The tagging for CEU also increases the Nc in JPT+CHB, suggesting that tagging in one panel is also effective to a certain degree for another (31,32). The tagging seems to be less efficient in YRI, because the Nc value of HumanHap300® in YRI is less deviated from that of the random marker set with a corresponding Ns. In HumanHap550®, more tag SNPs are selected from YRI, which contributes to the relative increase in Nc for this marker set compared to that for the corresponding random marker SNP set.
Estimation of Nc for common SNPs in complete genome coverage
It is particularly interesting to calculate the Nc values for the ENCODE regions, in which human variations have been most densely explored. Currently 10 regions have been extensively genotyped in the ENCODE Project (http://www.hapmap.org/downloads/encode1.html.en), of which we used 7 regions that had been randomly chosen from the genome. A total of 7741, 9832 and 7396 SNPs segregate in these seven ENCODE regions, and they are equivalent to 1340 (5.8-fold), 2580 (3.8-fold) and 1460 (5.1-fold) hypothetical independent SNPs in the CEU, YRI and JPT+CHB panels, respectively. Assuming that the entire genome shows, on average, an LD intensity similar to that in the seven ENCODE regions, the Nc values for common SNPs in complete genome coverage (NcG) are roughly estimated to be 1971K (YRI), 1023K (CEU) and 1115K (JPT+CHB) (Table 2), although the values would be much more inflated if rare polymorphisms [minor allele frequency (MAF) <0.01], many of which could not be found in the HapMap panels, were taken into consideration. Nc/NcG could also be used as another indicator of the genome coverage of a given marker set.
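The fold-degeneration figures and the Nc/NcG coverage indicator quoted above are simple ratios. The short Python sketch below recomputes them from the numbers given in the text; the helper names are hypothetical, and the 290K value used in the coverage example is the Nc reported earlier for the random 1000K set in CEU.

```python
# Segregating SNPs (Ns) and independent-equivalent SNPs (Nc) in the seven
# ENCODE regions, as quoted in the text.
ENCODE_COUNTS = {
    "CEU": (7741, 1340),
    "YRI": (9832, 2580),
    "JPT+CHB": (7396, 1460),
}

def degeneration_fold(ns, nc):
    """How many segregating SNPs correspond to one independent SNP."""
    return ns / nc

def relative_coverage(nc, nc_genome):
    """Nc/NcG, used in the text as an indicator of genome coverage."""
    return nc / nc_genome

if __name__ == "__main__":
    for panel, (ns, nc) in ENCODE_COUNTS.items():
        print(panel, round(degeneration_fold(ns, nc), 1))
    # Random 1000K set in CEU: Nc of 290K against an NcG of 1023K.
    print(round(relative_coverage(290_000, 1_023_000), 2))
```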
Causal distribution of max(χ2)
In view of power estimation, our next interest was the expected size of causal distributions relative to that of the inflated null distributions under the varying disease and study parameters that affect the former. To illustrate this, we simulated causal distributions of max(χ2) for representative CEU alleles assumed to be causative (Fig. 2). Two thousand case-control panels were generated for each simulation, in which phased HapMap SNPs within 500 Kb around the causative locus were randomly resampled assuming a multiplicative model with varying genotype relative risks (GRRs), and max(χ2) was calculated for the resampled marker SNPs on GeneChip® 500K. Prevalence of the trait was set to 0.05. While the χ2 threshold for a genome-wide P of 0.05 could inflate from 19.9 for the random 10K set (6K Nc; semi-solid line) to as high as 29.8 for complete genome coverage (1023K NcG; dotted lines), these costs of multiple testing are acceptable when LD capture of the causative SNP by one or more markers with a high correlation coefficient (r2) can create large causal distributions with practical sample sizes (Fig. 2D–F), i.e. when the causal allele is common (MAF>0.2) and has a large GRR (>1.7) (Fig. 2A, D and G). In contrast, when a causal allele with a smaller MAF (<0.2) or with a modest to weak GRR (<1.5) is to be detected, the trade-off between the increased chance of capturing the allele with higher r2 using more markers and the accompanying cost of multiple testing can offset the power to varying degrees (Fig. 2A–C, G–I). The effect of collaborative capture, i.e. the probability of detecting an association by one of the multiple surrounding marker SNPs other than the SNP showing max(r2), creates a measurable gain in causal distributions and overall power, but does not essentially influence the above observations (Supplementary Material, Figure S5).
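To give a feel for how sample size, MAF and GRR move the causal distribution relative to a fixed threshold, the following Python sketch approximates the power of the allelic test at a single marker analytically. It is not the resampling procedure used in this study: it assumes Hardy-Weinberg proportions in the general population, treats the risk allele as the allele with frequency p, uses a large-sample noncentral χ2 approximation, and scales the noncentrality by r2 when the marker only tags the causal SNP. All function names are hypothetical.

```python
import math

def normal_cdf(z):
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def case_control_allele_freqs(p, grr, prevalence):
    """Risk-allele frequencies in cases and controls under a multiplicative
    model with Hardy-Weinberg proportions in the general population."""
    q = 1.0 - p
    f_base = prevalence / (grr * p + q) ** 2           # penetrance of aa
    pens = (grr * grr * f_base, grr * f_base, f_base)  # AA, Aa, aa
    freqs = (p * p, 2.0 * p * q, q * q)
    case = [f * g / prevalence for f, g in zip(pens, freqs)]
    ctrl = [(1.0 - f) * g / (1.0 - prevalence) for f, g in zip(pens, freqs)]
    return case[0] + 0.5 * case[1], ctrl[0] + 0.5 * ctrl[1]

def allelic_test_power(p, grr, prevalence, n_per_arm, chi2_threshold, r2=1.0):
    """Approximate probability that the 1-d.f. allelic chi-square at one
    marker exceeds chi2_threshold (assumption: noncentrality scales with r2)."""
    p_case, p_ctrl = case_control_allele_freqs(p, grr, prevalence)
    p_bar = 0.5 * (p_case + p_ctrl)
    # 2n alleles per arm, so the variance of the difference is p_bar(1-p_bar)/n
    ncp = r2 * n_per_arm * (p_case - p_ctrl) ** 2 / (p_bar * (1.0 - p_bar))
    root_ncp, root_t = math.sqrt(ncp), math.sqrt(chi2_threshold)
    return normal_cdf(root_ncp - root_t) + normal_cdf(-root_ncp - root_t)

if __name__ == "__main__":
    # Threshold 29.8 is the complete-coverage value quoted in the text.
    for grr in (1.3, 1.5, 1.7):
        for n in (1000, 2000, 4000):
            power = allelic_test_power(0.3, grr, 0.05, n, 29.8)
            print(f"GRR={grr} n/arm={n} power={power:.3f}")
```

Under these simplifying assumptions a directly typed common allele with a large GRR clears the complete-coverage threshold easily at 1000 samples per arm, whereas a weak GRR does not, which matches the qualitative pattern described above.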
Estimation of genome-wide power
Based on the above considerations, we estimated the genome-wide power in genetic association studies for common (MAF ≥0.05) causal alleles with weak to moderate genetic effects. To do this, assuming that all the common SNPs in the human genome are equally likely to be causative, we used two sets of SNPs, the RefENCODE and the RefPhase II 5 Kb sets (see Methods), as references that are regarded as random samples of all SNPs. For each putative causative SNP, we simulated case-control panels as described in the previous section, and calculated the single point power as the proportion of simulated panels whose max(χ2) exceeded a predetermined χ2 threshold corresponding to a genome-wide P = 0.01 or 0.05 for each marker set. For the genome-wide power, the single point powers were averaged over all common SNPs within the reference set. For the RefPhase II 5 Kb set, over-representation of direct association was adjusted based on the estimated genome coverage of the Phase II data set (see Methods). Figure 3 shows the genome-wide power in the CEU panel calculated for the RefPhase II 5 Kb set for moderate to small effect sizes (i.e. GRR ≤1.7) under various parameter values. The calculation on the RefENCODE set provides a largely equivalent estimation of the power (Supplementary Material, Figure S6), although the power is expected to be less reliable for smaller marker sets, reflecting their poor representation of the genome.
Under strong genetic effects (GRR ≥2.0) and large sample sizes (≥1500/arm), the power tends to saturate as the number of randomly selected SNPs increases (≥250K), because most of the common SNPs would already be captured by one or more marker SNPs with sufficient r2 (Supplementary Material, Figure S4), and the capture causes shifts of the causal distributions large enough that the cost of multiple testing is trivial (Fig. 2). On the other hand, when causative SNPs with weak to moderate genetic effects are sought with insufficient sample numbers, the causal distributions cannot exceed the large thresholds resulting from extreme multiple testing, even though more and more SNPs are captured by strong LD. With increasing effect size and sample number, genome coverage becomes less influential except for smaller numbers of marker SNPs (<250K). The power gain obtained with increased genome coverage tends to be offset by the increased cost of multiple testing. In short, in most scenarios genome coverage has little influence on power once ≥250K random markers or equivalent tag SNPs are used. In contrast, the effect of sample number is predominant. To detect weak genetic effects (GRR ≤1.3), the number of samples becomes critical. More than 4000 samples per arm will be required, but the requirement for genome coverage is not substantially increased when more than 250K randomly selected SNPs or their equivalents are used (Fig. 3A). Given a larger genetic effect, this dependence on sample size is dramatically reduced, but genome coverage remains less influential.
Power in different HapMap panels and in commercially available platforms
Power is significantly reduced in YRI compared to CEU and JPT+CHB for any marker set (Fig. 4A–C). The lower power in YRI is mainly due to the lower relative genome coverage of the marker set (Nc/NcG), rather than the higher cost of type I errors in this population.
The Illumina® HumanHap® series are commercially available platforms that incorporate the tagging theory, in which marker SNPs were selected to efficiently tag the CEU SNPs in the Phase I data set. Tagging seems to be effective, since HumanHap300® shows slightly higher power than GeneChip® 500K in CEU when evaluated on the RefPhase II 5 Kb set, although this estimate is slightly biased by the higher representation of the Phase I SNPs in the RefPhase II 5 Kb set (Fig. 4D). HumanHap300® shows power comparable to that of GeneChip® 500K, but its power is significantly reduced in YRI. In HumanHap550®, more tag SNPs from YRI and JPT+CHB were added to HumanHap300®; as a result, the power is improved more in YRI and in JPT+CHB, but it is also increased to a lesser degree in CEU, reflecting the transferability of tag SNPs between CEU and JPT+CHB. The power of various commercially available platforms with various sample sizes is shown in Figure 4E (adaptive threshold) and in Supplementary Material, Figure S7 (fixed threshold). Genome coverage and power of HumanHap550® in CEU are comparable to those of the random 1000K set (Supplementary Material, Figure S4), an equivalent of the Human SNP Array 6.0® planned by Affymetrix® (Fig. 4E). Nevertheless, and in spite of the significant difference in cost, the gain in power with HumanHap550® is not prominent. Also note that the power calculation for HumanHap550® could be slightly biased by using a subset of the Phase II SNPs as a reference.
Power depends on allele frequencies of causative alleles
Power strongly depends on the MAF of causative alleles, and detecting rare causative alleles is very difficult (Fig. 2) (8,20) for two reasons. First, rare variants are difficult to capture with high r2 values. With currently available platforms (GeneChip® 500K or HumanHap550®), most SNPs with a MAF of more than 0.10 are captured with high r2 and could be detected with high power given moderate GRRs (1.5) and sample size (1000/arm) (Fig. 5). In contrast, capturing rare causal SNPs (MAF<0.10) requires many more marker SNPs or their combinations than capturing common SNPs, at a greater cost of multiple-hypothesis testing. Second, even when captured with high r2 by one or more marker SNPs, associations with these rare SNPs are more difficult to detect than those with common SNPs (Fig. 5). In common diseases, the existence of multiple phenocopy variants would further compromise detection (multiple rare variants) (33,34). Thus, regardless of genome coverage, power is consistently lower for less common SNPs (Fig. 6A and C). To detect rare causative SNPs, we need not only to invest in genotyping large numbers of marker SNPs with low MAF values by any means, but also to increase the sample size (Fig. 6B and C).
Discussion
Through the current analysis, we empirically determined the size of test statistics for causal as well as null markers under varying degrees of genome coverage and realistic study parameters, and thereby demonstrated how genome-wide power is affected by the interplay between genome coverage and other determinants. Here it is appropriate to compare the performance [power (1–β) or sensitivity] of the different SNP sets with their specificity (or 1–α) held constant by applying adaptive thresholds, where α denotes the genome-wide type I error probability. In addition, the power calculated in this way is directly related to the false positive report probability (FPRP), which is simply expressed as 1/[1+(1–β)/α] and is approximately extended to 1/[1+m(1–β)/α] assuming a total of m independent causative loci having the same effect size. Note that α is a constant for all SNP sets, i.e. 0.05 or 0.01. From our simulations, readers can therefore readily evaluate the power and FPRP expected from a given SNP set, sample size and predicted effect size. As long as practical power (for example, 1–β > α) is obtained, the FPRP is expected to be less than 0.5, which will be satisfactory for initial discovery studies.
We estimated genome-wide thresholds based on simulations using small numbers of HapMap chromosomes. In real studies, the threshold should be determined from the study's own data sets, where diploid, rather than phased, chromosomes could be used when enough samples are analyzed. A larger number of chromosomes should contain more rare segregating SNPs, but these rare SNPs would not increase χ2 thresholds substantially (22).
In terms of the effective number of independent SNPs (Nc) in various marker sets, the diversity of the human genome is likely to be on the order of 1000K in CEU, and the corresponding nominal P-value giving a genome-wide α error of 0.05 is 5 × 10^-8. For moderate GRRs (1.5), this threshold could be overcome with 1500 samples per arm for very common SNPs (MAF>0.20), but for less common SNPs or those with a small genetic effect (GRR=1.1–1.2), extremely large numbers of samples will be required (Supplementary Material, Figure S8). This argues for sharing typing data across multiple groups, as exemplified in recent reports that identified predisposing factors with very modest genetic effects for type 2 diabetes (35–37). The diversity of our genome may not allow for detecting very rare causative alleles (<0.01) with even smaller genetic effects (i.e. GRR<1.1) using this approach (Fig. 6D).
Under these limitations, several issues should be considered to efficiently exploit study resources and to increase the chance of finding a true association. First, for increased genome coverage to be effectively translated into power, it needs to be accompanied by a corresponding increase in sample size. When sample numbers are small relative to the effect size, the cost of multiple testing largely offsets the expected increase in the test statistics for causal alleles, leaving no measurable gain in power, and can even exceed the gain in causal distributions (Fig. 4). Increasing genome coverage with insufficient sample sizes would only consume resources with no substantial benefit in power. In addition, power tends to saturate at higher genome coverage, and the effect of increasing the number of marker SNPs is less prominent than that of increasing sample size. In most simulated situations, more power is expected from doubling the sample size than from doubling the number of marker SNPs. For example, our simulations predict that doubling the sample size using GeneChip® Nsp 250K is almost certainly more efficient than analyzing half of the samples with both Nsp 250K+Sty 250K (Supplementary Material, Figure S9).
The tagging strategy or statistical imputation is effective for increasing genome coverage with limited numbers of marker SNPs (21,38,39), although it does not save the cost of multiple-hypothesis testing. The efficiency of generating a tag SNP set with higher genome coverage, however, is increasingly compromised. The additional gain in power becomes smaller with increasing genome coverage, while more and more effort will be required to find additional independent tag SNPs, because many SNPs are already captured by existing tag SNPs. In addition, we simulated power using the All Phase II set. In the sense that all references are captured through direct association, this marker set provides the ultimate coverage of the genome. Considering the modest increase in power with the All Phase II set compared with the random 1000K set (Fig. 3), multimarker tagging may not raise the power profoundly. Transferability of a tag SNP set from one population to another is also a problem. Tag SNPs for CEU are transferable to a certain degree to JPT+CHB, but they are less effective for YRI.
In all simulated scenarios, detecting SNPs with lower MAF values (0.05–0.10) is very difficult using whole genome approaches, and this is especially true for SNPs with a MAF of less than 0.05. In this situation, genome coverage sufficient to capture these rare SNPs becomes important, but the required increase in sample size is greater for rare SNPs than for common ones. Efforts to devise SNP sets for these rare alleles, or exhaustive multimarker tests (21,38), are not likely to be rewarding unless the genetic effects are substantially large.
MATERIALS AND METHODS
HapMap data sets
The phased genotyping data of the HapMap Phase II (release 21) were obtained from the International HapMap Project web site (http://www.hapmap.org/downloads/phasing/2006-07_Phase II/) (10). It includes the data from 60 CEU parents (120 chromosomes), 60 YRI parents (120 chromosomes) and the combined set of 45 JPT and 45 CHB unrelated individuals (180 chromosomes), and is provided in three discrete sets (all, consensus, and phased), of which we used the former two sets for analysis. The all set contains the comprehensive data of all SNPs genotyped in each population including non-segregating sites, and the consensus set consists of the intersection of all sets from the three population panels. The all sets contain 3 755 469, 3 685 205 and 3 776 850 SNPs for CEU, YRI and JPT+CHB, respectively, and the consensus set includes 3 535 396 SNPs.
Marker sets and the references for power calculation
We generated a series of marker sets consisting of 10K, 30K, 50K, 125K, 250K, 500K and 1000K SNPs, by randomly selecting SNPs from the Phase II all sets for each HapMap panel. The number of segregating SNPs in each set is denoted as Ns and shown in Table 1 for the CEU panel. Because the Phase II all set contains most of the SNPs on commercially available platforms, including Affymetrix® GeneChip® 500K (Nsp+Sty), 250K (Nsp), 100K (Hind+Xba), Illumina® HumanHap300® and HumanHap550® (Supplementary Material, Table S1), the SNPs shared between each platform and the Phase II all set were incorporated into the analysis as representative SNPs of that commercial set. Annotation files for SNPs on the GeneChip® series are available from the Affymetrix® web site (http://www.affymetrix.com/products/application/whole_genome.affx). The SNP information of the HumanHap® series was kindly provided by Illumina® Inc. A subset of the Phase II SNPs, referred to as 'RefPhase II 5 Kb', was constructed and used as a reference in the calculation of genome-wide power by randomly selecting SNPs from the consensus set so that each SNP is, on average, 5 Kb apart from the adjacent SNPs. Combined SNPs from the 10 ENCODE regions, denoted as RefENCODE, were used as an alternative reference set. Only common SNPs (MAF ≥0.05) were included in the power calculations as putative causal alleles.
Simulation of case-control panels under the null hypothesis and fitting simulated distributions
Null distributions in genetic association studies are considered for only vaguely defined ensembles having limited population sizes, e.g. all adult Japanese eligible for a study. To obtain asymptotic distributions, we generated 10 000 null case-control panels by randomly resampling phased autosomal chromosomes from the all set of CEU, YRI and JPT+CHB. Simulations were performed with different sample numbers, i.e. 500, 750, 1000, 1500, 2000 and 4000 per single arm. For each case-control panel, the maximum χ2 value [max(χ2); d.f. = 1] in the standard allele test was calculated for different marker sets to obtain empirical null distributions of max(χ2).
The simulated distributions of max(χ2) were fitted to the null distribution for Nc hypothetical independent SNPs, F_Nc(χ2), by the least squares method, with Nc as the fitted parameter.
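The following Python sketch illustrates one way such a fit could be carried out. The exact objective used by the authors is not shown here, so the sketch assumes an unweighted sum of squared differences between the empirical cumulative distribution of the simulated max(χ2) values and F(x)^Nc, minimized over Nc on a logarithmic scale. The function names, the search range of 1 to 10^8 and the toy data generator are all assumptions made for illustration.

```python
import math
import random

def fit_nc(max_chi2_values):
    """Least-squares estimate of Nc such that F(x)^Nc tracks the empirical
    CDF of the supplied max chi-square values (all values must be > 0)."""
    xs = sorted(max_chi2_values)
    n = len(xs)
    ecdf = [(i + 0.5) / n for i in range(n)]   # mid-point plotting positions
    # log F(x) for the 1-d.f. chi-square distribution, computed stably
    log_f = [math.log1p(-math.erfc(math.sqrt(x / 2.0))) for x in xs]

    def sse(log10_nc):
        nc = 10.0 ** log10_nc
        return sum((e - math.exp(nc * lf)) ** 2 for e, lf in zip(ecdf, log_f))

    # Coarse grid over log10(Nc) in [0, 8], then golden-section refinement.
    best = min((i * 0.05 for i in range(161)), key=sse)
    lo, hi = max(0.0, best - 0.05), best + 0.05
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(60):
        a = hi - ratio * (hi - lo)
        b = lo + ratio * (hi - lo)
        if sse(a) < sse(b):
            hi = b
        else:
            lo = a
    return 10.0 ** (0.5 * (lo + hi))

def simulate_independent_max(n_snps, n_panels, rng):
    """Toy null data: max of n_snps independent 1-d.f. chi-square statistics."""
    return [max(rng.gauss(0.0, 1.0) ** 2 for _ in range(n_snps))
            for _ in range(n_panels)]

if __name__ == "__main__":
    rng = random.Random(1)
    values = simulate_independent_max(200, 1000, rng)
    print(round(fit_nc(values)))   # should land near 200 for independent SNPs
```

For truly independent markers the fitted Nc should recover the number of markers; for real marker sets in LD it would be expected to fall below Ns, which is how the Nc/Ns ratio is used in the Results.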
Simulation of case-control studies and calculation of power
We consider multiplicative disease models showing a prevalence e, and assume a single causative allele whose MAF and GRR are P (
0.05) and
, respectively. Given the penetrance for AA, Aa and aa genotypes as fAA, fAa, and faa, respectively, expected genotype frequencies in the case and control panels are given as,
|
|
|
|
[...] max(χ²) of the marker SNPs for each panel, where the calculations were performed only for those marker SNPs that are within 500 Kb of the putative causal SNP. The proportion of simulated case-control panels whose max(χ²) exceeded the upper 95 or 99% point of the corresponding null distribution for that marker set was defined as the power. The genome-wide power was computed by averaging the power over all SNPs within the reference set. As the number of marker SNPs increases, up to as high as 1000K, there is a considerable chance of detecting direct associations, i.e. cases in which the causative SNP itself is included in the marker set. Assuming 7500K common SNPs within the human genome (17), the Phase II data set includes about one-fourth (2167K common SNPs in CEU) of all the common SNPs. Based on this estimate, we excluded three-fourths of the direct associations from the calculation of genome-wide power to avoid overestimating the chance of direct association. This adjustment, however, has little influence on the results. The correction was not applied to the power calculation on the RefENCODE set, because that set represents nearly complete data for those regions.
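The definition of power given above can be written as a short Python sketch. The function names are hypothetical, and the way the empirical upper point is read off the sorted null maxima is one simple convention among several. The adjustment for direct associations is not modelled here.

```python
def empirical_threshold(null_maxima, level=0.95):
    """Upper `level` point of the simulated null distribution of max chi-square."""
    ordered = sorted(null_maxima)
    index = min(len(ordered) - 1, int(level * len(ordered)))
    return ordered[index]


def power(alternative_maxima, threshold):
    """Proportion of simulated case-control panels whose maximum exceeds the threshold."""
    hits = sum(1 for x in alternative_maxima if x > threshold)
    return hits / len(alternative_maxima)


def genome_wide_power(per_snp_powers):
    """Average of the per-SNP powers over all causal SNPs in the reference set."""
    return sum(per_snp_powers) / len(per_snp_powers)
```

Here `null_maxima` would come from panels simulated under the null hypothesis for a given marker set and sample size, and `alternative_maxima` from panels simulated for one putative causal SNP, restricted to markers within 500 Kb of it.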
Computational resources
All simulations were run on the GXP clustering computer system in the Department of Information and Communication Engineering, Graduate School of Information Science, University of Tokyo.
| SUPPLEMENTARY MATERIAL |
|---|
Supplementary Material is available at HMG Online.
| FUNDING |
|---|
This work was supported by Research on Measures for Intractable Diseases, Health and Labor Sciences Research Grants, Ministry of Health, Labor and Welfare; Research on Health Sciences focusing on Drug Innovation, the Japan Health Sciences Foundation; and Core Research for Evolutional Science and Technology (CREST), Japan Science and Technology Agency.
| ACKNOWLEDGEMENTS |
|---|
This work is deeply indebted to the achievements of the International HapMap Consortium, and we thank all the people who participated in the project. We also thank Jun Ohashi for helpful discussions.
Conflict of Interest statement. None declared.
| REFERENCES |
|---|
1. Risch N., Merikangas K. The future of genetic studies of complex human diseases. Science (1996) 273:1516–1517.
2. Kruglyak L. Prospects for whole-genome linkage disequilibrium mapping of common disease genes. Nat. Genet. (1999) 22:139–144.
3. Risch N.J. Searching for genetic determinants in the new millennium. Nature (2000) 405:847–856.
4. Syvanen A.C. Accessing genetic variation: genotyping single nucleotide polymorphisms. Nat. Rev. Genet. (2001) 2:930–942.
5. Kennedy G.C., Matsuzaki H., Dong S., Liu W.M., Huang J., Liu G., Su X., Cao M., Chen W., Zhang J., et al. Large-scale genotyping of complex DNA. Nat. Biotechnol. (2003) 21:1233–1237.
6. Fan J.B., Chee M.S., Gunderson K.L. Highly parallel genomic assays. Nat. Rev. Genet. (2006) 7:632–644.
7. Hirschhorn J.N., Daly M.J. Genome-wide association studies for common diseases and complex traits. Nat. Rev. Genet. (2005) 6:95–108.
8. Wang W.Y., Barratt B.J., Clayton D.G., Todd J.A. Genome-wide association studies: theoretical and practical concerns. Nat. Rev. Genet. (2005) 6:109–118.
9. The International HapMap Consortium. The International HapMap Project. Nature (2003) 426:789–796.
10. The International HapMap Consortium. A haplotype map of the human genome. Nature (2005) 437:1299–1320.
11. Johnson G.C., Esposito L., Barratt B.J., Smith A.N., Heward J., Di Genova G., Ueda H., Cordell H.J., Eaves I.A., Dudbridge F., et al. Haplotype tagging for the identification of common disease genes. Nat. Genet. (2001) 29:233–237.
12. Gabriel S.B., Schaffner S.F., Nguyen H., Moore J.M., Roy J., Blumenstiel B., Higgins J., DeFelice M., Lochner A., Faggart M., et al. The structure of haplotype blocks in the human genome. Science (2002) 296:2225–2229.
13. Carlson C.S., Eberle M.A., Rieder M.J., Yi Q., Kruglyak L., Nickerson D.A. Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. Am. J. Hum. Genet. (2004) 74:106–120.
14. Halldorsson B.V., Istrail S., De La Vega F.M. Optimal selection of SNP markers for disease association studies. Hum. Hered. (2004) 58:190–202.
15. Zhang K., Qin Z., Chen T., Liu J.S., Waterman M.S., Sun F. HapBlock: haplotype block partitioning and tag SNP selection software using a set of dynamic programming algorithms. Bioinformatics (2005) 21:131–134.
16. Ao S.I., Yip K., Ng M., Cheung D., Fong P.Y., Melhado I., Sham P.C. CLUSTAG: hierarchical clustering and graph methods for selecting tag SNPs. Bioinformatics (2005) 21:1735–1736.
17. Barrett J.C., Cardon L.R. Evaluating coverage of genome-wide association studies. Nat. Genet. (2006) 38:659–662.
18. Peer I., de Bakker P.I., Maller J., Yelensky R., Altshuler D., Daly M.J. Evaluating and improving power in whole-genome association studies using fixed marker sets. Nat. Genet. (2006) 38:663–667.
19. Ohashi J., Tokunaga K. The power of genome-wide association studies of complex disease genes: statistical limitations of indirect approaches using SNP markers. J. Hum. Genet. (2001) 46:478–482.
20. Zondervan K.T., Cardon L.R. The complex interplay among factors that influence allelic association. Nat. Rev. Genet. (2004) 5:89–100.
21. de Bakker P.I., Yelensky R., Peer I., Gabriel S.B., Daly M.J., Altshuler D. Efficiency and power in genetic association studies. Nat. Genet. (2005) 37:1217–1223.
22. Neale B.M., Sham P.C. The future of association studies: gene-based analysis and replication. Am. J. Hum. Genet. (2004) 75:353–362.
23. Dudbridge F., Koeleman B.P. Rank truncated product of P-values, with application to genomewide association scans. Genet. Epidemiol. (2003) 25:360–366.
24. Hoh J., Ott J. Mathematical multi-locus approaches to localizing complex human trait genes. Nat. Rev. Genet. (2003) 4:701–709.
25. Hoh J., Wille A., Ott J. Trimming, weighting, and grouping SNPs in human case-control association studies. Genome Res. (2001) 11:2115–2119.
26. Zaykin D.V., Zhivotovsky L.A., Westfall P.H., Weir B.S. Truncated product method for combining P-values. Genet. Epidemiol. (2002) 22:170–185.
27. De La Vega F.M., Isaac H., Collins A., Scafe C.R., Halldorsson B.V., Su X., Lippert R.A., Wang Y., Laig-Webster M., Koehler R.T., et al. The linkage disequilibrium maps of three human chromosomes across four populations reflect their demographic history and a common underlying recombination pattern. Genome Res. (2005) 15:454–462.
28. Gunderson K.L., Steemers F.J., Lee G., Mendoza L.G., Chee M.S. A genome-wide scalable SNP genotyping assay using microarray technology. Nat. Genet. (2005) 37:549–554.
29. Matsuzaki H., Dong S., Loi H., Di X., Liu G., Hubbell E., Law J., Berntsen T., Chadha M., Hui H., et al. Genotyping over 100,000 SNPs on a pair of oligonucleotide arrays. Nat. Methods (2004) 1:109–111.
30. Steemers F.J., Chang W., Lee G., Barker D.L., Shen R., Gunderson K.L. Whole-genome genotyping with the single-base extension assay. Nat. Methods (2006) 3:31–33.
31. Tenesa A., Dunlop M.G. Validity of tagging SNPs across populations for association studies. Eur. J. Hum. Genet. (2006) 14:357–363.
32. de Bakker P.I., Burtt N.P., Graham R.R., Guiducci C., Yelensky R., Drake J.A., Bersaglieri T., Penney K.L., Butler J., Young S., et al. Transferability of tag SNPs in genetic association studies in multiple populations. Nat. Genet. (2006) 38:1298–1303.
33. Pritchard J.K. Are rare variants responsible for susceptibility to complex diseases? Am. J. Hum. Genet. (2001) 69:124–137.
34. Slager S.L., Huang J., Vieland V.J. Effect of allelic heterogeneity on the power of the transmission disequilibrium test. Genet. Epidemiol. (2000) 18:143–156.
35. Scott L.J., Mohlke K.L., Bonnycastle L.L., Willer C.J., Li Y., Duren W.L., Erdos M.R., Stringham H.M., Chines P.S., Jackson A.U., et al. A genome-wide association study of type 2 diabetes in Finns detects multiple susceptibility variants. Science (2007) 316:1341–1345.
36. Saxena R., Voight B.F., Lyssenko V., Burtt N.P., de Bakker P.I., Chen H., Roix J.J., Kathiresan S., Hirschhorn J.N., Daly M.J., et al. Genome-wide association analysis identifies loci for type 2 diabetes and triglyceride levels. Science (2007) 316:1331–1336.
37. Zeggini E., Weedon M.N., Lindgren C.M., Frayling T.M., Elliott K.S., Lango H., Timpson N.J., Perry J.R., Rayner N.W., Freathy R.M., et al. Replication of genome-wide association signals in UK samples reveals risk loci for type 2 diabetes. Science (2007) 316:1336–1341.
38. Lin S., Chakravarti A., Cutler D.J. Exhaustive allelic transmission disequilibrium tests as a new approach to genome-wide association studies. Nat. Genet. (2004) 36:1181–1188.
39. Weale M.E., Depondt C., Macdonald S.J., Smith A., Lai P.S., Shorvon S.D., Wood N.W., Goldstein D.B. Selection and evaluation of tagging SNPs in the neuronal-sodium-channel gene SCN1A: implications for linkage-disequilibrium gene mapping. Am. J. Hum. Genet. (2003) 73:551–565.