Human Molecular Genetics Advance Access originally published online on July 21, 2005
Human Molecular Genetics 2005 14(17):2485-2488; doi:10.1093/hmg/ddi252
COMMENTARY
Guidelines for conducting and reporting whole genome/large-scale association studies
1Design and Standards, GlaxoSmithKline, Research Triangle Park, NC, USA and 2Portfolio Genetics, GlaxoSmithKline, Stevenage, Herts, UK
* To whom correspondence should be addressed. Email: nigel_k_spurr{at}gsk.com
Received June 30, 2005; Accepted July 11, 2005
Introduction
Many studies have tested for association with a trait, using a few to hundreds, and in some cases thousands, of genetic variants. However, very few of the reported genetic associations for complex traits have been confirmed when tested in independent sample collections (1
We define a whole genome association (WGA) study as the testing of genetic variants located densely throughout the genome for correlation with a trait of interest. WGA scans can take several forms, including testing for association with putatively causal non-synonymous coding variants or with tagging variants. Tagging variants may be selected on the basis of even spacing or spaced according to patterns of linkage disequilibrium. In addition, the tagging variants can be limited to gene coding regions or spread throughout the entire coding and non-coding genome (9-12). Technological approaches include pooling and individual genotyping using a range of platforms (13-15). Patient samples may include multiple samples of nuclear families, cases and controls or other association designs. Several WGA studies have been performed on samples collected in small, isolated populations genotyped for thousands of STR markers (16-20). With the rapid growth of inexpensive, high-throughput genotyping, studies within large, non-isolate populations are beginning to emerge (6,21,22).
In contrast to DNA microarray analysis experiments, which are only meaningful in the context of the conditions under which they were generated, including the state of the system under study, the results of WGA studies are dependent on some study-specific aspects and independent of others. WGA studies are performed on germline DNA with genotyping methods producing objective measures that can be repeated with diverse technologies with identical results. In addition, it is already common practice to combine genotype data obtained for individuals by diverse technologies and laboratories once quality control measures and study design have been determined (23). Challenges to combining and merging data come with pooled DNA sample designs, where interpretation is reliant on the technology used and on estimates of the diverse sources of error. Interpreting different study designs and approaches probably presents the greatest challenge to the genetics community when discussing results of WGA studies. Guidelines ensuring that the information needed to understand and interpret the data and results is included in publications will allow researchers to capitalize most effectively on the work done.
The decision to carry out a WGA study requires a substantial commitment of time and resources at many levels. It is therefore important that the study be developed in a way to maximize the likelihood of success. Common weaknesses that have plagued many association studies include study design failing to adequately identify true positives while eliminating false positives, poorly defined phenotypes and sampling from heterogeneous patient populations, inappropriately matched controls, small sample sizes relative to the magnitude of the genetic effects sought, failure to account for multiple testing, population and sample stratification, measuring the wrong SNPs, failure to replicate marginal findings and overoptimistic interpretation of study results. We discuss the information needed to adequately describe the study design, samples used, polymorphisms typed, analysis performed and interpretations made.
The goals of a study influence almost every aspect of the experimentation. Example goals for a WGA study include identifying coding variants associated with Type 2 diabetes or identifying common variants associated with asthma. Clearly defined study goals will help guide the selection of phenotypic and genetic measures, the study samples, statistical methods and the kinds of inferences that may be drawn.
A variety of study designs are amenable to WGA studies. The most cost-efficient is the case-control design, selecting one group of unrelated cases and a contrasting group of unrelated controls. The selection of cases and appropriate controls should be given careful consideration. Studies of quantitative traits, such as degree of drug response or circulating biomarker levels, can be carried out on all subjects with valid measures or be restricted to contrasting subjects selected from the tails of the trait distribution (24). In addition to the case-control design, alternatives include multiplex case-control (multiple cases from one pedigree), family-based case-control (such as parent-offspring trios) and prospective cohort designs (25).
The replication of positive results is critical to ensuring that variants identified are truly correlated with a trait (26). Therefore, study designs should include several stages of experimentation, including a replication set of samples, designed to eliminate false positive findings and highlight true findings (27). The study design should include descriptions of the various rounds of experimentation along with how results will be interpreted at each stage. This information will be critical in estimating the power of the study to identify genetic effects of different sizes as well as the importance to place on the final results obtained.
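To illustrate how a staged design removes false positives, the following minimal Python sketch counts the markers expected to survive each stage when no marker is truly associated. It is not taken from the cited studies. It assumes that every stage uses an independent sample, that only markers passing one stage are tested in the next, and that the marker count and per-stage significance levels are hypothetical values chosen for illustration.

```python
from typing import List, Sequence


def expected_null_survivors(n_markers: int, stage_alphas: Sequence[float]) -> List[float]:
    """Expected number of null markers still significant after each stage.

    Assumes independent samples at every stage, so a null marker passes a
    stage with probability equal to that stage's test-wise significance level.
    """
    if n_markers < 0:
        raise ValueError("n_markers must be non-negative")
    remaining = float(n_markers)
    survivors = []
    for alpha in stage_alphas:
        if not 0.0 < alpha <= 1.0:
            raise ValueError("each significance level must lie in (0, 1]")
        remaining *= alpha  # only markers that passed earlier stages are retested
        survivors.append(remaining)
    return survivors


if __name__ == "__main__":
    # Hypothetical scan: 100,000 markers, a screening stage and a replication stage.
    for stage, count in enumerate(expected_null_survivors(100_000, [0.01, 0.01]), start=1):
        print(f"stage {stage}: about {count:.0f} false positives expected")
```

The same significance levels also determine how often a true effect survives every stage, which is why the thresholds chosen for each round need to be reported alongside the results.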
Once the study goals are understood, it is important to consider the range of phenotype definitions that could address those goals. Many diseases or other traits of interest may represent a very broad range of biological phenomena that may be influenced by an equally diverse range of genetic risk factors. For example, a breast cancer study would need to consider whether to include all breast cancers as cases, or restrict the cases to one particular histological type, such as ductal carcinoma. In general, studies with more precisely defined phenotypes will have greater chances of success and will increase the likelihood that the results will be replicated. The genetic contribution to particular phenotypic subgroups should also be considered. Many diseases have subtypes that are suspected of being under varying degrees of environmental influence. This is also true of various patient subpopulations, considered subsequently.
Most studies will rely on abundant single nucleotide polymorphisms (SNPs) to scan a whole genome. To adequately describe the variation studied throughout the genome, it is important to describe how the polymorphic markers were selected. Characteristics such as the allele frequencies of the variants (28), the physical distance between markers and a description of how these markers may capture information from correlated markers are important for determining the comprehensiveness of the scan. The manner in which the markers were selected for the scan may be described as optimizing one of the earlier measures, ease and cost of genotyping, or perhaps other criteria such as markers within genes or non-synonymous SNPs. In addition to the descriptive polymorphism information, sufficient sequence information must be available for each polymorphism so that its genotyping can be repeated with other technologies. Owing to the high costs associated with developing a WGA marker panel, the selection of a panel will generally be coupled with the selection of the genotyping technology on which it was developed.
The samples to be selected for a WGA study need to be described with respect to several aspects. The issues that are important to address include population history and origin, the average linkage disequilibrium between markers (i.e. suitability of available marker panels to the population of inference), the possibility of population stratification, the availability of samples for replicating results found, the ascertainment scheme used and the power that can be achieved assuming effect levels on the basis of these assumptions and the study design used. These facets of a study determine the suitability of the design to achieve adequate power to ensure effects of relevant sizes are detectable. They also provide information that is relevant to combining data from diverse sources.
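As a rough illustration of the power consideration, the sketch below approximates the power of a two-sided allele-frequency comparison between unrelated cases and controls. It is a simplification and not a method prescribed by these guidelines. It assumes a normal approximation to the two-proportion test, two independent alleles per subject, and hypothetical allele frequencies, sample sizes and significance level.

```python
from math import sqrt
from statistics import NormalDist


def allelic_test_power(p_case: float, p_control: float,
                       n_cases: int, n_controls: int, alpha: float) -> float:
    """Approximate power of a two-sided two-proportion z-test on allele counts."""
    if not (0.0 < p_case < 1.0 and 0.0 < p_control < 1.0):
        raise ValueError("allele frequencies must lie strictly between 0 and 1")
    if n_cases <= 0 or n_controls <= 0 or not 0.0 < alpha < 1.0:
        raise ValueError("sample sizes must be positive and alpha must lie in (0, 1)")
    alleles_case = 2 * n_cases        # each subject contributes two alleles
    alleles_control = 2 * n_controls
    std_normal = NormalDist()
    z_crit = -std_normal.inv_cdf(alpha / 2)  # two-sided critical value
    pooled = (alleles_case * p_case + alleles_control * p_control) / (alleles_case + alleles_control)
    # Standard error under the null (pooled frequency) and under the alternative.
    se_null = sqrt(pooled * (1 - pooled) * (1 / alleles_case + 1 / alleles_control))
    se_alt = sqrt(p_case * (1 - p_case) / alleles_case
                  + p_control * (1 - p_control) / alleles_control)
    diff = abs(p_case - p_control)
    upper = std_normal.cdf((diff - z_crit * se_null) / se_alt)
    lower = std_normal.cdf((-diff - z_crit * se_null) / se_alt)
    return upper + lower


if __name__ == "__main__":
    # Hypothetical effect: allele frequency 0.35 in cases versus 0.30 in controls,
    # tested at a stringent level suited to a scan with many markers.
    for n in (500, 1000, 2000, 4000):
        print(n, round(allelic_test_power(0.35, 0.30, n, n, alpha=1e-6), 3))
```

Running the loop with different sample sizes shows how quickly power falls when a stringent significance level is combined with a modest difference in allele frequency.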
A WGA scan is likely to require the genotyping of a large number of polymorphic markers. To ensure that the data are of the highest quality, studies should include a description of the genotyping methods, quality control procedures and quality control metrics, including the percentage of missing data and the duplicate error rate. The number of data points in a whole genome screen can be quite large, so a substantial number of errors can arise even when error rates are low. The potential for confounding due to population and/or sample stratification should also be assessed and accounted for in subsequent association analyses. There are several methods available for testing for stratification, and one or more of these should be included in the quality control of any such experiment.
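The two quality control metrics named above can be computed in a few lines. The following Python sketch is illustrative only. It assumes genotype calls stored as strings with `None` marking a failed call, and it defines the duplicate error rate as the fraction of discordant calls among duplicate pairs in which both calls succeeded. The function names, data layout and example numbers are hypothetical.

```python
from typing import Optional, Sequence

Genotype = Optional[str]  # e.g. "AA", "AG", "GG"; None marks a failed call


def missing_rate(calls: Sequence[Genotype]) -> float:
    """Fraction of attempted genotypes that produced no call."""
    if not calls:
        raise ValueError("no genotype calls supplied")
    return sum(call is None for call in calls) / len(calls)


def duplicate_error_rate(first: Sequence[Genotype], second: Sequence[Genotype]) -> float:
    """Fraction of discordant calls among duplicate pairs where both calls succeeded."""
    if len(first) != len(second):
        raise ValueError("duplicate runs must have the same length")
    compared = [(a, b) for a, b in zip(first, second) if a is not None and b is not None]
    if not compared:
        raise ValueError("no duplicate pair has two successful calls")
    return sum(a != b for a, b in compared) / len(compared)


def expected_miscalls(n_samples: int, n_markers: int, error_rate: float) -> float:
    """Expected number of erroneous genotypes across the whole scan."""
    return n_samples * n_markers * error_rate


if __name__ == "__main__":
    run_a = ["AA", "AG", None, "GG"]
    run_b = ["AA", "GG", "AG", "GG"]
    print("missing rate:", missing_rate(run_a))
    print("duplicate error rate:", duplicate_error_rate(run_a, run_b))
    # Hypothetical scan: 1,000 samples, 100,000 markers, 0.1% genotyping error.
    print("expected miscalls:", expected_miscalls(1_000, 100_000, 0.001))
```

The last function makes the point about scale concrete: even a small per-genotype error rate translates into many erroneous data points when multiplied across every sample and marker.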
Analysis of WGA scans will use many of the statistical tools available in more focused association studies, but the interpretation of the results will differ due to the large number of tests being performed. Again, as tens or hundreds of thousands of markers may be typed for a WGA study, large numbers of statistically significant results would be observed using traditional test-wise criteria, the vast majority of which would be due to sampling variability (e.g. about 5% of all markers would have P-values <0.05 even if none were truly associated). Therefore, the analysis plan must lay out a strategy for identifying true genetic risk factors. This should include replication in a suitable, independent sample, as well as the choice of more conservative significance levels (at the cost of reduced power) and/or biological/experimental validation. Moreover, when results are reported, information estimating the expected number of false positives, or the probability of obtaining the reported results by chance, is important for putting the results into perspective with other investigations. To ensure that studies will be comparable and that researchers will be able to interpret the analyses done, the analysis methods should include: descriptive analyses including the distribution of Hardy-Weinberg equilibrium statistics, descriptions of the methods used with references to how the methods were validated, the number of tests performed and the significance levels used to interpret results.
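The arithmetic behind this concern is simple enough to sketch. The Python fragment below computes the number of markers expected to pass a test-wise threshold when no marker is truly associated, together with a Bonferroni-adjusted threshold as one example of a more conservative significance level. The Bonferroni correction is used here only as a familiar illustration, not as the approach these guidelines require, and the marker count is hypothetical.

```python
def expected_false_positives(n_markers: int, alpha: float) -> float:
    """Expected number of null markers with P-values below a test-wise threshold."""
    if n_markers < 0 or not 0.0 <= alpha <= 1.0:
        raise ValueError("n_markers must be non-negative and alpha must lie in [0, 1]")
    return n_markers * alpha


def bonferroni_threshold(n_markers: int, family_alpha: float = 0.05) -> float:
    """Test-wise threshold that keeps the family-wise error rate at or below family_alpha."""
    if n_markers <= 0:
        raise ValueError("n_markers must be positive")
    return family_alpha / n_markers


if __name__ == "__main__":
    n_markers = 100_000  # hypothetical scan size
    print("expected false positives at 0.05:", expected_false_positives(n_markers, 0.05))
    print("Bonferroni test-wise threshold:", bonferroni_threshold(n_markers))
```

The stricter threshold protects against false positives at the cost of reduced power, which is the trade-off noted in the text.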
The accompanying article by Nelson Freimer puts forward a set of guidelines for candidate gene and WGA experiments. Listed subsequently are the key issues that need to be addressed in any WGA study and will be a part of future assessments of these studies as they come for consideration for publication.
- Clear understanding of the criteria used to select samples for study, including geographical location, population or clinical-based selection, source of controls and ethnicity.
- Type of study design used: case-control, trios, sib-pairs etc.
- Population size and evidence of power calculations for the type of study; these should state the limitations of the study and the confidence intervals of any results described.
- Ability to replicate findings (if no suitable cohort exists then the results should be described as preliminary or in need of replication).
Genotyping and marker selection
- Description of platform.
- Clear description of markers used and criteria for selection.
- Files of the markers used to allow future replication of study (Supplementary Material).
- Description of quality control metrics including missing data and duplicate error rates.
- Description of the analysis methods used.
- Criteria used to handle multiple tests.
SUPPLEMENTARY MATERIAL
Supplementary Material is available at HMG Online.
Conflict of Interest statement. None declared.
REFERENCES
1. Ioannidis, J.P., Ntzani, E.E., Trikalinos, T.A. and Contopoulos-Ioannidis, D.G. (2001) Replication validity of genetic association studies. Nat. Genet., 29, 306-309.
2. Corder, E.H., Saunders, A.M., Strittmatter, W.J., Schmechel, D.E., Gaskell, P.C., Small, G.W., Roses, A.D., Haines, J.L. and Pericak-Vance, M.A. (1993) Gene dose of apolipoprotein E type 4 allele and the risk of Alzheimer's disease in late onset families. Science, 261, 921-923.
3. Altshuler, D., Hirschhorn, J.N., Klannemark, M., Lindgren, C.M., Vohl, M.C., Nemesh, J., Lane, C.R., Schaffner, S.F., Bolk, S., Brewer, C. et al. (2000) The common PPARgamma Pro12Ala polymorphism is associated with decreased risk of type 2 diabetes. Nat. Genet., 26, 76-80.
4. Deeb, S.S., Fajas, L., Nemoto, M., Pihlajamaki, J., Mykkanen, L., Kuusisto, J., Laakso, M., Fujimoto, W. and Auwerx, J. (1998) A Pro12Ala substitution in PPARgamma2 associated with decreased receptor activity, lower body mass index and improved insulin sensitivity. Nat. Genet., 20, 284-287.
5. Edwards, A.O., Ritter, R., III, Abel, K.J., Manning, A., Panhuysen, C. and Farrer, L.A. (2005) Complement factor H polymorphism and age-related macular degeneration. Science, 308, 421-424.
6. Klein, R.J., Zeiss, C., Chew, E.Y., Tsai, J.Y., Sackler, R.S., Haynes, C., Henning, A.K., Sangiovanni, J.P., Mane, S.M., Mayne, S.T. et al. (2005) Complement factor H polymorphism in age-related macular degeneration. Science, 308, 385-389.
7. Cardon, L.R. and Abecasis, G.R. (2003) Using haplotype blocks to map human complex trait loci. Trends Genet., 19, 135-140.
8. Risch, N. and Merikangas, K. (1996) The future of genetic studies of complex human diseases. Science, 273, 1516-1517.
9. Cousin, E., Genin, E., Mace, S., Ricard, S., Chansac, C., del Zompo, M. and Deleuze, J.F. (2003) Association studies in candidate genes: strategies to select SNPs to be tested. Hum. Hered., 56, 151-159.
10. Nelson, M.R., Marnellos, G., Kammerer, S., Hoyal, C.R., Shi, M.M., Cantor, C.R. and Braun, A. (2004) Large-scale validation of single nucleotide polymorphisms in gene regions. Genome Res., 14, 1664-1668.
11. Tsunoda, T., Lathrop, G.M., Sekine, A., Yamada, R., Takahashi, A., Ohnishi, Y., Tanaka, T. and Nakamura, Y. (2004) Variation of gene-based SNPs and linkage disequilibrium patterns in the human genome. Hum. Mol. Genet., 13, 1623-1632.
12. Wjst, M. (2004) Target SNP selection in complex disease association studies. BMC Bioinformatics, 5, 92.
13. Kwok, P.Y. and Chen, X. (2003) Detection of single nucleotide polymorphisms. Curr. Issues Mol. Biol., 5, 43-60.
14. Shi, M.M. (2001) Enabling large-scale pharmacogenetic studies by high-throughput mutation detection and genotyping technologies. Clin. Chem., 47, 164-172.
15. Tsuchihashi, Z. and Dracopoli, N.C. (2002) Progress in high throughput SNP genotyping methods. Pharmacogenomics J., 2, 103-110.
16. Coraddu, F., Lai, M., Mancosu, C., Cocco, E., Sawcer, S., Setakis, E., Compston, A. and Marrosu, M.G. (2003) A genome-wide screen for linkage disequilibrium in Sardinian multiple sclerosis. J. Neuroimmunol., 143, 120-123.
17. Goris, A., Sawcer, S., Vandenbroeck, K., Carton, H., Billiau, A., Setakis, E., Compston, A. and Dubois, B. (2003) New candidate loci for multiple sclerosis susceptibility revealed by a whole genome association screen in a Belgian population. J. Neuroimmunol., 143, 65-69.
18. Jonasdottir, A., Thorlacius, T., Fossdal, R., Jonasdottir, A., Benediktsson, K., Benedikz, J., Jonsson, H.H., Sainz, J., Einarsdottir, H., Sigurdardottir, S. et al. (2003) A whole genome association study in Icelandic multiple sclerosis patients with 4804 markers. J. Neuroimmunol., 143, 88-92.
19. Martins, S.B., Thorlacius, T., Benediktsson, K., Pereira, C., Fossdal, R., Jonsson, H.H., Silva, A., Leite, I., Cerqueira, J., Costa, P.P. et al. (2003) A whole genome association study in multiple sclerosis patients from north Portugal. J. Neuroimmunol., 143, 116-119.
20. Laaksonen, M., Jonasdottir, A., Fossdal, R., Ruutiainen, J., Sawcer, S., Compston, A., Benediktsson, K., Thorlacius, T., Gulcher, J. and Ilonen, J. (2003) A whole genome association study in Finnish multiple sclerosis patients with 3669 markers. J. Neuroimmunol., 143, 70-73.
21. Kammerer, S., Roth, R.B., Reneland, R., Marnellos, G., Hoyal, C.R., Markward, N.J., Ebner, F., Kiechle, M., Schwarz-Boeger, U., Griffiths, L.R. et al. (2004) Large-scale association study identifies ICAM gene region as breast and prostate cancer susceptibility locus. Cancer Res., 64, 8906-8910.
22. Ozaki, K., Ohnishi, Y., Iida, A., Sekine, A., Yamada, R., Tsunoda, T., Sato, H., Hori, M., Nakamura, Y. and Tanaka, T. (2002) Functional SNPs in the lymphotoxin-alpha gene that are associated with susceptibility to myocardial infarction. Nat. Genet., 32, 650-654.
23. The International HapMap Consortium (2003) The International HapMap Project. Nature, 426, 789-796.
24. Schork, N.J., Nath, S.K., Fallin, D. and Chakravarti, A. (2000) Linkage disequilibrium analysis of biallelic DNA markers, human quantitative trait loci, and threshold-defined case and control subjects. Am. J. Hum. Genet., 67, 1208-1218.
25. Cardon, L.R. and Bell, J.I. (2001) Association study designs for complex diseases. Nat. Rev. Genet., 2, 91-99.
26. Hirschhorn, J.N., Lohmueller, K., Byrne, E. and Hirschhorn, K. (2002) A comprehensive review of genetic association studies. Genet. Med., 4, 45-61.
27. van den Oord, E.J. and Sullivan, P.F. (2003) A framework for controlling false discovery rates and minimizing the amount of genotyping in the search for disease mutations. Hum. Hered., 56, 188-199.
28. Wang, W.Y., Barratt, B.J., Clayton, D.G. and Todd, J.A. (2005) Genome-wide association studies: theoretical and practical concerns. Nat. Rev. Genet., 6, 109-118.