Human Molecular Genetics, 2002, Vol. 11, No. 4 419-429
© 2002 Oxford University Press
Heterogeneity of linkage disequilibrium in human genes has implications for association studies of common diseases
INSERM U525, Faculté de Médecine, 91 Boulevard de l'Hôpital, 75634 Paris, France
Received October 17, 2001; Revised and Accepted December 19, 2001.
ABSTRACT
Linkage disequilibrium (LD) is the central concept of genetic association studies. Although LD has been shown not to be uniformly distributed across the genome, limited information is available about the characteristics of LD within candidate genes at large. We screened coding and regulatory regions of 50 candidate genes for cardiovascular diseases and identified 228 polymorphisms. The overall sequence diversity was 3.81 ± 0.31 × 10⁻⁴. Intragenic LD was generally very strong, with 40% of the 464 pairs of polymorphisms exhibiting complete LD. However, if we consider D' = 0.7 as an arbitrary limit for useful LD in association studies, 26% of the pairs fell below this threshold, half of them being in negative LD, a situation in which LD is even more difficult to detect. Non-synonymous coding polymorphisms, which are more likely to have a functional role, were more represented among low-frequency alleles and were more often in complete negative LD with other polymorphisms. This implies that in many situations the power to detect the effect of a non-synonymous polymorphism by measuring a nearby marker might be low. Although intragenic LD was partly a function of physical distance, gene-specific patterns of LD were observed, making it difficult to provide general guidelines for selecting the most useful polymorphisms in association studies. For all these reasons, association studies should concentrate on the overall sequence variation of functionally important regions of candidate genes and not only on a few polymorphisms. The variability of important intergenic regions identified by different approaches including comparative genomics will also have to be assessed.
INTRODUCTION
Recently, considerable interest has arisen in the evaluation of sequence diversity and linkage disequilibrium (LD) across the human genome. This interest is justified by the prospect that new high-throughput technologies such as DNA chips will make it possible to perform whole-genome association studies using dense maps of single nucleotide polymorphisms (SNPs) to identify common disease genes.
The ability to detect association between marker alleles and disease critically depends on the extent of LD between disease-causing alleles and surrounding marker alleles. Studies concentrating on the whole genome or on extended regions surrounding genes of interest have consistently shown that LD is not uniformly distributed across the genome but rather clustered in specific genomic regions (1–6). Despite the strong relationship existing at the genome scale between the amount of LD and physical distance, LD is not a strict monotonic function of distance and physically close markers are not necessarily in strong LD with one another (3,4,7,8). Very little is known about factors that govern the distribution of LD in the genome. Chromosome region-specific effects appear more important than population-specific effects in influencing the extent of LD (5). Indeed, differences in LD levels between populations of recent common ancestry proved to be modest, even when comparing isolated and mixed populations (4,5,7–9). Only in small subisolates such as the village of Gavoi in Sardinia (5) or the Saami population in Finland (10) has LD been found to be stronger than in neighboring populations. On the other hand, LD can substantially differ among populations originating from different continents (9,11,12).
The great heterogeneity of the spatial distribution of LD across the genome is a crucial aspect for genome-wide association studies. Indeed, the success of such an approach is highly dependent on the density of SNP maps, which itself should vary according to the regions of the genome. For example, LD has been reported to be stronger in GC-poor sequences due to a lower recombination frequency in these regions (13). Whereas the Human Genome Project initially envisaged a map of 100 000 SNPs (14), other suggestions based on simulations or extrapolations from empirical data varied from 30 000 to 500 000 SNPs or even more (15–17). A high-density SNP map of the human genome consisting of 1.4 million SNPs was recently published by the International SNP Map Working Group (18). However, most of these SNPs are predicted from sequence data and have not yet been confirmed by genotyping assays. Moreover, SNPs of the databases represent only a small fraction of all SNPs (19).
The alternative approach to genome-wide surveys is to focus on polymorphisms located within candidate genes. Even within candidate genes, the extent of LD between putative functional variants and neutral markers is a critical element for the efficiency of the strategy. Although several recent studies have focused on the sequence diversity in candidate genes (20–26), limited information is available about the extent of LD within candidate genes at large. Most previous studies on LD focused on a single gene or regions surrounding genes (3,12,24,26–30). One study reported data on LD distribution for 114 SNPs from 33 genes, but this work was not based on a systematic molecular screening of genes (9). A recent survey of the variability of 313 human genes in 82 individuals of different ancestry showed a great variability of LD across genes, leading the authors to conclude that LD should be determined empirically for any specific genomic region (31).
As part of an ongoing program of exploration of candidate genes for cardiovascular diseases, we screened coding and regulatory regions of 50 human genes. The genes code for growth factors, cytokines, neuromediators, receptors, ligands, and proteins involved in lipid metabolism, vascular and cardiac trophicity, thrombosis and adhesion. The 228 polymorphisms identified by this screening were further genotyped in a population-based sample of 750 European subjects, as part of a larger case-control study on myocardial infarction, the Etude Cas-Témoin sur l'Infarctus du Myocarde (ECTIM) study (32). The large number of polymorphisms allowed us to analyze thoroughly the factors that influence the gene sequence diversity and the extent of LD in candidate genes. More detailed information about data included in this paper can be found on our Internet site (http://genecanvas.idf.inserm.fr/).
RESULTS
Screening of genes
Our polymorphism detection program initially consisted of comparing 40 chromosomes from 20 unrelated patients with myocardial infarction selected from the ECTIM study (33). This sample size, used for the first 38 genes explored, provided approximately 90% power to detect alleles with a frequency >0.05, and 33% power to detect alleles with a frequency of 0.01. In order to increase the probability of detecting rare variants, the number of screened chromosomes from the ECTIM patients was increased up to 190 for the 12 genes explored subsequently. With this sample size, the power to detect alleles with a frequency of 0.01 increased to 85%. The screening was performed on patients to increase the probability of identifying disease-predisposing variants. However, this approach might have missed some rare protective variants.
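The power figures quoted above are consistent with a simple sampling argument: an allele of population frequency f is seen at least once among n chromosomes with probability 1 − (1 − f)^n. The sketch below is only an illustration of that reasoning; it assumes independent chromosomes and a screening method that never misses a variant present in the sample, and the function name `detection_power` is hypothetical rather than taken from the paper.

```python
def detection_power(allele_freq: float, n_chromosomes: int) -> float:
    """Probability that an allele of the given population frequency is
    present at least once among n independently sampled chromosomes.

    Assumes perfect laboratory sensitivity, which is a simplification.
    """
    if not 0.0 <= allele_freq <= 1.0:
        raise ValueError("allele_freq must lie between 0 and 1")
    if n_chromosomes < 0:
        raise ValueError("n_chromosomes must be non-negative")
    return 1.0 - (1.0 - allele_freq) ** n_chromosomes


if __name__ == "__main__":
    # The three scenarios mentioned in the text.
    for freq, n in [(0.05, 40), (0.01, 40), (0.01, 190)]:
        power = detection_power(freq, n)
        print(f"frequency {freq:.2f}, {n} chromosomes: power = {power:.2f}")
```

Under these assumptions the three scenarios give about 0.87, 0.33 and 0.85, in line with the values reported for 40 and 190 chromosomes.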
Description of genes
Sixteen genes were located on the short arm and 34 on the long arm of chromosomes (Table 1). The exact location of genes on chromosomes was obtained from the Ensembl database (http://www.ensembl.org/). From the length of chromosomes provided in the draft of the human genome sequence (34), the short and long arms of each chromosome were divided into quartiles and three distinct chromosomal regions were defined: telomeric (quartiles proximal to pter and qter), centromeric (quartiles surrounding the centromere) and intermediate (two middle quartiles of each arm). According to this classification, 12 genes were located in centromeric, 26 in intermediate and 12 in telomeric regions (Table 1).
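The quartile-based classification described above can be sketched as follows. This is an illustrative reading of the rule, not the authors' code: the function name `classify_region` is hypothetical, the centromere is treated as a single point, and all coordinates are assumed to be measured from pter in the same unit.

```python
def classify_region(position: float, centromere: float, chrom_length: float) -> str:
    """Classify a locus as 'centromeric', 'intermediate' or 'telomeric'.

    Each arm is divided into quartiles: the quartile next to the centromere
    is centromeric, the quartile next to pter or qter is telomeric, and the
    two middle quartiles are intermediate.
    """
    if not 0 < centromere < chrom_length:
        raise ValueError("centromere must lie strictly inside the chromosome")
    if not 0 <= position <= chrom_length:
        raise ValueError("position lies outside the chromosome")

    if position < centromere:
        # Short (p) arm: relative distance from the centromere towards pter.
        fraction = (centromere - position) / centromere
    else:
        # Long (q) arm: relative distance from the centromere towards qter.
        fraction = (position - centromere) / (chrom_length - centromere)

    if fraction < 0.25:
        return "centromeric"
    if fraction >= 0.75:
        return "telomeric"
    return "intermediate"


if __name__ == "__main__":
    # Hypothetical chromosome of 140 Mb with the centromere at 40 Mb.
    for pos in (5, 35, 90, 130):
        print(pos, classify_region(pos, centromere=40, chrom_length=140))
```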
The characteristics of the 50 genes were roughly similar to the basic characteristics of human genes provided by the draft genome sequence (34), except for exon length, which was larger in our study because it included untranslated exonic sequences (Table 2).
In the set of 50 genes, the total lengths screened were 60.8 kb in 5' regions, 7.8 kb in 5'-untranslated region (5'-UTR), 67.6 kb in coding sequences, 9.9 kb in 3' regions, and 23.6 kb in 3'-UTR. The total number of identified polymorphisms was 228 [mean per gene: 4.6 (range 0–24)], or a frequency of one polymorphism per 744 bp. The number of polymorphisms identified in each gene highly correlated with the sequence length screened (Spearman's correlation coefficient: 0.51, P < 10⁻³). Of the 228 polymorphisms, 143 (63%) were in non-coding regions and 85 (37%) in coding sequences. Among the coding polymorphisms, 54 (64%) led to replacement of an amino acid, a proportion similar to that recently reported in 313 human genes (31). Among non-synonymous polymorphisms, 33% created a non-conservative amino acid change defined according to the BLOSUM62 substitution matrix (35). The vast majority of polymorphisms were SNPs (87%), while 7% were insertion/deletion variations and 6% were repeat polymorphisms.
The 228 polymorphisms identified by molecular screening were then genotyped in a sample of 750 European male subjects aged 25–64 years serving as the control group in the ECTIM case-control study on myocardial infarction (32). This sample, representative of the general population, allowed us to estimate different population genetics parameters, including allele and haplotype frequencies, sequence diversity and LD measures.
Distribution of allele frequency
Among diallelic polymorphisms, the mean frequency ± SD of the minor allele was 0.19 ± 0.16 and the mean heterozygosity was 0.26 ± 0.18. Allele frequency did not significantly differ between gene regions (Fig. 1). Thirty-nine percent of polymorphisms had a minor allele frequency lower than 0.10. The distribution of low-frequency alleles was not significantly different between coding and regulatory regions (36 versus 44%, P = 0.22), but low-frequency alleles were more represented among non-synonymous than among synonymous coding SNPs (54 versus 26%, P = 0.018), and among non-degenerate than among 2-fold and 4-fold degenerate sites (56, 21 and 31%, respectively, P = 0.033), as also shown in Figure 1.
Sequence diversity
The mean sequence diversity (π), according to location of genes on chromosomes and gene regions, is shown in Table 3. The overall π ± SE across the 50 genes was 3.81 ± 0.31 × 10⁻⁴, which means that two randomly chosen sequences from the population were expected to differ approximately every 2600 bp. Diversity was not significantly different between the first set of genes screened using 40 chromosomes and the second set screened using 190 chromosomes (3.77 ± 0.34 versus 3.96 ± 0.74), even though rare variants are expected to be more represented in the latter set. However, the contribution of rare variants to nucleotide diversity is small because of their low heterozygosity.
There was a great variability of diversity across genes, with five genes being monomorphic [ADRB1, CSF1, ECE, HGF and FGA (in FGA, only the 5' region was screened)] and, at the other extreme, two genes having a diversity higher than 12 × 10⁻⁴ [LGALS3 and LPA (in LPA, only the 5' region was screened)] (Fig. 2).
Sequence diversity did not significantly differ between genes located on long arms and short arms of chromosomes, nor did it differ according to location on the chromosome (Table 3). Sequence diversity was higher in non-coding than in coding regions (4.30 ± 0.44 versus 3.06 ± 0.42 × 10⁻⁴, P = 0.04), in synonymous than in non-synonymous sequences (5.63 ± 1.17 versus 2.26 ± 0.42 × 10⁻⁴, P = 0.007) and in 4-fold degenerate than in non-degenerate sites (4.72 ± 1.23 versus 2.16 ± 0.43 × 10⁻⁴, P = 0.05) with intermediate values found in 2-fold degenerate sites (3.44 ± 1.06). Diversity also tended to be higher among conservative than among non-conservative sites (2.74 ± 0.64 versus 1.28 ± 0.42 × 10⁻⁴, P = 0.06). All these findings support the notion of higher selection pressure on nucleotide sites that may affect protein function.
Pairwise LD patterns
LD could be estimated for 464 pairs of intragenic polymorphisms. LD was quantified using either Lewontin's standardized coefficient D' or the correlation coefficient r. D' has the advantage of being less sensitive to allele frequencies than is r (36), although it is not completely independent of them (37). By convention, LD is positive when the rare alleles of each polymorphism are preferentially associated and negative when the rare and the frequent alleles are associated (38). In the following, D' and r will refer to their absolute values. Both measures vary between 0 (absence of LD) and 1 (complete LD). However, r attains the maximum value of 1 only in case of complete association between the two polymorphisms, i.e. when two of the four haplotypes are missing, whereas D' = 1 indicates complete LD (one or two haplotypes are missing). Spearman's correlation coefficient between D' and r was 0.39 (P < 10⁻⁴). The mean D' ± SD was 0.78 ± 0.31 and the median was 0.95 (range 0–1), whereas the mean r was 0.39 ± 0.34 and the median 0.25 (range 0–1), both measures reflecting a strong non-random association between intragenic polymorphisms. However, the distribution of LD also showed a non-negligible proportion of SNPs to be in weak disequilibrium with each other (Fig. 3). Negative and positive LD coefficients were equally frequent (50% each).
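The contrast between D' and r can be made concrete with a small calculation on two-locus haplotype frequencies. The sketch below is a simplified illustration that assumes the four haplotype frequencies are already known (it does not reproduce any estimation procedure from genotype data), and the function name `pairwise_ld` and the example frequencies are hypothetical. The sign follows the convention stated above: D is positive when the two rare alleles are preferentially associated.

```python
from math import sqrt


def pairwise_ld(f_mm: float, f_mM: float, f_Mm: float, f_MM: float):
    """Return (D, D', r) for two diallelic loci.

    Arguments are the four haplotype frequencies, where 'm' is the minor
    (rare) allele and 'M' the major allele; f_mM, for example, carries the
    minor allele at locus 1 and the major allele at locus 2.
    """
    if abs(f_mm + f_mM + f_Mm + f_MM - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    p1 = f_mm + f_mM          # minor allele frequency at locus 1
    p2 = f_mm + f_Mm          # minor allele frequency at locus 2
    q1, q2 = 1.0 - p1, 1.0 - p2
    if min(p1, q1, p2, q2) <= 0.0:
        raise ValueError("both loci must be polymorphic")

    d = f_mm - p1 * p2        # > 0 when the two rare alleles travel together
    # Largest |D| compatible with the allele frequencies, given the sign of D.
    d_max = min(p1 * q2, q1 * p2) if d >= 0 else min(p1 * p2, q1 * q2)
    d_prime = d / d_max
    r = d / sqrt(p1 * q1 * p2 * q2)
    return d, d_prime, r


if __name__ == "__main__":
    # One missing haplotype (rare-rare absent): complete but negative LD.
    print(pairwise_ld(0.0, 0.1, 0.3, 0.6))
    # Two missing haplotypes: complete association between the two loci.
    print(pairwise_ld(0.2, 0.0, 0.0, 0.8))
```

In the first example |D'| equals 1 while |r| stays near 0.22, which illustrates why a pair can be in complete LD and still be a poor proxy in terms of r; in the second example both measures reach 1.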
In 61% of pairs, all four pairwise haplotypes were observed. In the remaining 39% of pairs, LD was complete, i.e. one haplotype (25% of pairs) or two haplotypes (14% of pairs) were missing. Two different mechanisms can generate complete LD: either a recent mutation arising on an existing allele and generating a new rare haplotype, or the loss of existing haplotype(s) by selection or random drift. The first mechanism is expected in most cases to generate negative LD since new mutations have a higher probability of arising on frequent alleles. This was indeed observed, since 79% of pairs with one missing haplotype exhibited negative LD. Moreover, pairs with one missing haplotype more often involved a rare allele than other pairs of polymorphisms (0.07 versus 0.19 for the frequency of the less frequent allele of the pair, P < 10⁻⁴). The situation where two haplotypes were missing reflected a complete concordance between alleles of the pair. This complete association between polymorphisms could extend over more than two polymorphisms. For example, in the AGTR1 and the PDGFRA genes, the complete association involved six polymorphisms.
Factors influencing the magnitude of LD
LD did not significantly differ between p and q arms of chromosomes (median D': 0.91 versus 0.96, P = 0.55; median r: 0.32 versus 0.23, P = 0.22) but correlated positively with the length of the chromosome arm (Spearman's correlation coefficient with D': 0.13, P = 0.004; with r: 0.19, P < 10⁻⁴). The proportion of complete LD was lower in telomeric regions than in intermediate and centromeric regions (16, 50 and 47%, respectively; P < 10⁻⁴), and accordingly D' was lower in telomeric regions (median: 0.85, 0.99 and 0.96, respectively; P < 10⁻⁴).
The magnitude of LD was stronger between adjacent than between non-adjacent polymorphisms (median D': 0.98 versus 0.91, P < 10⁻³; median r: 0.30 versus 0.24, P = 0.05). Accordingly, LD was stronger within regions than between regions of genes (Fig. 4A). For example, the median D' was 1 in pairs in which both polymorphisms were located in the 5' region (n = 135) and dropped to 0.5 in pairs involving one polymorphism at the 5' end of the gene and the other one at the 3' end (n = 36). A similar decrease was observed with r (Fig. 4A). This result indicates that even at the gene scale, LD is partly a function of physical distance. The 5' sequences were characterized by a higher proportion of pairs in complete LD or complete association (Fig. 3B). A likely explanation for this observation is that polymorphisms in the 5' regions were generally physically closer than polymorphisms in coding sequences which might be separated by large introns.
Pairs of polymorphisms were classified into three groups on the basis of allele frequencies: rare (at least one of the two polymorphisms with minor allele frequency <0.10, n = 169); common (both polymorphisms with minor allele frequency ≥0.20, n = 150); intermediate (all others, n = 142). There was a higher proportion of complete LD in the rare group (57, 35 and 24%, respectively; P < 10⁻⁴) and accordingly, D' was higher in this group (median: 1, 0.91 and 0.91 in rare, intermediate and common groups, respectively; P < 10⁻⁴). An inverse relationship was observed with r (median: 0.11, 0.28 and 0.56; P < 10⁻⁴), but this was only the reflection of the strong dependency between r and allele frequencies. Since the age of a mutation is proportional to its frequency, rare mutations on average have a more recent origin than common polymorphisms; LD is therefore more extensive around them, because there has been less time for recombination to break it down. The rare group was also characterized by a higher proportion of negative LD (69, 44 and 35%, respectively; P < 10⁻⁴), since new mutations have a higher probability of arising on frequent alleles.
Pairs of coding polymorphisms were classified into two groups according to whether both polymorphisms were silent changes (n = 14), or at least one polymorphism of the pair was non-synonymous (n = 78). In pairs involving non-synonymous substitutions, LD tended to be higher (median D': 0.93 versus 0.81, P = 0.06; median r: 0.32 versus 0.25, P = 0.05) but was more often of negative sign (59 versus 21%, P = 0.01) than in pairs composed of two silent substitutions. This was likely to be explained by the fact that non-synonymous polymorphisms more often belonged to the rare group. No significant difference was observed when pairs of non-synonymous polymorphisms were classified according to whether or not they involved a non-conservative substitution (data not shown).
Gene-average LD
The gene-average LD, defined as the mean of all the pairwise LD coefficients within the gene, could be estimated in 35 of the 50 genes. Among the 15 remaining genes, five were not polymorphic and six had only one polymorphism, and in the remaining four genes, the polymorphisms involved were too rare or multiallelic. In the 35 genes, the mean gene-average LD measured by D' was 0.88 ± 0.14, with a median of 0.93 (range: 0.44–1.00) (the maximum was reached in nine genes). The mean gene-average LD measured by r was 0.43 ± 0.28, with a median of 0.37 (range: 0.05–1.00) (maximum reached in two genes). The two gene-average LD measures correlated weakly, due to a large number of genes having a high gene-average D' and a low gene-average r (Fig. 5). Again, this reflects the greater dependency of r on allele frequencies. As expected from the relationship between LD extent and physical distance, we observed an inverse association between the gene-average LD quantified by D' and the gene length explored (Spearman's correlation coefficient: −0.51, P = 0.002). The correlation was weaker when LD was quantified by r (Spearman's correlation coefficient: −0.17, P = 0.32).
Extended haplotypes
Because of the strong intragenic LD, each gene exhibited a limited number of extended haplotypes (i.e. haplotypes combining all polymorphisms), as already reported for a few genes (19). For example, in genes having four or more common SNPs, the number of main haplotypes observed (i.e. accounting for >80% of all observed chromosomes) represented in any gene less than 20% of all possible haplotypes that would exist in the absence of LD (Table 4). The largest number of haplotypes was observed in the ITGB2 gene in which eight common SNPs generated 18 main haplotypes. In contrast, in the APOB gene, the 19 common SNPs generated only 11 main haplotypes. In 13 out of the 21 genes shown in Table 4, no single haplotype had a frequency ≥50%, supporting the notion that, most often, there is no predominant form of a gene, as recently reported in a study of 313 human genes (31).
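To put the haplotype counts in perspective, k diallelic SNPs can in principle form 2^k haplotypes in the absence of LD. The short sketch below computes the share of that space taken up by the main haplotypes for the two genes named in the text; the function name is hypothetical and the calculation is only an illustration of the comparison, not a reproduction of Table 4.

```python
def fraction_of_possible_haplotypes(n_snps: int, n_main_haplotypes: int) -> float:
    """Share of the 2**n_snps possible haplotypes represented by the main ones."""
    if n_snps < 1:
        raise ValueError("at least one SNP is required")
    possible = 2 ** n_snps   # every combination of two alleles at each SNP
    if not 0 <= n_main_haplotypes <= possible:
        raise ValueError("haplotype count exceeds the number of possible haplotypes")
    return n_main_haplotypes / possible


if __name__ == "__main__":
    # SNP and haplotype counts as given in the text for ITGB2 and APOB.
    for gene, n_snps, n_main in [("ITGB2", 8, 18), ("APOB", 19, 11)]:
        share = fraction_of_possible_haplotypes(n_snps, n_main)
        print(f"{gene}: {n_main} of {2 ** n_snps} possible haplotypes ({share:.4%})")
```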
DISCUSSION
Recent systematic surveys of the sequence variation of human genes have provided a wealth of information on the natural variability of genes within and between populations. Human gene diversity has been shown to be greater in African populations than in European populations (20,21,31) and lower in Asian populations (25). Our data indicate that in European populations, two copies of an average gene chosen at random are expected to differ approximately every 2600 bp. Because rare variants were underdetected by the screening method of the first 38 genes, the figures reported here represent a lower bound of the actual DNA sequence diversity in the explored genes, although rare variants weakly contribute to the extent of diversity. Gene diversity was also probably underestimated by the fact that we screened only coding and regulatory regions of genes. Indeed, a higher diversity has been reported in introns than in coding regions (21,31). However, we do not know how intronic polymorphisms might affect the extent of LD within candidate genes.
Two recent systematic surveys of sequence variations in a large number of genes have reported slightly higher values of nucleotide diversity in American individuals of European descent than in our study (20,21). Several reasons might explain these differences. The screening methods used [PCR/single strand conformation polymorphism (SSCP) in our study, high-density variant detection arrays and denaturing HPLC in the two other studies] may have different rates of false positives and false negatives for SNP discovery. Since all SNPs identified in our study were further genotyped in 750 subjects, we could exclude any false positive, unlike the two other reports. On the other hand, the SSCP method has a sensitivity of about 90% and it is likely that we missed some variants. Another reason might be that the European population considered in our study had a greater genetic homogeneity than the American population of European descent, which is more admixed. Finally, as shown by the three referenced studies and by others (23–26,39), there is a substantial variability in the sequence diversity among genes.
In accordance with previous reports (20,21), we found a lower diversity in non-synonymous than in synonymous sequences and among non-conservative than among conservative sites, two observations which are consistent with the operation of natural selection. Moreover, nucleotide diversity in non-coding sequences (mostly 5'- and 3'-UTR) was lower than in synonymous sequences. This finding is compatible with a greater selection pressure operating on 5'- and 3'-UTR regions because they contain sequences that are important for regulating gene expression and splicing. Indeed, the rate of neutral substitution within genes can be best approximated by the rate of substitution at 4-fold degenerate sites (40). From our data, we could estimate the sequence diversity produced by neutral evolution in coding sequences to be around 4.7 × 10⁻⁴, which means one difference per 2100 bp. The International SNP Map Working Group recently reported a nucleotide diversity for the whole human genome of 7.5 × 10⁻⁴ (18). Although this may be an over-estimation of the true diversity because not all SNPs were confirmed, it nevertheless suggests that diversity is much higher in intergenic than in intragenic regions of the genome.
The results of our study provide information on LD which should be useful for the design of association studies. In particular, they suggest that all polymorphisms in coding and regulatory sequences within a gene should be investigated, and not only a few markers selected a priori. Indeed, even at the gene scale, LD was not uniformly distributed, as already reported (31), and a significant proportion of intra-gene polymorphisms were found to be in weak LD with each other. This was especially true when polymorphisms were located in distant regions of the gene, indicating that, even at the gene scale, LD is partly a function of physical distance. However, because each gene, and even each sub-region of genes, has its own history and its own pattern of LD (12,31), it is nearly impossible to predict the extent of association between polymorphisms from their physical location and to determine which polymorphisms will be the most useful in association studies. For example, in the β fibrinogen (FGB) gene, polymorphisms in nearly complete association were observed at both ends of the gene while intermediate polymorphisms were in weaker disequilibrium (41). In that case, all the polymorphisms in complete association would provide the same information, but it would not be possible to predict which one is the most likely to be functional. An opposite example is that of the angiotensin II type 1 receptor (AGTR1) gene, in which all polymorphisms of the promoter were in strong association with each other but were not in LD with polymorphisms of the coding region because of the presence of a large intron (33). In that case, the effect of potential functional variants of the coding region would not be detected if only polymorphisms of the promoter were typed.
If we consider D' = 0.7 as an arbitrary limit for useful LD in association studies, short of resorting to prohibitive sample sizes (Fig. 6), 26% of the pairs would fall below this threshold. The minimal value of useful LD would be even higher in the presence of negative LD since in that case the power to detect LD is generally much lower (38). Besides the magnitude of LD, an element critical for the power of association tests is the closeness between the frequencies of disease-causing alleles and marker alleles (42,43). These problems are illustrated in Figure 6, which shows the increase in sample size in case of discordance between marker and disease alleles, and in the presence of negative LD compared to positive LD. In our sample, only 22% of pairs of polymorphisms had an r value ≥0.7, reflecting both a high amount of LD and a closeness of allele frequencies.
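A commonly used rule of thumb, which is not the calculation behind Figure 6 and is given here only as a rough guide, is that testing a marker instead of the causal variant inflates the required sample size by a factor of about 1/r², where r is the correlation between the two. The sketch below applies that approximation; the function name `sample_size_inflation` is hypothetical.

```python
def sample_size_inflation(r: float) -> float:
    """Approximate factor by which sample size must grow when a marker in LD
    with the causal variant (correlation r) is typed instead of the variant.

    Uses the standard 1 / r**2 approximation, an assumption introduced here
    for illustration rather than the method used in the paper.
    """
    if not 0.0 < abs(r) <= 1.0:
        raise ValueError("r must be non-zero and at most 1 in absolute value")
    return 1.0 / (r * r)


if __name__ == "__main__":
    # 0.7 is the threshold discussed in the text; 0.25 is the median r reported.
    for r in (1.0, 0.7, 0.25):
        print(f"r = {r:.2f}: sample size x {sample_size_inflation(r):.1f}")
```

Under this approximation, a marker with r = 0.7 roughly doubles the sample size needed, while a marker at the median r of 0.25 would require a sample about 16 times larger.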
We found that non-synonymous polymorphisms were more represented among low-frequency SNPs and were most often in negative LD with other markers. Figure 6 suggests that in many situations the power to detect the effect of a non-synonymous polymorphism by measuring a nearby marker might be low, even at the gene scale. The difficulty might be even greater when using SNPs from the public databases since they are biased towards high-frequency alleles (44) and in many cases there might be a great discordance between putatively functional variants and markers. For all these reasons, we believe not only that the gene-focused approach should be preferred to a genome-wide approach, but also that it should concentrate on the overall sequence variation of candidate genes and not only on a few polymorphisms, as also advocated by others (19,31). Another reason for favoring exhaustive gene-focused studies is the potential co-existence of several functional variants within the same gene, as has now been shown in several genes. Studying a single or few polymorphisms might obscure a complex genotype–phenotype relationship which might be revealed only by extended haplotype analysis.
We found that the extent of LD was lower in telomeric regions than in centromeric regions of chromosomes, a finding which is in agreement with the results of a recent publication examining the genome-wide distribution of LD (6). This finding is likely to be explained by the higher recombination rates in telomeric regions (34). However, caution is needed in the interpretation of our results, since the number of genes in the telomeric and centromeric regions was small and we cannot rule out gene-specific effects. Similarly, the higher recombination rates reported on shorter chromosome arms (34) probably explain the positive correlation that we observed between the extent of LD and the length of chromosome arms.
One limitation of our study is that it was restricted to European individuals. Indeed, genetic diversity and LD extent have been shown to differ between populations of different origin such as African, European or Asian populations. From a perspective of population genetics, studying contrasted populations is certainly an advantage. Also, populations with greater genetic diversity like Africans may be more informative for fine-scale mapping of functional polymorphisms (45). However, if the purpose is to provide some guidelines useful for the design of association studies in a given population, information drawn from that population should be preferable, since a large number of rare polymorphisms are expected to be private to single populations (31).
Large research programs are currently under way in private and academic laboratories to identify all common forms of human genes. The whole-genome approach, requiring high-throughput technologies, will undoubtedly allow the identification of new genetic determinants of common traits. However, it will also undoubtedly miss a number of important genes. This is why this approach may fit, at least for some time, the objectives of private companies, while academic laboratories should probably concentrate on more exhaustive approaches.
With respect to the gene-focused approach, several important issues still deserve consideration. Should the exploration of genes be extended to all intronic polymorphisms, as suggested by some authors (24,30), which would mean multiplication of the number of polymorphisms by at least 10? How to deal with all the information generated by such extensive explorations? Which regions outside genes should be explored? The variability of important intergenic regions identified by different approaches including comparative genomics will also have to be assessed in the future.
MATERIALS AND METHODS
The ECTIM population-based samples
The ECTIM study (32) is a case-control study of myocardial infarction conducted in Northern Ireland (Belfast) and France (Lille, Strasbourg and Toulouse). Patients with myocardial infarction were selected from the WHO MONICA (monitoring in cardiovascular disease) registers of each region. In each region, a control group composed of men aged 25–64 years was randomly sampled from the population covered by the register. Subjects were of European origin and informed consent was obtained from all participants. All population-related statistics given in the present paper were estimated in the 750 population-based control subjects.
Screening of the genes and genotyping of polymorphisms
Genomic DNA was prepared from white blood cells by phenol extraction. For the first 38 genes, detection of polymorphisms was performed by comparing 40 chromosomes of 20 unrelated patients drawn from the ECTIM study. The number of chromosomes was increased up to 190 for the 12 following genes. The method of detection of polymorphisms was PCR/SSCP, followed by direct sequencing of fragments presenting a different SSCP pattern of migration, as described previously (22). Polymorphisms identified were then genotyped in the 750 control subjects by allele-specific oligonucleotides. All information needed for genotyping polymorphisms can be obtained at our Internet site (http://genecanvas.idf.inserm.fr/). The gene abbreviations used in this paper are taken from the OMIM database (http://www.ncbi.nlm.nih.gov/Omim).
Classification of polymorphisms
The protein sequence of each gene was obtained from the Network Protein Sequence @nalysis database (http://pbil.ibcp.fr/). The number of non-degenerate, 2-fold and 4-fold degenerate sites was calculated from the protein sequence. The number of synonymous sites was calculated as the sum of 4-fold degenerate sites and one-third of 2-fold degenerate sites, and the number of non-synonymous sites was the sum of non-degenerate sites and two-thirds of 2-fold degenerate sites (46). Conservative and non-conservative amino acid substitutions were defined according to the BLOSUM62 substitution matrix (35). For calculating the number of conservative and non-conservative sites, we applied the same proportions as those estimated in the study of 106 human genes by Cargill et al. (20).
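The site-counting rule described above reduces to two weighted sums. The following sketch applies it to counts of non-degenerate, 2-fold and 4-fold degenerate sites; the function name and the example counts are hypothetical and serve only to illustrate the arithmetic.

```python
def synonymous_site_counts(non_degenerate: float, twofold: float, fourfold: float):
    """Return (synonymous_sites, non_synonymous_sites).

    A 4-fold degenerate site counts fully as synonymous, a 2-fold degenerate
    site counts one-third synonymous and two-thirds non-synonymous, and a
    non-degenerate site counts fully as non-synonymous.
    """
    if min(non_degenerate, twofold, fourfold) < 0:
        raise ValueError("site counts must be non-negative")
    synonymous = fourfold + twofold / 3.0
    non_synonymous = non_degenerate + 2.0 * twofold / 3.0
    return synonymous, non_synonymous


if __name__ == "__main__":
    # Hypothetical coding sequence of 1000 sites.
    syn, nonsyn = synonymous_site_counts(non_degenerate=650, twofold=200, fourfold=150)
    print(f"synonymous sites: {syn:.1f}, non-synonymous sites: {nonsyn:.1f}")
```

The two totals always add up to the number of sites classified, which provides a quick consistency check on the counts.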
Statistical analysis
Since, for the vast majority of polymorphisms, allele frequencies and pairwise LD coefficients did not statistically differ between Northern Ireland and France, subjects from the two populations were pooled for analysis. Allele frequencies were estimated by gene counting. The nucleotide diversity (π), defined as the expected number of differences, per nucleotide site, between a random pair of chromosomes, was calculated as the mean nucleotide heterozygosity averaged across all nucleotide sites, including monomorphic sites. The standard error was calculated as described in Nei (46). Comparison of diversity between gene regions was performed by t-test.
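As a rough illustration of this definition, the sketch below averages per-site heterozygosity over all screened sites, with monomorphic sites contributing zero. It assumes diallelic polymorphisms with heterozygosity 2p(1-p); the optional n/(n-1) sample-size correction, the function name and the example frequencies are assumptions made for illustration, and the standard error is not computed here.

```python
def nucleotide_diversity(allele_freqs, n_sites_screened, n_chromosomes=None):
    """Mean heterozygosity per nucleotide site.

    allele_freqs     : frequency of one allele at each diallelic polymorphic site
    n_sites_screened : all sites screened, monomorphic ones included
    n_chromosomes    : if given, apply the n/(n-1) small-sample correction
                       (an assumption; omit it for the uncorrected value)
    """
    if n_sites_screened < len(allele_freqs):
        raise ValueError("more polymorphic sites than screened sites")
    total = 0.0
    for p in allele_freqs:
        h = 2.0 * p * (1.0 - p)  # heterozygosity of a diallelic site
        if n_chromosomes is not None:
            h *= n_chromosomes / (n_chromosomes - 1)
        total += h
    # Monomorphic sites add nothing to the sum but count in the denominator.
    return total / n_sites_screened


# Hypothetical example: two polymorphisms found in 1000 screened sites.
pi = nucleotide_diversity([0.5, 0.1], n_sites_screened=1000)
print(pi)
```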
Pairwise LD was estimated by log-linear model analysis (47). LD was quantified using either Lewontin's coefficient $D'$ ($D' = D/D_{\max/\min}$) or the correlation coefficient $r$ [$r = D/(p_1 q_1 p_2 q_2)^{1/2}$] (48). For each gene, the gene-average LD was calculated by averaging the absolute value of the $k(k-1)/2$ pairwise disequilibrium coefficients $D'$ or $r$ ($k$ = number of diallelic polymorphisms within a gene). Comparison of LD coefficients between groups was performed by the non-parametric Kruskal–Wallis test. Frequencies of extended haplotypes combining all polymorphisms within a gene were estimated by use of the Arlequin program (49).
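The two coefficients can be sketched as follows. This is a simplified illustration and not the log-linear estimation procedure of reference 47: it assumes that the frequency of the haplotype carrying both alleles is already known, and the function names and example frequencies are hypothetical. D is normalised by its largest attainable magnitude given the allele frequencies, so that D' keeps the sign of D.

```python
import math


def ld_coefficients(p_ab, p_a, p_b):
    """Return (D', r) for two diallelic loci.

    p_ab : frequency of the haplotype carrying allele A and allele B
    p_a  : frequency of allele A at the first locus
    p_b  : frequency of allele B at the second locus
    """
    if not (0.0 < p_a < 1.0 and 0.0 < p_b < 1.0):
        raise ValueError("both loci must be polymorphic")
    d = p_ab - p_a * p_b
    if d >= 0:
        bound = min(p_a * (1 - p_b), (1 - p_a) * p_b)  # maximum positive D
    else:
        bound = min(p_a * p_b, (1 - p_a) * (1 - p_b))  # maximum magnitude of negative D
    d_prime = d / bound
    r = d / math.sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))
    return d_prime, r


def gene_average(coefficients):
    """Average of the absolute values of the pairwise coefficients of a gene."""
    return sum(abs(c) for c in coefficients) / len(coefficients)


# Hypothetical pair in complete negative LD: the two alleles never occur together.
d_prime, r = ld_coefficients(p_ab=0.0, p_a=0.1, p_b=0.3)
print(d_prime, round(r, 3))

# Hypothetical gene with k = 3 polymorphisms, hence 3 pairwise D' values.
print(gene_average([-1.0, 0.5, 1.0]))
```

In this example D' reaches its extreme value while the magnitude of r stays small, which shows why the two measures can rank the same pair of polymorphisms very differently.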
| ACKNOWLEDGEMENTS |
|---|
We thank Dominique Arveiler, Gérald Luc, Jean-Bernard Ruidavets and Alun Evans for recruitment of the ECTIM subjects, and Christiane Souriau for DNA extraction. The recruitment in the ECTIM study was supported by grants from the Squibb Laboratory, the British Heart Foundation, the Mutuelle Générale de l'Education Nationale, INSERM and the Institut Pasteur de Lille. The genetic program was partly supported by an agreement between INSERM and the Merck Sharpe and Dohme Chibret Company. S.-M.Herrmann was supported by a grant from the Deutsche Forschungsgemeinschaft (HE 2852/1-1).
| FOOTNOTES |
|---|
+ To whom correspondence should be addressed. Tel: +33 1 40 77 96 70; Fax: +33 1 40 77 97 28; Email: tiret@idf.inserm.fr
Present address: Stefan-Martin Herrmann, Institute of Clinical Pharmacology and Toxicology, Department of Clinical Pharmacology, Freie Universität Berlin, Germany
| REFERENCES |
|---|
1 Abecasis,G., Noguchi,E., Heinzmann,A., Traherne,J., Bhattacharyya,S., Leaves,N., Anderson,G., Zhang,Y., Lench,N., Carey,A. et al. (2001) Extent and distribution of linkage disequilibrium in three genomic regions. Am. J. Hum. Genet., 68, 191–197.
2 Huttley,G., Smith,M., Carrington,M. and O'Brien,S. (1999) A scan for linkage disequilibrium across the human genome. Genetics, 152, 1711–1722.
3 Moffatt,M., Traherne,J., Abecasis,G. and Cookson,W. (2000) Single nucleotide polymorphism and linkage disequilibrium within the TCR α/δ locus. Hum. Mol. Genet., 9, 1011–1019.
4 Taillon-Miller,P., Bauer-Sardina,I., Saccone,N., Putzel,J., Laitinen,T., Cao,A., Kere,J., Pilia,G., Rice,J. and Kwok,P. (2000) Juxtaposed regions of extensive and minimal linkage disequilibrium in human Xq25 and Xq28. Nat. Genet., 25, 324–328.
5 Zavattari,P., Deidda,E., Whalen,M., Lampis,R., Mulargia,A., Loddo,M., Eaves,I., Mastio,G., Todd,J. and Cucca,F. (2000) Major factors influencing linkage disequilibrium by analysis of different chromosome regions in distinct populations: demography, chromosome recombination frequency and selection. Hum. Mol. Genet., 9, 2947–2957.
6 Service,S., Ophoff,R. and Freimer,N. (2001) The genome-wide distribution of background linkage disequilibrium in a population isolate. Hum. Mol. Genet., 10, 545–551.
7 Eaves,I., Merriman,T., Barber,R., Nutland,S., Tuomilehto-Wolf,E., Tuomilehto,J., Cucca,F. and Todd,J. (2000) The genetically isolated populations of Finland and Sardinia may not be a panacea for linkage disequilibrium mapping of common disease genes. Nat. Genet., 25, 320–323.
8 Dunning,A., Durocher,F., Healey,C., Teare,M., McBride,S., Carlomagno,F., Xu,C., Dawson,E., Rhodes,S., Ueda,S. et al. (2000) The extent of linkage disequilibrium in four populations with distinct demographic histories. Am. J. Hum. Genet., 67, 1544–1554.
9 Goddard,K., Hopkins,P., Hall,J. and Witte,J. (2000) Linkage disequilibrium and allele-frequency distributions for 114 single-nucleotide polymorphisms in five populations. Am. J. Hum. Genet., 66, 216–234.
10 Laan,M. and Pääbo,S. (1997) Demographic history and linkage disequilibrium in human populations. Nat. Genet., 17, 435–438.
11 Peterson,R.J., Goldman,D. and Long,J.C. (1999) Effects of worldwide population subdivision on ALDH2 linkage disequilibrium. Genome Res., 9, 844–852.
12 Reich,D., Cargill,M., Bolk,S., Ireland,J., Sabeti,P., Richter,D., Lavery,T., Kouyoumjian,R., Farhadian,S., Ward,R. and Lander,E. (2001) Linkage disequilibrium in the human genome. Nature, 411, 199–204.
13 Eisenbarth,I., Striebel,A., Moschgath,E., Vogel,W. and Assum,G. (2001) Long-range sequence composition mirrors linkage disequilibrium pattern in a 1.13 Mb region of human chromosome 22. Hum. Mol. Genet., 10, 2833–2839.
14 Collins,F.S., Patrinos,A., Jordan,E., Chakravarti,A., Gesteland,R. and Walters,L. (1998) New goals for the U.S. Human Genome Project: 1998–2003. Science, 282, 682–689.
15 Collins,F., Guyer,M. and Chakravarti,A. (1997) Variations on a theme: cataloging human DNA sequence variation. Science, 278, 1580–1581.
16 Collins,A., Lonjou,C. and Morton,N.E. (1999) Genetic epidemiology of single-nucleotide polymorphisms. Proc. Natl Acad. Sci. USA, 96, 15173–15177.
17 Kruglyak,L. (1999) Prospects for whole-genome linkage disequilibrium mapping of common disease genes. Nat. Genet., 22, 139–144.
18 The International SNP Map Working Group. (2001) A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature, 409, 928–933.
19 Johnson,G., Esposito,L., Barratt,B., Smith,A., Heward,J., Di Genova,G., Ueda,H., Cordell,H., Eaves,I., Dudbridge,F. et al. (2001) Haplotype tagging for the identification of common disease genes. Nat. Genet., 29, 233–237.
20 Cargill,M., Altshuler,D., Ireland,J., Sklar,P., Ardlie,K., Patil,N., Shaw,N., Lane,C., Lim,E., Kalyanaraman,N. et al. (1999) Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nat. Genet., 22, 231–238.
21 Halushka,M., Fan,J., Bentley,K., Hsie,L., Shen,N., Weder,A., Cooper,R., Lipshutz,R. and Chakravarti,A. (1999) Patterns of single-nucleotide polymorphisms in candidate genes for blood-pressure homeostasis. Nat. Genet., 22, 239–247.
22 Cambien,F., Poirier,O., Nicaud,V., Herrmann,S., Mallet,C., Ricard,S., Behague,I., Hallet,H., Blanc,H., Loukaci,V. et al. (1999) Sequence diversity in 36 candidate genes for cardiovascular disorders. Am. J. Hum. Genet., 65, 183–191.
23 Nickerson,D.A., Taylor,S.L., Weiss,K.M., Clark,A.G., Hutchinson,R.G., Stengard,J., Salomaa,V., Vartiainen,E., Boerwinkle,E. and Sing,C.F. (1998) DNA sequence diversity in a 9.7-kb region of the human lipoprotein lipase gene. Nat. Genet., 19, 233–240.
24 Fullerton,S., Clark,A., Weiss,K., Nickerson,D., Taylor,S., Stengard,J., Salomaa,V., Vartiainen,E., Perola,M., Boerwinkle,E. and Sing,C. (2000) Apolipoprotein E variation at the sequence haplotype level: implications for the origin and maintenance of a major human polymorphism. Am. J. Hum. Genet., 67, 881–900.
25 Ohnishi,Y., Tanaka,T., Yamada,R., Suematsu,K., Minami,M., Fujii,K., Hoki,N., Kodama,K., Nagata,S., Hayashi,T. et al. (2000) Identification of 187 single nucleotide polymorphisms (SNPs) among 41 candidate genes for ischemic heart disease in the Japanese population. Hum. Genet., 106, 288–292.
26 Rieder,M.J., Taylor,S.L., Clark,A.G. and Nickerson,D.A. (1999) Sequence variation in the human angiotensin converting enzyme. Nat. Genet., 22, 59–62.
27 Bonnen,P., Story,M., Ashorn,C., Buchholz,T., Weil,M. and Nelson,D. (2000) Haplotypes at ATM identify coding-sequence variation and indicate a region of extensive linkage disequilibrium. Am. J. Hum. Genet., 67, 1437–1451.
28 Kidd,J.R., Pakstis,A.J., Zhao,H., Lu,R.B., Okonofua,F.E., Odunsi,A., Grigorenko,E., Tamir,B.B., Friedlaender,J., Schulz,L.O. et al. (2000) Haplotypes and linkage disequilibrium at the Phenylalanine Hydroxylase locus, PAH, in a global representation of populations. Am. J. Hum. Genet., 66, 1882–1899.
29 Koch,H., McClay,J., Loh,E., Higuchi,S., Zhao,J., Sham,P., Ball,D. and Craig,I. (2000) Allele association studies with SSR and SNP markers at known physical distances within a 1 Mb region embracing the ALDH2 locus in the Japanese, demonstrates linkage disequilibrium extending up to 400 kb. Hum. Mol. Genet., 9, 2993–2999.
30 Clark,A., Weiss,K., Nickerson,D., Taylor,S., Buchanan,A., Stengard,J., Salomaa,V., Vartiainen,E., Perola,M., Boerwinkle,E. and Sing,C. (1998) Haplotype structure and population genetic inferences from nucleotide-sequence variation in human lipoprotein lipase. Am. J. Hum. Genet., 63, 595–612.
31 Stephens,J., Schneider,J., Tanguay,D., Choi,J., Acharya,T., Stanley,S., Jiang,R., Messer,C., Chew,A., Han,J. et al. (2001) Haplotype variation and linkage disequilibrium in 313 human genes. Science, 293, 489–493.
32 Parra,H., Arveiler,D., Evans,A., Cambou,J., Amouyel,P., Bingham,A., McMaster,D., Schaffer,P., Douste-Blazy,P., Luc,G. et al. (1992) A case-control study of lipoprotein particles in two populations at contrasting risk for coronary heart disease. The ECTIM Study. Arterioscler. Thromb., 12, 701–707.
33 Poirier,O., Georges,J., Ricard,S., Arveiler,D., Ruidavets,J., Luc,G., Evans,A., Cambien,F. and Tiret,L. (1998) New polymorphisms of the angiotensin II type 1 receptor gene and their associations with myocardial infarction and blood pressure: the ECTIM Study. J. Hypertens., 16, 1443–1447.
34 International Human Genome Sequencing Consortium. (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860–921.
35 Henikoff,S. and Henikoff,J. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA, 89, 10915–10919.
36 Devlin,B. and Risch,N. (1995) A comparison of linkage disequilibrium measures for fine-scale mapping. Genomics, 29, 311–322.
37 Lewontin,R. (1988) On measures of gametic disequilibrium. Genetics, 120, 849–852.
38 Thompson,E., Deeb,S., Walker,D. and Motulsky,A. (1988) The detection of linkage disequilibrium between closely linked markers: RFLPs at the AI-CIII apolipoprotein genes. Am. J. Hum. Genet., 42, 113–124.
39 Wang,D., Fan,J.B., Siao,C.J., Berno,A., Young,P., Sapolsky,R., Ghandour,G., Perkins,N., Winchester,E., Spencer,J. et al. (1998) Large-scale identification, mapping, and genotyping of single-nucleotide polymorphisms in the human genome. Science, 280, 1077–1082.
40 Li,W., Wu,C. and Luo,C. (1985) A new method for estimating synonymous and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes. Mol. Biol. Evol., 2, 150–174.
41 Behague,I., Poirier,O., Nicaud,V., Evans,A., Arveiler,D., Luc,G., Cambou,J., Scarabin,P., Bara,L., Green,F. and Cambien,F. (1996) β fibrinogen gene polymorphisms are associated with plasma fibrinogen and coronary artery disease in patients with myocardial infarction. The ECTIM study. Circulation, 93, 440–449.
42 Muller-Myhsok,B. and Abel,L. (1997) Genetic analysis of complex diseases. Science, 275, 1328–1329.
43 Abecasis,G., Cookson,W. and Cardon,L. (2001) The power to detect linkage disequilibrium with quantitative traits in selected samples. Am. J. Hum. Genet., 68, 1463–1474.
44 Marth,G., Yeh,R., Minton,M., Donaldson,R., Li,Q., Duan,S., Davenport,R., Miller,R. and Kwok,P. (2001) Single-nucleotide polymorphisms in the public domain: how useful are they? Nat. Genet., 27, 371–372.
45 Zhu,X., McKenzie,C., Forrester,T., Nickerson,D., Broeckel,U., Schunkert,H., Doering,A., Jacob,H., Cooper,R. and Rieder,M. (2000) Localization of a small genomic region associated with elevated ACE. Am. J. Hum. Genet., 67, 1144–1153.
46 Nei,M. (1987) Molecular Evolutionary Genetics. Columbia University Press, New York, NY.
47 Tiret,L., Amouyel,P., Rakotovao,R., Cambien,F. and Ducimetière,P. (1991) Testing for association between disease and linked marker loci: a log-linear model analysis. Am. J. Hum. Genet., 48, 926–934.
48 Hill,W. and Robertson,A. (1968) Linkage disequilibrium in finite populations. Theor. Appl. Genet., 38, 226–231.
49 Schneider,S., Roessli,D. and Excoffier,L. (2000) Arlequin ver. 2.000: A software for population genetics data analysis. In Genetics and Biometry Laboratory, University of Geneva, Switzerland.
50 Risch,N. (2000) Searching for genetic determinants in the new millennium. Nature, 405, 847–856.