Human Molecular Genetics Advance Access originally published online on November 3, 2004
Human Molecular Genetics 2005 14(1):59-69; doi:10.1093/hmg/ddi006
Human Molecular Genetics, Vol. 14, No. 1 © Oxford University Press 2005; all rights reserved

Comprehensive identification and characterization of diallelic insertion–deletion polymorphisms in 330 human candidate genes

Tushar R. Bhangale1, Mark J. Rieder2, Robert J. Livingston2 and Deborah A. Nickerson1,2,*

1Department of Bioengineering and 2Department of Genome Sciences, University of Washington, Seattle, WA, USA

* To whom correspondence should be addressed at: Department of Genome Sciences, University of Washington, PO Box 357730, Seattle, WA 98195-7730, USA. Tel: +1 2066857387; Fax: +1 2062216498; Email: debnick@u.washington.edu

Received August 19, 2004; Accepted October 22, 2004


    ABSTRACT
Despite being the second most frequent type of polymorphism in the genome, diallelic insertion–deletion polymorphisms (indels) have received far less attention in the study of sequence variation. In this report, we describe an approach that can detect indels in the heterozygous state and can comprehensively identify indels in the target sequence. Using this approach, we identified 2393 indels in a set of 330 candidate genes, i.e. an average of seven indels per gene with about two indels per gene being common (minor allele frequency ≥0.1). We compared the population genetic characteristics of indels with substitutions in this data. Our data supported the findings that deletions occur more frequently in the human genome. 5'-UTR and coding regions of the genes showed a significantly lower diversity for indels compared with other regions, suggesting differences in effects of selection on indels and substitutions. Sequence diversity and pairwise linkage disequilibrium (LD) findings of the different populations were similar to earlier results and included a greater skew towards low-frequency variants and a faster rate of LD decay in the African-descent population compared with the non-African populations. Within populations, the allele frequency spectra and LD-decay profiles for indels were similar to substitutions. Overall, the findings suggest that, although the mechanisms giving rise to indels may be different from those causing substitutions, the evolutionary histories of indels and substitutions are similar, and that indels can play a valuable role in association studies and marker selection strategies.


    INTRODUCTION
Advances in high-throughput genotyping technology and the discovery of millions of single nucleotide polymorphisms are rapidly being translated into high-density linkage disequilibrium (LD) maps of the human genome (1–5). To date, substitution-type polymorphisms have been the focus of most genetic variation discovery efforts (6–8). High-density genetic maps mainly comprising substitutions are emerging as a valuable resource for identifying the genes or regions of the genome associated with common human diseases. Determining the genetic basis behind phenotypic traits will require greater knowledge of the sequence variation and of LD, the heritable association between variable sites, in the human genome. It is known that there is a great deal of variation in the extent of LD in the human genome (2,4,9,10), and in regions of low local LD, a very dense genetic map is required to map a disease locus (11). Although substitutions alone may provide adequate density, another available set of valuable markers is short diallelic insertion–deletion polymorphisms (indels). These markers form a fraction of diallelic markers in the human genome, and their inclusion could serve as a valuable resource to improve marker density within a target region if their characteristics were more fully understood in comparison to substitution polymorphisms in the human genome.

A number of previous studies have described large-scale identification of short diallelic indels by examining aligned reference sequences of individuals in the sample population for the presence of high-quality insertion or deletion type of mismatches (12–16). Most of the recent methods used for identification of indels involve PCR amplification of the target DNA followed by resequencing and base-calling to determine the corresponding sequences of the individuals in the sample. Indels are then detected as gaps in the alignment of these sequences. Thus, the identification of an indel relies on production of a gap resulting from the difference in the lengths of the two alleles in the sample sequences. However, as most minor alleles are not frequent enough to be observed in a homozygous genotype in the sample, most such comparisons involve sequences of homozygotes for the major allele and heterozygotes. Because heterozygotes have a shift in the sequences of the two alleles, they produce a complex signal on the chromatograms (see below) and the base-calling software cannot correctly determine their sequences in the region of the frame shift. As a consequence, the alignment of sequences of heterozygotes with those of homozygotes for the major allele fails to reveal a gap. Therefore, the identification of indels using the earlier approach requires the presence of individuals homozygous for both long and short alleles in the sampled sequences and fails to identify a large fraction of indels in the population. Owing to the errors in base-calling and sequence alignments, the confirmation rate of indels detected by identification of gaps in alignments decreases as the length of the indel decreases (12). This makes it impossible to reliably identify 1 bp indels which, as we show, limits the analysis of indel markers.
Thus, there has not been an attempt to study sequence variation and LD characteristics of indels identified in an unbiased and comprehensive manner. An assessment of these characteristics is important for evaluating the viability of these markers in association studies and for understanding the nature of the population genetic forces that influence the sequence variation. In this report, we show that, as for substitution polymorphisms, a fluorescence sequencing based approach can be successfully employed to detect the full spectrum of indels in the target sequences. We describe the population genetic characteristics of gene-associated diallelic indels identified using this approach and compare these to the substitution polymorphisms identified in the same genomic regions.


    RESULTS
Sequence analysis and detection of indels
We resequenced a total of 6495 kb of reference sequence for 330 human candidate genes. Approximately 40% of the genes (127) were involved in inflammation, lipid metabolism and blood pressure regulation and were resequenced in 24 individuals of African-descent (AD) and 23 individuals of European-descent (ED). The remaining genes (203) were involved in DNA repair and cell cycle pathways and were resequenced in 90 individuals from the polymorphism discovery resource (PDR) panel (17). Collectively, these genes were distributed across all autosomes and the X chromosome. The exonic, intronic, 5' flanking and 3' flanking regions of the genes accounted for approximately 7.5, 82.1, 5.9 and 4.5% of the total length, respectively.

Our approach to indel detection is based on the fact that it is possible to detect an indel polymorphism as a heterozygote, as the difference in the lengths of the two alleles gives rise to a shift in the sequence of one allele relative to the other. Fluorescence sequencing measures the relative incorporation of each nucleotide at a specific distance from the sequence primer. The signal intensities in the chromatogram are contributed by both the alleles. Therefore, the shift in the sequences of alleles of the heterozygote produces a complex signal on the chromatograms which is characterized by an abrupt drop in the quality of the trace (Fig. 1A). This drop in sequence quality for an individual compared with homozygotes of high quality is a distinctive feature of the sequences containing an indel, and aids in their detection. The complex signal produced by heterozygotes displays multiple heterozygous peaks (Fig. 1B) compared with the homozygous sequences (Fig. 1C and D). Each of these heterozygous peaks in the indel is similar to that detected for substitution polymorphisms: there is an ~50% drop in the height of the primary peak compared with that of a homozygous individual, and there is the presence of a secondary peak corresponding to the base in the alternate allele (18–20). This pattern of heterozygous peaks at mismatched bases in the two alleles extends until the end of the chromatogram. The pattern is reproducible in the traces obtained from heterozygous individuals and can serve as a reliable indicator for the presence of indels in a set of aligned traces.
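The quality-drop signature described above can be illustrated with a minimal sketch. This is not the procedure implemented in PolyPhred or Consed; the function name `find_quality_drop`, the window size, the `min_drop` threshold and the toy quality scores are all assumptions chosen for illustration. The sketch scans a read's per-base quality scores for the position at which the mean quality falls most sharply between the preceding and following windows.

```python
from statistics import mean

def find_quality_drop(quals, window=10, min_drop=20.0):
    """Return the index with the largest fall in mean quality between the
    `window` scores before it and the `window` scores from it onward.
    Return None if the largest fall is smaller than `min_drop`.
    (Hypothetical helper; thresholds are illustrative assumptions.)"""
    best_i, best_drop = None, 0.0
    for i in range(window, len(quals) - window + 1):
        drop = mean(quals[i - window:i]) - mean(quals[i:i + window])
        if drop > best_drop:
            best_i, best_drop = i, drop
    return best_i if best_drop >= min_drop else None

# Invented per-base quality scores, not data from the study.
homozygote = [40] * 60                  # uniformly high quality
heterozygote = [40] * 30 + [10] * 30    # quality collapses at the indel

print(find_quality_drop(homozygote))    # None
print(find_quality_drop(heterozygote))  # 30
```

A practical detector would also need to confirm that the low-quality region persists to the end of the read and shows paired primary and secondary peaks, as the text describes; a quality drop alone can have other causes.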



Figure 1. Characteristic appearance of indels in aligned reads and chromatograms. (A) A Consed view of aligned sequences. Individuals marked with arrows (right of the panel) were heterozygotes for GTATT-deletion. The remaining individuals were homozygotes for the long allele in this view. Following the beginning of the indel at consensus position 2709, there is a drop in the quality of reads in heterozygotes, and the quality drops to a point where these become marked as unscanned regions (yellow highlighting). Examples of individual sequence traces for: (B) a heterozygote composed of the long and short alleles, which reveals the characteristic frame shift, (C) a homozygote for the long allele and (D) a homozygote for the short allele.

 
Previous methods used to detect indels require homozygous sequences from both the alleles. In our dataset, we find that only 34.5% of indels showed the presence of homozygotes for both the alternate alleles. In the remaining cases, the homozygotes for the minor alleles were missing in the sample populations. Similarly, 32.68% of substitution polymorphisms discovered in the same region displayed homozygotes for both the alternate alleles. These data demonstrate that the previous methods of indel detection could fail to identify the majority of indels in the regions surveyed.

Distribution of lengths of indel alleles
Using the detection methods described earlier, we identified a total of 33 829 substitutions and 2393 indels in 330 genes. In the set of inflammation, lipid metabolism and blood pressure regulation genes, 12 078 substitutions and 799 indel polymorphisms were identified; and in the DNA repair and cell cycle genes, 21 751 substitutions and 1594 indel polymorphisms were identified. Overall, ~6.6% of the discovered polymorphic sites were indels. Diallelic indels were found once every ~2714 bp, whereas the frequency of substitution polymorphisms was once every ~192 bp.
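The summary figures in this paragraph follow directly from the reported counts, as the short check below shows. It uses only numbers stated in the text (6495 kb of sequence, 33 829 substitutions, 2393 indels); the variable names are ours.

```python
total_bp = 6_495_000          # 6495 kb of resequenced reference sequence
substitutions = 12_078 + 21_751
indels = 799 + 1_594

# The two gene sets add up to the reported totals.
assert substitutions == 33_829
assert indels == 2_393

indel_fraction = indels / (indels + substitutions)
bp_per_indel = total_bp / indels
bp_per_substitution = total_bp / substitutions

print(f"{indel_fraction:.1%}")        # 6.6%
print(round(bp_per_indel))            # 2714
print(round(bp_per_substitution))     # 192
```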

The range of indel sizes was limited to that which can be captured by PCR amplification and was found to be 1–543 bp with a median of 2 bp. The majority of the discovered indels were short, i.e. ~84% were <5 bp in length. Single base-pair indels were the most common form of diallelic indels and accounted for ~46% of all detected indels. Similar to previous findings (12), the frequencies of 2, 3 and 4 bp indels were approximately equal and after 4 bp indels, the frequencies decreased as the length of the indels increased (Table 1).


Table 1. Distributions of the lengths of indels classified as deletions and insertions based on the ancestral allele
 
A number of previous studies have demonstrated that most human diallelic polymorphisms, including indels, arose after the divergence of the common ancestors of humans and apes, and that nearly all indels are monomorphic in chimpanzees and gorillas (12,21). Therefore, in most cases, the chimpanzee allele represents the ancestral state. To define the ancestral state of the newly identified indels, chimpanzee sequences were aligned with the human consensus sequences. In 4.7% of indels, the flanking sequence around the allele could not be reliably aligned with the chimpanzee reads. The remaining 95.3% of indels were classified into two groups: as an insertion when the chimpanzee allele matched the short human allele and as a deletion when the chimpanzee allele matched the long human allele. Table 1 shows the distributions of allele lengths for the two groups. Shorter insertions showed distributions similar to those of deletions, whereas longer insertions were relatively rare. The ratio of deletion/insertion was ~2.3 : 1.
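The classification rule stated above can be written as a small function. This is a sketch of the rule only, assuming the chimpanzee allele at the aligned site is already known; the function name `classify_indel` and the example alleles are hypothetical and do not come from the dataset.

```python
def classify_indel(long_allele, short_allele, chimp_allele):
    """Classify a diallelic indel relative to the inferred ancestral state.

    insertion: the chimpanzee allele matches the short human allele
    deletion:  the chimpanzee allele matches the long human allele
    Anything else (for example, an unalignable site) is left unclassified.
    """
    if chimp_allele == short_allele:
        return "insertion"
    if chimp_allele == long_allele:
        return "deletion"
    return "unclassified"

# Invented alleles for illustration ("" denotes the short, gapped allele).
print(classify_indel("AT", "", "AT"))   # deletion
print(classify_indel("AT", "", ""))     # insertion
print(classify_indel("AT", "", None))   # unclassified
```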

Allele frequency spectrum
The allele frequency spectrum for substitutions is well documented (22). However, the description of allele frequency spectrum for an unbiased dataset of indels is not available. It is important to describe the allele frequency spectrum for all markers because it provides valuable clues to population demography (22) and the influences of natural selection (23). It also enables the calculation of detection rates for markers with a given population minor allele frequency (MAF) and thus allows calculation of sample sizes required to ascertain polymorphic markers in the population (24). In this dataset, for all three samples (AD, ED and PDR), ~30% of indels were common indels (MAF at least 0.1). We compared the MAF distributions of indels and substitutions. Within each of the AD and ED populations and the PDR panel (Fig. 2A, B and C, respectively), the allele frequency distributions for indels were similar to those for substitutions. When compared with ED, the AD population showed a greater skew towards low-frequency variants for both types of markers, as expected. As the PDR panel is a mixture of populations, it showed a much greater bias towards low-frequency variants. Similar to the PDR panel, a mixed population constructed from the AD and ED populations showed a strong bias towards low-frequency variants (Fig. 2D). A comparison of the ancestral allele frequency spectra for indels in the AD and ED populations revealed a greater skew towards higher frequency of ancestral allele in the AD population (see Materials and Methods; Supplementary Material, Fig. S1).
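To make the link between MAF, sample size and detection rate concrete, the sketch below computes a sample MAF from genotype counts and the probability of observing the minor allele at least once. The detection formula, 1 − (1 − p)^(2n) for n diploid individuals, is a standard simplification that assumes the 2n chromosomes are sampled independently; it is not necessarily the calculation used in reference 24, and the function names and example counts are ours.

```python
def minor_allele_frequency(n_hom_ref, n_het, n_hom_alt):
    """Sample MAF from diploid genotype counts at a diallelic site."""
    chromosomes = 2 * (n_hom_ref + n_het + n_hom_alt)
    alt_freq = (n_het + 2 * n_hom_alt) / chromosomes
    return min(alt_freq, 1 - alt_freq)

def detection_probability(maf, n_individuals):
    """Probability that the minor allele appears at least once among
    2 * n_individuals chromosomes, assuming independent sampling."""
    return 1 - (1 - maf) ** (2 * n_individuals)

# Invented genotype counts for a 24-individual sample.
print(round(minor_allele_frequency(20, 3, 1), 3))

# Chance of seeing a variant at least once in 24 individuals
# (the size of the AD sample) for a common and a rare allele.
for maf in (0.1, 0.01):
    print(maf, round(detection_probability(maf, 24), 3))
```

Under this simplification a sample of 24 individuals almost always captures a variant with MAF 0.1 but misses most variants with MAF 0.01, which is why the observed spectrum depends on sample size.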



Figure 2. Histograms of minor allele frequencies for indels and substitutions in: (A) the AD population, (B) the ED population, (C) the PDR panel and (D) a combined population constructed from the AD and ED populations.

 
In order to evaluate the relative contributions of indels and substitutions to the genetic variation and to compare their extents in different populations, we computed the two summary statistical measures θ and π using the two types of markers (Table 2). Under the standard neutral model (random-mating population of constant size with neutral mutations occurring according to infinite sites model) (25), the means of the two statistics are equal and thus can provide clues about population demographic history and influences of selection. The AD sample showed significantly higher values of θ and π for indel markers compared with the ED sample (ANOVA P-values: 0.001 and 0.002, respectively). As expected for a structured population, for the PDR panel, averages of both the estimators showed values intermediate to those of the AD and ED samples. Values of the two estimators computed using only substitutions showed similar trends (Table 2). In this regard, the magnitude of contribution to overall sequence diversity due to indels was much smaller compared with substitutions, whereas when compared between the populations, indels showed characteristics similar to those of substitutions.
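For readers unfamiliar with the two estimators, the following sketch gives their textbook per-site forms: Watterson's θ is the number of segregating sites divided by the harmonic number a₁ and the sequence length, and π is the average proportion of pairwise differences. The function names and the toy inputs are assumptions for illustration; the exact computation used for Table 2 is described in Materials and Methods and may differ in detail.

```python
def watterson_theta(num_segregating, num_chromosomes, seq_length):
    """Watterson's estimator per site: S / (a1 * L),
    where a1 = sum of 1/i for i = 1 .. n-1."""
    a1 = sum(1.0 / i for i in range(1, num_chromosomes))
    return num_segregating / (a1 * seq_length)

def nucleotide_diversity(minor_counts, num_chromosomes, seq_length):
    """Average pairwise differences per site (pi).

    minor_counts holds, for each segregating site, the number of
    chromosomes carrying the minor allele."""
    n = num_chromosomes
    pairwise = sum(2.0 * k * (n - k) / (n * (n - 1)) for k in minor_counts)
    return pairwise / seq_length

# Toy example: 4 chromosomes, 10 sites, two segregating sites at which
# the minor allele is carried by 1 and by 2 chromosomes.
theta = watterson_theta(2, 4, 10)
pi = nucleotide_diversity([1, 2], 4, 10)
print(round(theta, 4), round(pi, 4))
```

An excess of rare alleles inflates the count of segregating sites more than the pairwise differences, so it raises θ relative to π.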


Table 2. Means and standard deviations (in parentheses) of values of three summary statistics of sequence diversity: θ (×10⁻⁴), Watterson's estimator; π (×10⁻⁴), Tajima's estimator and Tajima's D computed using different types of markers
 
The influence on the gene function due to an indel type of mutation is expected to be greater than that of a substitution as they give rise to a more severe alteration in the sequence (26). As a consequence of the resultant selective pressure, different functional regions of genes may show differences in their sequence diversity characteristics for indels and substitutions. To evaluate these effects, we computed values of the two estimators (θ and π) separately for these regions (Supplementary Material, Table S1). As expected, coding regions showed lower values of the two measures than non-coding regions for both indels and substitutions. Ratios of sequence diversity (π) of coding to non-coding regions were found to be significantly lower for indels compared with substitutions in both AD and ED populations as well as the PDR panel (P-values using paired t-tests <10⁻⁵). Diversity values of the 5'-UTR regions were found to be significantly lower than those of 3'-UTR regions for indels in the AD (paired t-test P-value 0.03) and ED (P-value 0.004) samples. However, diversity values for the UTR regions computed using only substitutions did not show such significantly lower estimates for the 5'-UTRs (paired t-test P-values: AD, 0.828; ED, 0.599).

A substantial proportion of the sequenced region (~33%) was identified as comprising interspersed repetitive elements (mobile elements such as Alu and LINE) using the program RepeatMasker (http://www.repeatmasker.org). Repetitive sequences exhibit different evolutionary dynamics compared with the unique regions of the genome. Some subfamilies of the repetitive elements were incorporated into the genome more recently than others and tend to show different levels of diversity. For example, the Ta-1 subfamily of the LINE-1 repetitive element is younger than the Ta-0 subfamily and has replaced it as the replicatively dominant subfamily (27); the Alu Sx family is one of the older Alu subfamilies, whereas Ya5 and Yb8 are new Alu subfamilies (28). We examined the diversity characteristics of these regions and compared them with those of the unique regions. Diversity values for the repetitive regions computed using indels were found to be significantly lower than those for the non-repeat sequences (P-value using paired t-tests for pooled data 0.029 for the AD and ED populations and the PDR panel), whereas an opposite effect was observed for diversity values computed using substitutions (P-value 0.033).

Information from diversity estimates based on allele frequencies can be used to test for natural selection and population expansion. We calculated Tajima's D, a statistic that summarizes information about allele frequency spectrum by comparing the estimates of π and θ, using indels and substitutions separately. Within the populations (AD, ED and PDR), values of Tajima's D were similar for indels and substitutions (Table 2). Similar to previous reports on genetic diversity, the AD population, as opposed to the ED population, displayed negative overall average Tajima's D consistent with an excess of low-frequency variants in the population, suggesting a recent expansion of the AD population leading to increased frequency of rare sites (8). Negative values were also observed for the PDR panel, possibly due to an excess of low-frequency variants resulting from the mixed sample composition of this population.
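As an illustration of how Tajima's D compares π with θ, the sketch below implements the standard published form of the statistic: the difference between the mean number of pairwise differences and S/a₁, divided by an estimate of its standard deviation. The function name and the example inputs are illustrative assumptions, not values from Table 2; both quantities are totals over the locus rather than per-site values.

```python
from math import sqrt

def tajimas_d(pi_total, num_segregating, n):
    """Tajima's D from the mean number of pairwise differences over the
    locus (pi_total), the number of segregating sites and the number of
    chromosomes n. Requires n >= 4 and at least one segregating site."""
    s = num_segregating
    if n < 4 or s < 1:
        raise ValueError("need n >= 4 and at least one segregating site")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / (i * i) for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / (a1 * a1)
    e1 = c1 / a1
    e2 = c2 / (a1 * a1 + a2)
    return (pi_total - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))

# Illustrative inputs: 10 chromosomes, 16 segregating sites and a mean of
# 3.888889 pairwise differences. pi is smaller than S / a1, so D < 0.
print(round(tajimas_d(3.888889, 16, 10), 2))  # about -1.45
```

A negative value, as in this example, indicates more low-frequency variants than expected under the standard neutral model, which is the pattern reported here for the AD sample and the PDR panel.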

Linkage disequilibrium
To study characteristics of pairwise association between diallelic indels and substitution polymorphisms in the sequenced regions and compare them with each other, we examined two statistical measures of association, |D'| and r². While values of |D'| close to 1 suggest little or no evidence for recombination and |D'| significantly less than 1 implies historical recombination between the markers in the pair, r² measures the statistical correlation between alleles at two markers and is inversely related to the sample size required to detect phenotypic association at one of the markers in the pair when the other is directly involved in causation of the phenotype (11,29). We divided the marker pairs into two sets: (i) pairs where one or both of the markers were indels and (ii) pairs that had no indels. As rare alleles are younger, we expect fewer historical recombination events between them, and consequently they tend to display stronger long-range LD (30,31). Therefore, in this analysis, we only considered high-frequency markers (MAF ≥0.2). Figure 3 shows the profiles of LD decay with distance for the two types of marker pairs. LD decay profiles of pairs with indels were similar to those of substitutions in all three samples. The ED sample showed a stronger overall average LD (average |D'| values: AD, 0.542 for indels and 0.551 for substitutions; ED, 0.693 for indels and 0.678 for substitutions) and a slower decay in LD compared with the AD sample (Fig. 3A and B). The PDR panel showed a high overall average (average |D'| values: 0.825 for indels and 0.838 for substitutions) and a slower decay in LD with distance (Fig. 3C) compared with AD and ED populations. This was expected for the PDR panel owing to the population structure, which can generate artifactual LD between unlinked markers (32,33).
Average |D'| values for pairs with indels separated by ≤1 kb were: AD, 0.834; ED, 0.905 and PDR, 0.943, and those separated by >1 kb were: AD, 0.525; ED, 0.677 and PDR, 0.817. Values for marker pairs with only substitutions were similar: average |D'| for pairs separated by ≤1 kb were: AD, 0.872; ED, 0.911 and PDR, 0.948, and those separated by >1 kb were: AD, 0.529; ED, 0.659 and PDR, 0.829. Average r² values also showed similar trends. r² values above 1/3 can sometimes be regarded as an indication of sufficiently strong LD to be useful for mapping studies (34). The approximate ranges of distances up to which indels as well as substitutions displayed ‘useful’ LD were: ED, 9–12 kb; AD, 4–6 kb and PDR, 50–70 kb. In all three populations, indels and substitutions showed similar profiles of decay in the fraction of pairs in complete LD (|D'|=1) with each other (Supplementary Material, Fig. S2).
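The two LD measures can be computed from two-locus haplotype counts using their standard definitions, as sketched below. The sketch assumes phased haplotype counts are available; with unphased diploid genotypes, haplotype frequencies must first be estimated, a step not shown here. The function name `pairwise_ld` and the example counts are invented for illustration.

```python
def pairwise_ld(n_AB, n_Ab, n_aB, n_ab):
    """Return (|D'|, r^2) from counts of the four two-locus haplotypes."""
    total = n_AB + n_Ab + n_aB + n_ab
    p_A = (n_AB + n_Ab) / total
    p_B = (n_AB + n_aB) / total
    if min(p_A, 1 - p_A, p_B, 1 - p_B) == 0:
        raise ValueError("both markers must be polymorphic")
    d = n_AB / total - p_A * p_B            # raw disequilibrium coefficient
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max
    r2 = d * d / (p_A * (1 - p_A) * p_B * (1 - p_B))
    return d_prime, r2

# One haplotype is absent, so |D'| is 1 although r^2 is well below 1.
d_prime, r2 = pairwise_ld(50, 0, 20, 30)
print(round(d_prime, 3), round(r2, 3))
print(r2 > 1 / 3)   # compared against the 'useful LD' threshold
```

The example shows why both measures are reported: |D'| reaches 1 whenever at most three of the four haplotypes are observed, whereas r² also depends on how well the allele frequencies at the two markers match.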



Figure 3. Comparison of pairwise LD characteristics of indels and substitutions. Plots show values of |D'| and r² as functions of distance between the marker pairs (each marker with a minor allele frequency ≥0.2) in: (A) the AD population, (B) the ED population and (C) the PDR panel. Plotted values are averages of |D'| and r² in sliding windows of successive bins of 200 marker pairs with a 150 marker pair overlap.
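The smoothing described in the legend of Figure 3 can be sketched as follows. We assume, since the legend does not say so explicitly, that marker pairs are ordered by inter-marker distance before binning; the function name and the synthetic input are ours. A window of 200 pairs with a 150-pair overlap corresponds to advancing the window by 50 pairs at a time.

```python
def sliding_window_means(pairs, window=200, overlap=150):
    """Average distance and LD value over overlapping windows.

    pairs: iterable of (distance_bp, ld_value) tuples, one per marker pair.
    Returns a list of (mean_distance, mean_ld), one entry per full window.
    """
    step = window - overlap
    ordered = sorted(pairs)                 # order pairs by distance
    averages = []
    for start in range(0, len(ordered) - window + 1, step):
        chunk = ordered[start:start + window]
        mean_distance = sum(d for d, _ in chunk) / window
        mean_ld = sum(v for _, v in chunk) / window
        averages.append((mean_distance, mean_ld))
    return averages

# Synthetic input: 300 pairs at distances 0..299 bp, all with LD value 1.0.
points = sliding_window_means([(d, 1.0) for d in range(300)])
print(len(points))   # 3 windows, starting at pairs 0, 50 and 100
print(points[0])     # (99.5, 1.0)
```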

 

    DISCUSSION
Large-scale studies involving polymorphism discovery (8,30,35,36) routinely employ programs such as PolyPhred (18) that analyze trace data to detect heterozygous substitution polymorphisms. We demonstrate that heterozygotes for short diallelic indels can also be identified reliably using the trace data. Our results indicate that identification of heterozygous individuals is crucial to identifying the full spectrum of polymorphic sites in the target sequence and approximately two-thirds of sites (substitutions or indels) will be missed if the detection method does not offer an ability to detect polymorphisms as heterozygotes.

The set of genes used in this study has a genome-wide representation. Extrapolating from this dataset, one in approximately every 15 diallelic polymorphisms (MAF ≥0.01) in the intragenic regions of the human genome is an indel. This estimate is lower than previous reports (12,37) that used polymorphisms identified in the overlap of clones. The difference may be attributed to the stronger negative selection on indels (see below) compared with substitutions in the intragenic regions owing to their deleterious effect (26,38). The extragenic regions of the genome may, therefore, have higher indel density than intragenic regions. Thus, indels constitute a considerable fraction of the diallelic polymorphisms in humans. Inclusion of indel markers could improve the resolution of genetic maps and reveal a more detailed picture of the sequence variation for fine-mapping studies, in addition to improving the accuracy of estimates of recombination rates and statistically inferred haplotypes. All these factors play a vital role in the design of disease association studies (10,11).

Available procedures for indel detection rely on sequence alignment methods. Therefore, they require the presence of homozygotes for both of the alternate forms of alleles and give low confirmation rates for short indels. The method described here takes advantage of the characteristic pattern of peaks observed in heterozygous individuals and does not require the presence of homozygotes for the minor alleles in the surveyed sample. The length of the indel has no influence on this pattern of peaks in a heterozygote. Therefore, the method is expected to yield equal confirmation rates for indels of all lengths (up to a maximum length of the PCR product), making it possible to reliably identify short indels. These factors make the method more sensitive compared with the currently used methods (12,13,15). The detection procedure is amenable to automation and can be combined with programs such as PolyPhred that aim at automated analysis of chromatograms. A preliminary version of a computational method to automate the procedure of indel detection has been incorporated in PolyPhred (version 4.05). Once the indels are identified in the sample population, accurate and cost-effective large-scale genotyping in a large number of individuals can be carried out using a number of genotyping methods (39,40). Similar to previous findings (12,41), the spectrum of indel lengths was dominated by short indels (1–4 bp). Our results also showed that the insertions (especially longer insertions) are rare when compared with deletions (Table 1). This is expected because there is a thermodynamic asymmetry in the replication slippage mechanisms responsible for insertions and deletions, where insertions require melting of an already replicated DNA segment (26). As a consequence, mutation rates of insertions (especially longer insertions) are expected to be relatively lower compared with those of deletions.
The higher mutation rate of deletions compared with insertions is reflected in the deletion/insertion ratios. The ratio of deletion/insertion of 2.3 : 1 was close to that reported by The Human Gene Mutation Database (2.5 : 1) for disease-associated insertion–deletion mutations. Higher rates of deletion compared with insertions in short indels are consistent with the rates estimated in various organisms (42–44) and support the view that genome loss through small indels is one of the important mechanisms through which the genome sizes evolve (26). Long indels (length ≥50 bp) formed a very small fraction (<0.6%) of the indels discovered in this study, suggesting that the human genome contains a relatively small number of such markers. A recent study has demonstrated that very large indels (>100 kb), also known as large-scale copy number polymorphisms, which cannot be identified by PCR-based methods, also contribute substantially to the human genomic variation and may play a role in disease susceptibility due to their large size, gene content and instability of the corresponding genomic regions (45).

Similar to substitutions, θ and π values computed using indels showed significantly higher values for the AD compared with the ED population. A similar pattern of sequence variation has been demonstrated in earlier studies and is often attributed to the non-African populations having undergone a severe bottleneck in the recent past (8,46). The AD population is thought to have undergone a recent expansion (22,47). In addition, the AD population is an admixed population (i.e. African-Americans). These demographic differences between the AD and the ED populations explain the greater skew towards low-frequency values (Fig. 2A and B) and strong biases towards negative values of Tajima's D for the AD samples compared with ED samples. The greater skew towards high ancestral allele frequencies in the AD population compared with the ED population (Supplementary Material, Fig. S1) is similar to the patterns observed by Weber et al. (12) using indels and Watkins et al. (48) using Alu insertion and restriction site polymorphisms, and is consistent with the view that the modern human populations originated in Africa and that the ancestral alleles were preserved within Africa. Similar to the population mixture constructed from the AD and ED populations, the PDR panel, being a mixture of populations, showed a very strong bias towards low-frequency variants in the allele frequency spectra for indels as well as substitutions (Fig. 2C and D). For both the populations, this bias was also reflected in the summary statistics of diversity, as they showed high values of θ but low values of π and consequently negative values of Tajima's D (Table 2). Recent population expansion has been suggested as the cause of negative values of Tajima's D in a large-scale survey of 313 genes (8). Our results indicate that the population admixture can also explain this trend.

Indels showed a significantly lower ratio of coding to non-coding diversity compared with substitutions due to the selective pressures resulting from the more pronounced deleterious effects of indels in the coding regions compared with those of substitutions. Moreover, comparisons between 5'- and 3'-UTR regions showed that the 5'-UTRs had significantly lower indel diversity, whereas no such effect was observed for substitutions. These findings suggest that indels give rise to more severe functional alterations in the promoter sequences and undergo a stronger negative selective pressure than substitutions. This is likely due to the degenerate nature of consensus transcription factor binding sequences, implying that substitutions alter the binding affinity less than inserting or deleting tracts of promoter sequence. Thus, it would be fruitful to examine indels disrupting transcription factor binding sites to explore functionally relevant polymorphisms in association studies. Indeed, recent reports have indicated that indel polymorphisms in the promoter regions can give rise to potentially pathological alterations in transcriptional activity (49,50). Gene conversion is known to be one of the major mechanisms giving rise to the sequence diversity within the repetitive elements and the rates of gene conversion may be inversely related to the extent of mismatches between the involved sequences (28,51). In this dataset, repetitive elements showed a greater diversity for substitutions compared with the unique regions, whereas indels showed an opposite tendency. These findings seem to suggest that the presence of the insertion–deletion type of mismatches may suppress gene-conversion events between repetitive elements.

Efficient disease mapping by LD-based approaches requires the availability of markers with appropriate MAF (52–54). The power to detect an association between the disease and the marker, and the accuracy of resolving the map location of the disease locus, depend on the marker allele distribution in the population (55). Within the populations, profiles of decay of pairwise LD for pairs with indels were similar to those for substitutions (Fig. 3). Similarities between the LD characteristics and frequency spectra of indels and substitutions underscore the utility of indel markers in LD-mapping studies. Comparison of the decay profiles of the AD and ED samples supported the extensive earlier evidence that LD decays faster in the AD population than in the ED population (9,30,36,56). The bottleneck in the recent history of non-African populations is often considered as an explanation for the high levels of LD observed in the ED population (57,58). LD for the PDR panel showed a much slower decay compared with the AD and ED populations. This pattern of LD in the PDR panel can be attributed to several factors. First, the PDR panel is a stratified population sample, and in such populations even unlinked markers may exhibit strong LD (11,33). Second, the PDR panel is dominated by non-African populations, and non-African populations such as Native Americans are expected to show higher levels of LD (59). Third, the genes analyzed for the PDR panel are involved in maintaining genomic integrity, and loss-of-function mutations in these genes are known to be associated with severe diseases (60). These genes are therefore more likely to display effects of selection. Selection against deleterious variations can inflate the observed levels of LD as the deleterious haplotypes are swept from the population (61). The range of distances over which LD levels were found useful for association mapping, as defined by Ardlie et al. (34), was in agreement with the previous estimates. On the basis of several previous studies, Ardlie et al. (34) proposed a range of 10–30 kb for the ED population, whereas other findings suggest that the range is much smaller for African populations (56). In our dataset, the ranges were 9–12 kb for ED and 4–6 kb for AD.

Detection of heterozygous samples is indispensable to the identification of new polymorphic sites in a population because most alleles are not frequent enough to be observed as homozygotes. Direct sequencing-based typing provides an approach to reliably identify heterozygous indels. On the basis of this survey of an unbiased dataset, short diallelic indels form an appreciable fraction of the diallelic markers in the genome and can be utilized to improve the marker density of genome-wide maps. Within a population, the genetic variation due to indels and the LD characteristics of indels show trends similar to those of substitutions. These findings suggest that indels are stable markers and that, along with substitution markers, they can play a valuable role in association studies and marker selection strategies.


    MATERIALS AND METHODS
 
Sequencing and identification of substitutions
Diallelic indels and substitutions used in this analysis were identified in the variation discovery efforts of two projects: (i) the University of Washington–Fred Hutchinson Cancer Research Center variation discovery effort funded by the National Heart Lung and Blood Institute's Program in Genomic Applications (PGA) and (ii) the National Institute of Environmental Health Sciences' Environmental Genome Project (EGP). The PGA project is aimed at studying genes involved in inflammatory processes, whereas the EGP project is aimed at examining genes implicated in DNA repair and cell cycle pathways. A list of the 127 genes from the PGA project and 203 genes from the EGP project analyzed in this work is provided in Supplementary Material, Tables S2 and S3, respectively. The DNA variation data were deposited into the GenBank and dbSNP databases. Genotypes are available through dbSNP (http://www.ncbi.nlm.nih.gov/SNP) or at http://pga.gs.washington.edu and http://egp.gs.washington.edu. DNA samples for the two projects were obtained from Coriell Cell Repository (http://locus.umdnj.edu/ccr). PGA genes were sequenced across two populations: 24 individuals of AD selected from the African-American Human Variation Panel (HD50AA: individuals NA17101–NA17116 and NA17133–NA17140) and 23 individuals of ED from Centre d'Etude du Polymorphisme Humain (CEPH) reference panel DNAs (Coriell Cell Repository numbers: NA06990, NA07019, NA07348, NA07349, NA10830, NA10831, NA10842–NA10845, NA10848, NA10850–NA10854, NA10857, NA10858, NA10860, NA10861, NA12547, NA12548 and NA12560). The EGP genes were sequenced across 90 individuals representing the US population (European-American, African-American, Mexican-American, Native-American and Asian-American) from the PDR panel (17). The expected detection rates for these sample sizes are 99% for sites with MAF >5% and 87% for sites with MAF >1% (24).
For each gene, the genomic region spanning the longest reference transcript in LocusLink (http://www.ncbi.nlm.nih.gov/LocusLink), including exons and introns, ~2.5 kb upstream of the gene and ~1.5 kb downstream of the gene, was sequenced. A complete description of the resequencing protocol is available at the respective project websites (see above). In brief, overlapping PCR primers were designed to cover the target region with an average amplicon size of ~980 bp and an average overlap between amplicons of ~190 bp. The PCR products were sequenced using dye primer and dye terminator chemistry on ABI 3700 and ABI 3730 instruments. Trained analysts assembled the sequence data and mapped them onto the reference genomic sequence using Phred (62) and Phrap (http://www.phrap.org) and viewed the assemblies using the Consed program (63). Substitutions in the data were identified using PolyPhred (18) version 4.0 by comparisons of chromatograms trimmed to remove noisy regions, so that the average Phred quality of the examined region was greater than 40. The substitutions discovered by PolyPhred were reviewed by analysts to rule out false positives resulting from biochemical artifacts, noisy trace data and effects of surrounding sequence. The error rate for genotyping by sequence analysis is well below 1% (64, unpublished data).
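The amplicon layout described above can be illustrated with a minimal sketch. It assumes a fixed amplicon size and a fixed overlap, whereas the figures quoted in the text are averages, and actual primer placement depends on the local sequence. The function name `tile_amplicons` and its parameters are hypothetical and are not part of the project pipelines.

```python
def tile_amplicons(region_length, amplicon_size=980, overlap=190):
    """Return (start, end) coordinates of overlapping amplicons covering a region.

    Simplifying assumption: every amplicon has the same size and overlap.
    Coordinates are 0-based, end-exclusive, relative to the target region.
    """
    if overlap >= amplicon_size:
        raise ValueError("overlap must be smaller than amplicon_size")
    step = amplicon_size - overlap  # distance between consecutive amplicon starts
    amplicons = []
    start = 0
    while True:
        end = min(start + amplicon_size, region_length)
        amplicons.append((start, end))
        if end >= region_length:  # the region is fully covered
            break
        start += step
    return amplicons


if __name__ == "__main__":
    # Hypothetical 5 kb target region
    for start, end in tile_amplicons(5000):
        print(start, end)
```

Under these assumptions each additional amplicon extends coverage by the amplicon size minus the overlap, which is why the number of amplicons grows roughly linearly with the length of the target region.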

Identification of indels
The sequence data generated by the above procedure were analyzed for the presence of diallelic indels. The aligned traces were scanned for the characteristic pattern produced by heterozygous indels. The mismatched bases in the two alleles of a heterozygote produce a complex signal on the trace, which can be readily distinguished from that of homozygotes by the presence of signals corresponding to both alleles and an abrupt drop in the quality of the read. Indels in the sequence were identified by the presence of multiple heterozygous sites. Once an indel was identified, its length was inferred from the pattern of peaks by performing a pairwise alignment of the bases corresponding to the two allelic sequences.
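The length inference described above can be sketched on idealized data. In a heterozygote for an indel of length k, each trace position downstream of the breakpoint shows the base of one allele superimposed on the base lying k positions further along in the other allele. The sketch below assumes noise-free base calls represented as sets of bases and simply searches for the smallest shift that explains every mixed position. The function names are hypothetical; real traces require quality filtering and analyst review, as in the procedure above.

```python
def simulate_het_deletion(reference, pos, k):
    """Idealized base calls for a heterozygote carrying the reference allele
    and an allele lacking k bases starting at index pos."""
    deleted = reference[:pos] + reference[pos + k:]
    # Each position shows the bases of both alleles (one base if they agree).
    return [{reference[i], deleted[i]} for i in range(len(deleted))]


def first_mixed_position(observed):
    """Index of the first position with two superimposed bases, or None."""
    for i, bases in enumerate(observed):
        if len(bases) == 2:
            return i
    return None


def infer_indel_length(reference, observed, max_len=30):
    """Smallest shift k such that every position from the first mixed site
    onwards shows exactly the bases reference[i] and reference[i + k]."""
    start = first_mixed_position(observed)
    if start is None:
        return None  # no heterozygous signal
    for k in range(1, max_len + 1):
        end = min(len(observed), len(reference) - k)
        if end <= start:
            break
        if all(observed[i] == {reference[i], reference[i + k]}
               for i in range(start, end)):
            return k
    return None


if __name__ == "__main__":
    ref = "ACGTTGCAAGCTTAGGCTAACGGATCCTAGTCA"  # made-up sequence
    calls = simulate_het_deletion(ref, pos=10, k=3)
    print(infer_indel_length(ref, calls))
```

In repetitive sequence more than one shift may fit the observed pattern, so the smallest fitting shift is only a working assumption of this sketch.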

Determination of the ancestral allele
A single chimpanzee DNA sample was resequenced across all the genes. The chromatograms were analyzed using Phred for base-calling and quality determination. The resulting reads were aligned with the human consensus sequence using the program cross_match (http://www.phrap.org). The resulting alignments were inspected and edited for accuracy. A chimpanzee consensus sequence was thus generated for every gene. This sequence was then compared with the human consensus sequence to determine the ancestral state of the indel allele.
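The comparison in the last step can be illustrated with a minimal sketch. It assumes that the chimpanzee sequence at the locus matches one of the two human alleles exactly and treats that allele as ancestral; the function name and the example sequences are hypothetical.

```python
def classify_indel(long_allele, short_allele, chimp_allele):
    """Infer the ancestral allele of a diallelic indel from the chimpanzee state.

    Returns (ancestral, event): if the chimpanzee carries the short allele,
    the long human allele arose by insertion; if it carries the long allele,
    the short human allele arose by deletion.
    """
    if chimp_allele == short_allele:
        return "short", "insertion"
    if chimp_allele == long_allele:
        return "long", "deletion"
    return None, "undetermined"  # chimpanzee matches neither human allele


if __name__ == "__main__":
    # Made-up alleles spanning the indel and a few flanking bases
    print(classify_indel("ACGTTTGA", "ACGGA", "ACGGA"))
```

Sites where the chimpanzee sequence matches neither allele are left undetermined rather than forced into one class.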

Population genetic analysis
Because the number of segregating sites discovered in a sample depends strongly on sample size, sequence diversity was summarized using two commonly used moment estimators: (1) Watterson's estimator (θ) (65), which is an estimate of the expected per-site nucleotide heterozygosity based on the number of segregating sites and the sample size, and (2) Tajima's estimator (π) (66), which measures the frequency with which any two randomly chosen sequences in the sample differ at a site. For each population (AD, ED and PDR panel), values of θ and π were calculated using indels and substitutions separately. The test statistic Tajima's D (67) was calculated and used to compare the two summary measures and to examine characteristics of the allele frequency spectra.
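As an illustration of how these three quantities relate, the following sketch computes them from allele counts at diallelic sites using the standard textbook formulas. It is not the software used in this study. The function name, the input format (one allele count per segregating site among n sampled chromosomes) and the example values are assumptions made for illustration, and each indel is treated as a single segregating site.

```python
from math import sqrt


def diversity_stats(allele_counts, n, length):
    """Watterson's theta and pi (both per site) and Tajima's D.

    allele_counts: for each segregating site, the number of sampled
                   chromosomes carrying one of the two alleles (1..n-1).
    n:             number of chromosomes sampled.
    length:        number of sites surveyed (bp).
    """
    s = len(allele_counts)  # number of segregating sites
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / (i * i) for i in range(1, n))

    theta_w = s / a1  # Watterson's estimator for the whole region
    # pi: average number of pairwise differences, summed over sites
    pi = sum(2.0 * k * (n - k) / (n * (n - 1)) for k in allele_counts)

    if s == 0:
        return 0.0, 0.0, None  # D is undefined without segregating sites

    # Constants for the variance of (pi - theta_w)
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / (a1 * a1)
    e1 = c1 / a1
    e2 = c2 / (a1 * a1 + a2)

    d = (pi - theta_w) / sqrt(e1 * s + e2 * s * (s - 1))
    return theta_w / length, pi / length, d


if __name__ == "__main__":
    # Hypothetical example: five singleton variants in 10 chromosomes over 1 kb
    print(diversity_stats([1, 1, 1, 1, 1], n=10, length=1000))
```

An excess of low-frequency variants lowers π relative to θ and gives a negative D, whereas an excess of intermediate-frequency variants gives a positive D.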

In order to compare the degree of association (LD) between marker pairs containing indels with those containing only substitutions, two descriptive statistical measures of association, |D'| and r² (11,68,69), were computed for the different populations. All of the above statistics were calculated using computer programs provided by the Kruglyak Lab (http://www.fhcrc.org/labs/kruglyak). The method used an EM algorithm-based approach to infer the haplotype frequencies for the pairs of markers, which are required to calculate the LD measures.
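The general idea of EM-based haplotype inference for a pair of diallelic markers can be sketched as follows. This is a generic textbook formulation, not the Kruglyak Lab programs. It assumes unphased genotypes coded as the number of copies (0, 1 or 2) of an arbitrarily chosen allele at each marker, random mating and no missing data. Only double heterozygotes are ambiguous with respect to phase, so the E-step apportions them between the two possible haplotype configurations. All function names are hypothetical.

```python
def em_haplotype_freqs(genotypes, max_iter=200, tol=1e-12):
    """Estimate two-marker haplotype frequencies from unphased genotypes.

    genotypes: list of (g1, g2), copies of allele '1' at markers 1 and 2.
    Returns a dict mapping haplotypes (a1, a2) to estimated frequencies.
    """
    haps = [(0, 0), (0, 1), (1, 0), (1, 1)]
    known = dict.fromkeys(haps, 0.0)  # counts from phase-unambiguous samples
    n_double_het = 0
    for g1, g2 in genotypes:
        if g1 == 1 and g2 == 1:
            n_double_het += 1  # phase unknown: 11/00 or 10/01
            continue
        known[(1 if g1 == 2 else 0, 1 if g2 == 2 else 0)] += 1
        known[(1 if g1 >= 1 else 0, 1 if g2 >= 1 else 0)] += 1

    total = 2.0 * len(genotypes)
    freqs = dict.fromkeys(haps, 0.25)  # uniform starting point
    for _ in range(max_iter):
        # E-step: probability that a double heterozygote is 11/00
        cis = freqs[(1, 1)] * freqs[(0, 0)]
        trans = freqs[(1, 0)] * freqs[(0, 1)]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: re-estimate frequencies from expected haplotype counts
        counts = dict(known)
        counts[(1, 1)] += n_double_het * w
        counts[(0, 0)] += n_double_het * w
        counts[(1, 0)] += n_double_het * (1 - w)
        counts[(0, 1)] += n_double_het * (1 - w)
        new = {h: counts[h] / total for h in haps}
        converged = max(abs(new[h] - freqs[h]) for h in haps) < tol
        freqs = new
        if converged:
            break
    return freqs


def ld_measures(freqs):
    """Return (|D'|, r^2) from haplotype frequencies; both markers must be polymorphic."""
    p_a = freqs[(1, 0)] + freqs[(1, 1)]
    p_b = freqs[(0, 1)] + freqs[(1, 1)]
    if not (0 < p_a < 1 and 0 < p_b < 1):
        raise ValueError("both markers must be polymorphic")
    d = freqs[(1, 1)] - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d_prime, r2


if __name__ == "__main__":
    # Hypothetical genotypes for ten individuals at two markers
    sample = [(0, 0)] * 4 + [(2, 2)] * 4 + [(1, 1)] * 2
    print(ld_measures(em_haplotype_freqs(sample)))
```

Because |D'| and r² are functions of the haplotype frequencies only, any method that estimates those frequencies from unphased genotypes can be substituted for the EM step.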


    SUPPLEMENTARY MATERIAL
 
Supplementary Material is available at HMG Online.


    ACKNOWLEDGEMENTS
 
We would like to thank the members of the laboratory of D.A.N. for their efforts in discovery and cataloging of the variation data: Q. Yi, D. Carrington, C. Hastings, E. Calhoun, J. Smith, T. Shaffer, M. Ozuna, S. Da Ponte, N. Rajkumar, M. Wong, P. Keyes, C. Poel, B. Borrayo, M. Montoya, E. Torskey, M. Wook-Chung, D. Nguyen, K. Sherwood, M. Daniels, C. Nguyen, B. Howie, P. Lee, R. Mackelprang, P. Robertson, W. Shackwitz, A. Sherwood, A. Olson, J.T. Jackson, T. Ritchie, B. Leithauser, J. Sloan, E. Toth, L. Witrak, S. Kuldanek and T. Armel. We would also like to thank J. Akey, C. Carlson, D. Crawford and R. Mackelprang for critical reading of this manuscript. This work was supported by grants from the National Heart Lung and Blood Institute PGA (U01 HL66682) (D.A.N. and M.J.R.) and the National Institute of Environmental Health Sciences (NO1 ES-15478) (D.A.N. and M.J.R.).


    REFERENCES
 

  1. Patil, N., Berno, A.J., Hinds, D.A., Barrett, W.A., Doshi, J.M., Hacker, C.R., Kautzer, C.R., Lee, D.H., Marjoribanks, C., McDonough, D.P. et al. (2001) Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science, 294, 1719–1723.

  2. Dawson, E., Abecasis, G.R., Bumpstead, S., Chen, Y., Hunt, S., Beare, D.M., Pabial, J., Dibling, T., Tinsley, E., Kirby, S. et al. (2002) A first-generation linkage disequilibrium map of human chromosome 22. Nature, 418, 544–548.

  3. The International HapMap Consortium (2003) The International HapMap Project. Nature, 426, 789–796.

  4. Phillips, M.S., Lawrence, R., Sachidanandam, R., Morris, A.P., Balding, D.J., Donaldson, M.A., Studebaker, J.F., Ankener, W.M., Alfisi, S.V., Kuo, F.S. et al. (2003) Chromosome-wide distribution of haplotype blocks and the role of recombination hot spots. Nat. Genet., 33, 382–387.

  5. Ke, X., Hunt, S., Tapper, W., Lawrence, R., Stavrides, G., Ghori, J., Whittaker, P., Collins, A., Morris, A.P., Bentley, D. et al. (2004) The impact of SNP density on fine-scale patterns of linkage disequilibrium. Hum. Mol. Genet., 13, 577–588.

  6. Venter, J.C., Adams, M.D., Myers, E.W., Li, P.W., Mural, R.J., Sutton, G.G., Smith, H.O., Yandell, M., Evans, C.A., Holt, R.A. et al. (2001) The sequence of the human genome. Science, 291, 1304–1351.

  7. Sachidanandam, R., Weissman, D., Schmidt, S.C., Kakol, J.M., Stein, L.D., Marth, G., Sherry, S., Mullikin, J.C., Mortimore, B.J., Willey, D.L. et al. (2001) A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature, 409, 928–933.

  8. Stephens, J.C., Schneider, J.A., Tanguay, D.A., Choi, J., Acharya, T., Stanley, S.E., Jiang, R., Messer, C.J., Chew, A., Han, J.H. et al. (2001) Haplotype variation and linkage disequilibrium in 313 human genes. Science, 293, 489–493.

  9. Gabriel, S.B., Schaffner, S.F., Nguyen, H., Moore, J.M., Roy, J., Blumenstiel, B., Higgins, J., DeFelice, M., Lochner, A., Faggart, M. et al. (2002) The structure of haplotype blocks in the human genome. Science, 296, 2225–2229.

  10. Clark, A.G., Nielsen, R., Signorovitch, J., Matise, T.C., Glanowski, S., Heil, J., Winn-Deen, E.S., Holden, A.L. and Lai, E. (2003) Linkage disequilibrium and inference of ancestral recombination in 538 single-nucleotide polymorphism clusters across the human genome. Am. J. Hum. Genet., 73, 285–300.

  11. Pritchard, J.K. and Przeworski, M. (2001) Linkage disequilibrium in humans: models and data. Am. J. Hum. Genet., 69, 1–14.

  12. Weber, J.L., David, D., Heil, J., Fan, Y., Zhao, C. and Marth, G. (2002) Human diallelic insertion/deletion polymorphisms. Am. J. Hum. Genet., 71, 854–862.

  13. Picoult-Newberg, L., Ideker, T.E., Pohl, M.G., Taylor, S.L., Donaldson, M.A., Nickerson, D.A. and Boyce-Jacino, M. (1999) Mining SNPs from EST databases. Genome Res., 9, 167–174.

  14. Schmid, K.J., Sorensen, T.R., Stracke, R., Torjek, O., Altmann, T., Mitchell-Olds, T. and Weisshaar, B. (2003) Large-scale identification and analysis of genome-wide single-nucleotide polymorphisms for mapping in Arabidopsis thaliana. Genome Res., 13, 1250–1257.

  15. Taillon-Miller, P., Gu, Z., Li, Q., Hillier, L. and Kwok, P.Y. (1998) Overlapping genomic sequences: a treasure trove of single-nucleotide polymorphisms. Genome Res., 8, 748–754.

  16. Batley, J., Barker, G., O'Sullivan, H., Edwards, K.J. and Edwards, D. (2003) Mining for single nucleotide polymorphisms and insertions/deletions in maize expressed sequence tag data. Plant Physiol., 132, 84–91.

  17. Collins, F.S., Brooks, L.D. and Chakravarti, A. (1998) A DNA polymorphism discovery resource for research on human genetic variation. Genome Res., 8, 1229–1231.

  18. Nickerson, D.A., Tobe, V.O. and Taylor, S.L. (1997) PolyPhred: automating the detection and genotyping of single nucleotide substitutions using fluorescence-based resequencing. Nucleic Acids Res., 25, 2745–2751.

  19. Kwok, P.Y., Carlson, C., Yager, T.D., Ankener, W. and Nickerson, D.A. (1994) Comparative analysis of human DNA variations by fluorescence-based sequencing of PCR products. Genomics, 23, 138–144.

  20. Parker, L.T., Zakeri, H., Deng, Q., Spurgeon, S., Kwok, P.Y. and Nickerson, D.A. (1996) AmpliTaq DNA polymerase, FS dye-terminator sequencing: analysis of peak height patterns. Biotechniques, 21, 694–699.

  21. Miller, R.D., Taillon-Miller, P. and Kwok, P.Y. (2001) Regions of low single-nucleotide polymorphism incidence in human and orangutan Xq: deserts and recent coalescences. Genomics, 71, 78–88.

  22. Marth, G.T., Czabarka, E., Murvai, J. and Sherry, S.T. (2004) The allele frequency spectrum in genome-wide human variation data reveals signals of differential demographic history in three large world populations. Genetics, 166, 351–372.

  23. Bamshad, M. and Wooding, S.P. (2003) Signatures of natural selection in the human genome. Nat. Rev. Genet., 4, 99–111.

  24. Kruglyak, L. and Nickerson, D.A. (2001) Variation is the spice of life. Nat. Genet., 27, 234–236.

  25. Kimura, M. (1969) The rate of molecular evolution considered from the standpoint of population genetics. Proc. Natl Acad. Sci. USA, 63, 1181–1188.

  26. Petrov, D.A. (2002) Mutational equilibrium model of genome size evolution. Theor. Popul. Biol., 61, 531–544.

  27. Boissinot, S., Chevret, P. and Furano, A.V. (2000) L1 (LINE-1) retrotransposon evolution and amplification in recent human history. Mol. Biol. Evol., 17, 915–928.

  28. Batzer, M.A. and Deininger, P.L. (2002) Alu repeats and human genomic diversity. Nat. Rev. Genet., 3, 370–379.

  29. Wall, J.D. and Pritchard, J.K. (2003) Haplotype blocks and linkage disequilibrium in the human genome. Nat. Rev. Genet., 4, 587–597.

  30. Reich, D.E., Cargill, M., Bolk, S., Ireland, J., Sabeti, P.C., Richter, D.J., Lavery, T., Kouyoumjian, R., Farhadian, S.F., Ward, R. et al. (2001) Linkage disequilibrium in the human genome. Nature, 411, 199–204.

  31. Watterson, G.A. and Guess, H.A. (1977) Is the most frequent allele the oldest? Theor. Popul. Biol., 11, 141–160.

  32. Wilson, J.F. and Goldstein, D.B. (2000) Consistent long-range linkage disequilibrium generated by admixture in a Bantu-Semitic hybrid population. Am. J. Hum. Genet., 67, 926–935.

  33. Pritchard, J.K. and Rosenberg, N.A. (1999) Use of unlinked genetic markers to detect population stratification in association studies. Am. J. Hum. Genet., 65, 220–228.

  34. Ardlie, K.G., Kruglyak, L. and Seielstad, M. (2002) Patterns of linkage disequilibrium in the human genome. Nat. Rev. Genet., 3, 299–309.

  35. Crawford, D.C., Bhangale, T., Li, N., Hellenthal, G., Rieder, M.J., Nickerson, D.A. and Stephens, M. (2004) Evidence for substantial fine-scale variation in recombination rates across the human genome. Nat. Genet., 36, 700–706.

  36. Carlson, C.S., Eberle, M.A., Rieder, M.J., Yi, Q., Kruglyak, L. and Nickerson, D.A. (2004) Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. Am. J. Hum. Genet., 74, 106–120.

  37. Dawson, E., Chen, Y., Hunt, S., Smink, L.J., Hunt, A., Rice, K., Livingston, S., Bumpstead, S., Bruskiewich, R., Sham, P. et al. (2001) A SNP resource for human chromosome 22: extracting dense clusters of SNPs from the genomic sequence. Genome Res., 11, 170–178.

  38. Krawczak, M., Ball, E.V., Fenton, I., Stenson, P.D., Abeysinghe, S., Thomas, N. and Cooper, D.N. (2000) Human gene mutation database-a biomedical information and research resource. Hum. Mutat., 15, 45–51.

  39. Robledo, R., Beggs, W. and Bender, P. (2003) A simple and cost-effective method for rapid genotyping of insertion/deletion polymorphisms. Genomics, 82, 580–582.

  40. Syvanen, A.C. (2001) Accessing genetic variation: genotyping single nucleotide polymorphisms. Nat. Rev. Genet., 2, 930–942.

  41. Zhang, Z. and Gerstein, M. (2003) Patterns of nucleotide substitution, insertion and deletion in the human genome inferred from pseudogenes. Nucleic Acids Res., 31, 5338–5348.

  42. Robertson, H.M. (2000) The large srh family of chemoreceptor genes in Caenorhabditis nematodes reveals processes of genome evolution involving large duplications and deletions and intron gains and losses. Genome Res., 10, 192–203.

  43. Graur, D., Shuali, Y. and Li, W.H. (1989) Deletions in processed pseudogenes accumulate faster in rodents than in humans. J. Mol. Evol., 28, 279–285.

  44. Petrov, D.A. and Hartl, D.L. (1998) High rate of DNA loss in the Drosophila melanogaster and Drosophila virilis species groups. Mol. Biol. Evol., 15, 293–302.

  45. Sebat, J., Lakshmi, B., Troge, J., Alexander, J., Young, J., Lundin, P., Maner, S., Massa, H., Walker, M., Chi, M. et al. (2004) Large-scale copy number polymorphism in the human genome. Science, 305, 525–528.

  46. Armour, J.A., Anttinen, T., May, C.A., Vega, E.E., Sajantila, A., Kidd, J.R., Kidd, K.K., Bertranpetit, J., Paabo, S. and Jeffreys, A.J. (1996) Minisatellite diversity supports a recent African origin for modern humans. Nat. Genet., 13, 154–160.

  47. Harpending, H. and Rogers, A. (2000) Genetic perspectives on human origins and differentiation. Annu. Rev. Genomics Hum. Genet., 1, 361–385.

  48. Watkins, W.S., Ricker, C.E., Bamshad, M.J., Carroll, M.L., Nguyen, S.V., Batzer, M.A., Harpending, H.C., Rogers, A.R. and Jorde, L.B. (2001) Patterns of ancestral human diversity: an analysis of Alu-insertion and restriction-site polymorphisms. Am. J. Hum. Genet., 68, 738–752.

  49. Lin, S.C., Chung, M.Y., Huang, J.W., Shieh, T.M., Liu, C.J. and Chang, K.W. (2004) Correlation between functional genotypes in the matrix metalloproteinases-1 promoter and risk of oral squamous cell carcinomas. J. Oral. Pathol. Med., 33, 323–326.

  50. Karban, A.S., Okazaki, T., Panhuysen, C.I., Gallegos, T., Potter, J.J., Bailey-Wilson, J.E., Silverberg, M.S., Duerr, R.H., Cho, J.H., Gregersen, P.K. et al. (2004) Functional annotation of a novel NFKB1 promoter polymorphism that increases risk for ulcerative colitis. Hum. Mol. Genet., 13, 35–45.

  51. Salem, A.H., Kilroy, G.E., Watkins, W.S., Jorde, L.B. and Batzer, M.A. (2003) Recently integrated Alu elements and human genomic diversity. Mol. Biol. Evol., 20, 1349–1361.

  52. Kruglyak, L. (1999) Prospects for whole-genome linkage disequilibrium mapping of common disease genes. Nat. Genet., 22, 139–144.

  53. Kruglyak, L. (1997) The use of a genetic map of biallelic markers in linkage studies. Nat. Genet., 17, 21–24.

  54. Zondervan, K.T. and Cardon, L.R. (2004) The complex interplay among factors that influence allelic association. Nat. Rev. Genet., 5, 89–100.

  55. Xiong, M. and Jin, L. (1999) Comparison of the power and accuracy of biallelic and microsatellite markers in population-based gene-mapping methods. Am. J. Hum. Genet., 64, 629–640.

  56. Frisse, L., Hudson, R.R., Bartoszewicz, A., Wall, J.D., Donfack, J. and Di Rienzo, A. (2001) Gene conversion and different population histories may explain the contrast between polymorphism and linkage disequilibrium levels. Am. J. Hum. Genet., 69, 831–843.

  57. Tishkoff, S.A., Dietzsch, E., Speed, W., Pakstis, A.J., Kidd, J.R., Cheung, K., Bonne-Tamir, B., Santachiara-Benerecetti, A.S., Moral, P. and Krings, M. (1996) Global pat