Human Molecular Genetics Advance Access originally published online on November 3, 2004
Human Molecular Genetics 2005 14(1):59-69; doi:10.1093/hmg/ddi006
Human Molecular Genetics, Vol. 14, No. 1 © Oxford University Press 2005; all rights reserved
Comprehensive identification and characterization of diallelic insertion–deletion polymorphisms in 330 human candidate genes
1Department of Bioengineering and 2Department of Genome Sciences, University of Washington, Seattle, WA, USA
* To whom correspondence should be addressed at: Department of Genome Sciences, University of Washington, PO Box 357730, Seattle, WA 98195-7730, USA. Tel: +1 2066857387; Fax: +1 2062216498; Email: debnick{at}u.washington.edu
Received August 19, 2004; Accepted October 22, 2004
| ABSTRACT |
|---|
Despite being the second most frequent type of polymorphism in the genome, diallelic insertion–deletion polymorphisms (indels) have received far less attention in the study of sequence variation. In this report, we describe an approach that can detect indels in the heterozygous state and can comprehensively identify indels in the target sequence. Using this approach, we identified 2393 indels in a set of 330 candidate genes, i.e. an average of seven indels per gene, with about two indels per gene being common (minor allele frequency ≥0.1). We compared the population genetic characteristics of indels with those of substitutions in these data. Our data supported the finding that deletions occur more frequently than insertions in the human genome. The 5'-UTR and coding regions of the genes showed a significantly lower diversity for indels compared with other regions, suggesting differences in the effects of selection on indels and substitutions. Sequence diversity and pairwise linkage disequilibrium (LD) findings for the different populations were similar to earlier results and included a greater skew towards low-frequency variants and a faster rate of LD decay in the African-descent population compared with the non-African populations. Within populations, the allele frequency spectra and LD-decay profiles for indels were similar to those for substitutions. Overall, the findings suggest that, although the mechanisms giving rise to indels may be different from those causing substitutions, the evolutionary histories of indels and substitutions are similar, and that indels can play a valuable role in association studies and marker selection strategies.
| INTRODUCTION |
|---|
Advances in high-throughput genotyping technology and the discovery of millions of single nucleotide polymorphisms are rapidly being translated into high density linkage disequilibrium (LD) maps of the human genome (1
A number of previous studies have described large-scale identification of short diallelic indels by examining aligned reference sequences of individuals in the sample population for the presence of high-quality insertion or deletion type of mismatches (12–16). Most of the recent methods used for identification of indels involve PCR amplification of the target DNA followed by resequencing and base-calling to determine the corresponding sequences of the individuals in the sample. Indels are then identified by finding gaps in the alignment of these sequences. Thus, the identification of an indel relies on production of a gap resulting from the difference in the lengths of the two alleles in the sample sequences. However, as most minor alleles are not frequent enough to be observed in a homozygous genotype in the sample, most such comparisons involve sequences of homozygotes for the major allele and heterozygotes. Because heterozygotes have a shift in the sequences of the two alleles, they produce a complex signal on the chromatograms (see below) and the base-calling software cannot correctly determine their sequences in the region of the frame shift. As a consequence, the alignment of sequences of heterozygotes with those of homozygotes for the major allele fails to reveal a gap. Therefore, the identification of indels using the earlier approach requires the presence of individuals homozygous for both long and short alleles in the sampled sequences and fails to identify a large fraction of indels in the population. Owing to the errors in base-calling and sequence alignments, the confirmation rate of indels detected by identification of gaps in alignments decreases as the length of the indel decreases (12). This makes it impossible to reliably identify 1 bp indels which, as we show, limits the analysis of indel markers. Thus, there has not been an attempt to study sequence variation and LD characteristics of indels identified in an unbiased and comprehensive manner. An assessment of these characteristics is important for evaluating the viability of these markers in association studies and for understanding the nature of the population genetic forces that influence sequence variation. In this report, we show that, as for substitution polymorphisms, a fluorescence-sequencing-based approach can be successfully employed to detect the full spectrum of indels in the target sequences. We describe the population genetic characteristics of gene-associated diallelic indels identified using this approach and compare these to the substitution polymorphisms identified in the same genomic regions.
| RESULTS |
|---|
Sequence analysis and detection of indels
We resequenced a total of 6495 kb of reference sequence for 330 human candidate genes. Approximately 40% of the genes (127) were involved in inflammation, lipid metabolism and blood pressure regulation and were resequenced in 24 individuals of African-descent (AD) and 23 individuals of European-descent (ED). The remaining genes (203) were involved in DNA repair and cell cycle pathways and were resequenced in 90 individuals from the polymorphism discovery resource (PDR) panel (17
Our approach to indel detection is based on the fact that it is possible to detect an indel polymorphism as a heterozygote, as the difference in the lengths of the two alleles gives rise to a shift in the sequence of one allele relative to the other. Fluorescence sequencing measures the relative incorporation of each nucleotide at a specific distance from the sequence primer. The signal intensities in the chromatogram are contributed by both the alleles. Therefore, the shift in the sequences of alleles of the heterozygote produces a complex signal on the chromatograms which is characterized by an abrupt drop in the quality of the trace (Fig. 1A). This drop in sequence quality for an individual compared with homozygotes of high quality is a distinctive feature of the sequences containing an indel, and aids in their detection. The complex signal produced by heterozygotes displays multiple heterozygous peaks (Fig. 1B) compared with the homozygous sequences (Fig. 1C and D). Each of these heterozygous peaks in the indel is similar to that detected for substitution polymorphisms: there is an approximately 50% drop in the height of the primary peak compared with that of a homozygous individual, and there is a secondary peak corresponding to the base in the alternate allele (18–20). This pattern of heterozygous peaks at mismatched bases in the two alleles extends until the end of the chromatogram. The pattern is reproducible in the traces obtained from heterozygous individuals and can serve as a reliable indicator for the presence of indels in a set of aligned traces.
Previous methods used to detect indels require homozygous sequences from both the alleles. In our dataset, we find that only 34.5% of indels showed the presence of homozygotes for both the alternate alleles. In the remaining cases, the homozygotes for the minor alleles were missing in the sample populations. Similarly, 32.68% of substitution polymorphisms discovered in the same region displayed homozygotes for both the alternate alleles. These data demonstrate that the previous methods of indel detection could fail to identify the majority of indels in the regions surveyed.
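The scarcity of minor-allele homozygotes can be illustrated with a small calculation. The sketch below assumes Hardy-Weinberg genotype proportions and independent sampling of individuals, which are assumptions of this example rather than statements from the analysis; the function names are hypothetical. It compares the chance that a sample contains at least one homozygote for each allele (what alignment-based detection needs) with the chance that it contains at least one heterozygote (what trace-based detection needs).

```python
def prob_both_homozygotes(maf: float, n_individuals: int) -> float:
    """P(sample holds >= 1 homozygote for EACH allele), assuming HWE."""
    p_minor_hom = maf ** 2
    p_major_hom = (1.0 - maf) ** 2
    n = n_individuals
    # Inclusion-exclusion over the events "no minor homozygote sampled"
    # and "no major homozygote sampled".
    p_no_minor = (1.0 - p_minor_hom) ** n
    p_no_major = (1.0 - p_major_hom) ** n
    p_neither = (1.0 - p_minor_hom - p_major_hom) ** n
    return 1.0 - p_no_minor - p_no_major + p_neither


def prob_heterozygote(maf: float, n_individuals: int) -> float:
    """P(sample holds >= 1 heterozygote), assuming HWE."""
    p_het = 2.0 * maf * (1.0 - maf)
    return 1.0 - (1.0 - p_het) ** n_individuals


if __name__ == "__main__":
    # 24 individuals matches the size of the AD sample described above.
    for maf in (0.05, 0.1, 0.2, 0.5):
        both = prob_both_homozygotes(maf, 24)
        het = prob_heterozygote(maf, 24)
        print(f"MAF={maf:.2f}  both homozygotes={both:.3f}  heterozygote={het:.3f}")
```

Under these assumptions, a heterozygote is far more likely to be present in a small sample than a minor-allele homozygote when the minor allele is rare, which is the motivation for detecting indels directly in heterozygous traces.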
Distribution of lengths of indel alleles
Using the detection methods described earlier, we identified a total of 33 829 substitutions and 2393 indels in 330 genes. In the set of inflammation, lipid metabolism and blood pressure regulation genes, 12 078 substitutions and 799 indel polymorphisms were identified; and in the DNA repair and cell cycle genes, 21 751 substitutions and 1594 indel polymorphisms were identified. Overall, approximately 6.6% of the discovered polymorphic sites were indels. Diallelic indels were found approximately once every 2714 bp, whereas the frequency of substitution polymorphisms was approximately once every 192 bp.
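The densities quoted above follow directly from the reported counts and the 6495 kb of resequenced reference sequence. The short check below simply redoes that arithmetic; the variable names are illustrative.

```python
# Counts and sequence length as reported in the text.
SEQUENCED_BP = 6_495_000          # 6495 kb of reference sequence
SUBS_BY_SET = (12_078, 21_751)    # two gene sets
INDELS_BY_SET = (799, 1_594)

n_subs = sum(SUBS_BY_SET)         # total substitutions
n_indels = sum(INDELS_BY_SET)     # total indels

indel_fraction = n_indels / (n_subs + n_indels)  # share of polymorphic sites
bp_per_indel = SEQUENCED_BP / n_indels           # average spacing of indels
bp_per_sub = SEQUENCED_BP / n_subs               # average spacing of substitutions

if __name__ == "__main__":
    print(f"substitutions: {n_subs}, indels: {n_indels}")
    print(f"indel share of polymorphic sites: {indel_fraction:.1%}")
    print(f"one indel every {bp_per_indel:.0f} bp")
    print(f"one substitution every {bp_per_sub:.0f} bp")
```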
The range of indel sizes was limited to that which can be captured by PCR amplification and was found to be 1–543 bp with a median of 2 bp. The majority of the discovered indels were short, i.e. approximately 84% were <5 bp in length. Single base-pair indels were the most common form of diallelic indels and accounted for approximately 46% of all detected indels. Similar to previous findings (12), the frequencies of 2, 3 and 4 bp indels were approximately equal and, beyond 4 bp, the frequencies decreased as the length of the indels increased (Table 1).
A number of previous studies have demonstrated that most human diallelic polymorphisms, including indels, arose after the divergence of the common ancestors of humans and apes, and that nearly all indels are monomorphic in chimpanzees and gorillas (12
2.3 : 1.
Allele frequency spectrum
The allele frequency spectrum for substitutions is well documented (22). However, a description of the allele frequency spectrum for an unbiased dataset of indels is not available. It is important to describe the allele frequency spectrum for all markers because it provides valuable clues to population demography (22) and the influences of natural selection (23). It also enables the calculation of detection rates for markers with a given population minor allele frequency (MAF) and thus allows calculation of sample sizes required to ascertain polymorphic markers in the population (24). In this dataset, for all three samples (AD, ED and PDR), approximately 30% of indels were common indels (MAF at least 0.1). We compared the MAF distributions of indels and substitutions. Within each of the AD and ED populations and the PDR panel (Fig. 2A, B and C, respectively), the allele frequency distributions for indels were similar to those for substitutions. When compared with ED, the AD population showed a greater skew towards low-frequency variants for both types of markers, as expected. As the PDR panel is a mixture of populations, it showed a much greater bias towards low-frequency variants. Similar to the PDR panel, a mixed population constructed from the AD and ED populations showed a strong bias towards low-frequency variants (Fig. 2D). A comparison of the ancestral allele frequency spectra for indels in the AD and ED populations revealed a greater skew towards higher frequency of the ancestral allele in the AD population (see Materials and Methods; Supplementary Material, Fig. S1).
In order to evaluate the relative contributions of indels and substitutions to the genetic variation and to compare their extents in different populations, we computed two summary measures of sequence diversity, Watterson's estimator and Tajima's estimator (see Materials and Methods), using the two types of markers (Table 2). Under the standard neutral model (random-mating population of constant size with neutral mutations occurring according to the infinite sites model) (25
Watterson's and Tajima's estimators for indel markers compared with the ED sample (ANOVA P-values: 0.001 and 0.002, respectively). As expected for a structured population, for the PDR panel, averages of both the estimators showed values intermediate to those of the AD and ED samples. Values of the two estimators computed using only substitutions showed similar trends (Table 2). In this regard, the magnitude of the contribution of indels to overall sequence diversity was much smaller than that of substitutions, whereas when compared between the populations, indels showed characteristics similar to those of substitutions.
The influence on the gene function due to an indel type of mutation is expected to be greater than that of a substitution as they give rise to a more severe alteration in the sequence (26
Watterson's and Tajima's estimators) separately for these regions (Supplementary Material, Table S1). As expected, coding regions showed lower values of the two measures than non-coding regions for both indels and substitutions. Ratios of sequence diversity of coding to non-coding regions were found to be significantly lower for indels compared with substitutions in both AD and ED populations as well as the PDR panel (P-values using paired t-tests <10⁻⁵). Diversity values of the 5'-UTR regions were found to be significantly lower than those of 3'-UTR regions for indels in the AD (paired t-test P-value 0.03) and ED (P-value 0.004) samples. However, diversity values for the UTR regions computed using only substitutions did not show such significantly lower estimates for the 5'-UTRs (paired t-test P-values: AD, 0.828; ED, 0.599).
A substantial proportion of the sequenced region (approximately 33%) was identified as comprising interspersed repetitive elements (mobile elements such as Alu and LINE) using the program RepeatMasker (http://www.repeatmasker.org). Repetitive sequences exhibit different evolutionary dynamics compared with the unique regions of the genome. Some subfamilies of the repetitive elements are newly incorporated into the genome compared with others and tend to show different levels of diversity. For example, the Ta-1 subfamily of the LINE-1 repetitive element is younger than the Ta-0 family and has replaced the Ta-0 family as the replicatively dominant subfamily (27); the Alu Sx family is one of the older Alu subfamilies, whereas Ya5 and Yb8 are new Alu subfamilies (28). We examined the diversity characteristics of these regions and compared them with those of the unique regions. Diversity values for the repetitive regions computed using indels were found to be significantly lower than those for the non-repeat sequences (P-value using paired t-tests for pooled data 0.029 for the AD and ED populations and the PDR panel), whereas an opposite effect was observed for diversity values computed using substitutions (P-value 0.033).
Information from diversity estimates based on allele frequencies can be used to test for natural selection and population expansion. We calculated Tajima's D, a statistic that summarizes information about the allele frequency spectrum by comparing Watterson's and Tajima's estimators of sequence diversity, using indels and substitutions separately. Within the populations (AD, ED and PDR), values of Tajima's D were similar for indels and substitutions (Table 2). Similar to previous reports on genetic diversity, the AD population, as opposed to the ED population, displayed a negative overall average Tajima's D consistent with an excess of low-frequency variants in the population, suggesting a recent expansion of the AD population leading to increased frequency of rare sites (8). Negative values were also observed for the PDR panel, possibly due to an excess of low-frequency variants resulting from the mixed sample composition of this population.
Linkage disequilibrium
To study the characteristics of pairwise association between diallelic indels and substitution polymorphisms in the sequenced regions and compare them with each other, we examined two statistical measures of association, |D'| and r². Values of |D'| close to 1 suggest little or no evidence for recombination, and |D'| significantly less than 1 implies historical recombination between the markers in the pair. In contrast, r² measures the statistical correlation between alleles at two markers and is inversely related to the sample size required to detect phenotypic association at one of the markers in the pair when the other is directly involved in causation of the phenotype (11,29). We divided the marker pairs into two sets: (i) pairs where one or both of the markers were indels and (ii) pairs that had no indels. As rare alleles are younger, we expect fewer historical recombination events between them, and consequently they tend to display stronger long-range LD (30,31). Therefore, in this analysis, we only considered high-frequency markers (MAF ≥0.2). Figure 3 shows the profiles of LD decay with distance for the two types of marker pairs. LD decay profiles of pairs with indels were similar to those of substitutions in all three samples. The ED sample showed a stronger overall average LD (average |D'| values: AD, 0.542 for indels and 0.551 for substitutions; ED, 0.693 for indels and 0.678 for substitutions) and a slower decay in LD compared with the AD sample (Fig. 3A and B). The PDR panel showed a high overall average (average |D'| values: 0.825 for indels and 0.838 for substitutions) and a slower decay in LD with distance (Fig. 3C) compared with the AD and ED populations. This was expected for the PDR panel owing to the population structure, which can generate artifactual LD between unlinked markers (32,33). Average |D'| values for pairs with indels separated by ≤1 kb were: AD, 0.834; ED, 0.905 and PDR, 0.943, and those separated by >1 kb were: AD, 0.525; ED, 0.677 and PDR, 0.817. Values for marker pairs with only substitutions were similar: average |D'| values for pairs separated by ≤1 kb were: AD, 0.872; ED, 0.911 and PDR, 0.948, and those separated by >1 kb were: AD, 0.529; ED, 0.659 and PDR, 0.829. Average r² values also showed similar trends. r² values above 1/3 can sometimes be regarded as an indication of sufficiently strong LD to be useful for mapping studies (34). The approximate ranges of distances up to which indels as well as substitutions displayed useful LD were: ED, 9–12 kb; AD, 4–6 kb and PDR, 50–70 kb. In all three populations, indels and substitutions showed similar profiles of decay in the fraction of pairs in complete LD (|D'|=1) with each other (Supplementary Material, Fig. S2).
| DISCUSSION |
|---|
Large-scale studies involving polymorphism discovery (8
The set of genes used in this study has a genome-wide representation. Extrapolating from this dataset, one in approximately every 15 diallelic polymorphisms (MAF ≥0.01) in the intragenic regions of the human genome is an indel. This estimate is lower than previous reports (12,37) that used polymorphisms identified in the overlap of clones. The difference may be attributed to the stronger negative selection on indels (see below) compared with substitutions in the intragenic regions owing to their deleterious effect (26,38). The extragenic regions of the genome may, therefore, have a higher indel density than intragenic regions. Thus, indels constitute a considerable fraction of the diallelic polymorphisms in humans. Inclusion of indel markers could improve the resolution of genetic maps and reveal a more detailed picture of the sequence variation for fine-mapping studies, in addition to improving the accuracy of estimates of recombination rates and statistically inferred haplotypes. All these factors play a vital role in the design of disease association studies (10,11).
Available procedures for indel detection rely on sequence alignment methods. Therefore, they require the presence of homozygotes for both of the alternate forms of alleles and give low confirmation rates for short indels. The method described here takes advantage of the characteristic pattern of peaks observed in heterozygous individuals and does not require the presence of homozygotes for the minor alleles in the surveyed sample. The length of the indel has no influence on this pattern of peaks in a heterozygote. Therefore, the method is expected to yield equal confirmation rates for indels of all lengths (up to a maximum length of the PCR product), making it possible to reliably identify short indels. These factors make the method more sensitive than the currently used methods (12,13,15). The detection procedure is amenable to automation and can be combined with programs such as PolyPhred that aim at automated analysis of chromatograms. A preliminary version of a computational method to automate the procedure of indel detection has been incorporated in PolyPhred (version 4.05). Once the indels are identified in the sample population, accurate and cost-effective large-scale genotyping in a large number of individuals can be carried out using a number of genotyping methods (39,40). Similar to previous findings (12,41), the spectrum of indel lengths was dominated by short indels (1–4 bp). Our results also showed that insertions (especially longer insertions) are rare when compared with deletions (Table 1). This is expected because there is a thermodynamic asymmetry in the replication slippage mechanisms responsible for insertions and deletions, where insertions require melting of an already replicated DNA segment (26). As a consequence, mutation rates of insertions (especially longer insertions) are expected to be lower than those of deletions. The higher mutation rate of deletions compared with insertions is reflected in the deletion/insertion ratios. The deletion/insertion ratio of 2.3 : 1 was close to that reported by The Human Gene Mutation Database (2.5 : 1) for disease-associated insertion–deletion mutations. Higher rates of deletion compared with insertion in short indels are consistent with the rates estimated in various organisms (42–44) and support the view that genome loss through small indels is one of the important mechanisms through which genome sizes evolve (26). Long indels (length ≥50 bp) formed a very small fraction (<0.6%) of the indels discovered in this study, suggesting that the human genome contains a relatively small number of such markers. A recent study has demonstrated that very large indels (>100 kb), also known as large-scale copy number polymorphisms, which cannot be identified by PCR-based methods, also contribute substantially to human genomic variation and may play a role in disease susceptibility due to their large size, gene content and instability of the corresponding genomic regions (45).
Similar to substitutions, Watterson's and Tajima's estimators computed using indels showed significantly higher values for the AD compared with the ED population. A similar pattern of sequence variation has been demonstrated in earlier studies and is often attributed to the non-African populations having undergone a severe bottleneck in the recent past (8,46). The AD population is deemed to have undergone a recent expansion (22,47). In addition, the AD population is an admixed population (i.e. African-Americans). These demographic differences between the AD and the ED populations explain the greater skew towards low-frequency values (Fig. 2A and B) and strong biases towards negative values of Tajima's D for the AD samples compared with ED samples. The greater skew towards high ancestral allele frequencies in the AD population compared with the ED population (Supplementary Material, Fig. S1) is similar to the patterns observed by Weber et al. (12) using indels and Watkins et al. (48) using Alu insertion and restriction site polymorphisms, and is consistent with the view that the modern human populations originated in Africa and that the ancestral alleles were preserved within Africa. Similar to the population mixture constructed from the AD and ED populations, the PDR panel, being a mixture of populations, showed a very strong bias towards low-frequency variants in the allele frequency spectra for indels as well as substitutions (Fig. 2C and D). For both the populations, this bias was also reflected in the summary statistics of diversity, as they showed high values of Watterson's estimator but low values of Tajima's estimator and consequently negative values of Tajima's D (Table 2). Recent population expansion has been suggested as the cause of negative values of Tajima's D in a large-scale survey of 313 genes (8). Our results indicate that population admixture can also explain this trend.
Indels showed a significantly lower ratio of coding to non-coding diversity compared with substitutions due to the selective pressures resulting from the more pronounced deleterious effects of indels in the coding regions compared with those of substitutions. Moreover, comparisons between 5'- and 3'-UTR regions showed that the 5'-UTRs had significantly lower indel diversity, whereas no such effect was observed for substitutions. These findings suggest that indels give rise to more severe functional alterations in the promoter sequences and undergo a stronger negative selective pressure than substitutions. This is likely due to the degenerate nature of consensus transcription factor binding sequences, implying that substitutions alter the binding affinity less than inserting or deleting tracts of promoter sequence. Thus, it would be fruitful to examine indels disrupting transcription factor binding sites to explore functionally relevant polymorphisms in association studies. Indeed, recent reports have indicated that indel polymorphisms in the promoter regions can give rise to potentially pathological alterations in transcriptional activity (49,50). Gene conversion is known to be one of the major mechanisms giving rise to the sequence diversity within the repetitive elements and the rates of gene conversion may be inversely related to the extent of mismatches between the involved sequences (28,51). In this dataset, repetitive elements showed a greater diversity for substitutions compared with the unique regions, whereas indels showed an opposite tendency. These findings seem to suggest that the presence of the insertion–deletion type of mismatches may suppress gene-conversion events between repetitive elements.
Efficient disease mapping by LD-based approaches requires availability of markers with appropriate MAF (52–54). The power to detect an association between the disease and the marker, and the accuracy of resolving the map location of the disease locus, depend on the marker allele distribution in the population (55). Within the populations, profiles of decay of pairwise LD for pairs with indels were similar to those for substitutions (Fig. 3). Similarities between the LD characteristics and frequency spectra of indels and substitutions underscore the utility of indel markers in LD-mapping studies. Comparison of the decay profiles of the AD and ED samples supported the extensive earlier evidence that LD decays at a faster rate in the AD population compared with the ED population (9,30,36,56). The bottleneck in the recent history of non-African populations is often considered as an explanation for the high levels of LD observed in the ED population (57,58). LD for the PDR panel showed a much slower decay compared with the AD and ED populations. This pattern of LD in the PDR panel can be attributed to several factors. First, the PDR panel is a stratified population sample, and in such populations even unlinked markers may exhibit strong LD (11,33). Second, the PDR panel is dominated by non-African populations, and non-African populations such as Native Americans are expected to show higher levels of LD (59). Third, the genes analyzed for the PDR panel are involved in maintaining genomic integrity, and loss-of-function mutations in these genes are known to be associated with severe diseases (60). These genes are more likely to display effects of selection. Selection against deleterious variations can inflate the observed levels of LD as the deleterious haplotypes are swept from the population (61). The range of distances over which LD levels were found useful for association mapping, as defined by Ardlie et al. (34), was in agreement with the previous estimates. On the basis of several previous studies, for the ED population, Ardlie et al. (34) proposed a range of 10–30 kb, whereas findings suggest that the range is much smaller for African populations (56). In our dataset, the ranges were found to be ED, 9–12 kb and AD, 4–6 kb.
Detection of heterozygous samples is indispensable to the identification of new polymorphic sites in a population because most alleles are not frequent enough to be observed as homozygotes. Direct sequencing-based typing provides an approach to reliably identify heterozygous indels. On the basis of this survey of an unbiased dataset, short diallelic indels form an appreciable fraction of the diallelic markers in the genome and can be utilized to improve the marker density of genome-wide maps. Within a population, the genetic variation due to indels and the LD characteristics of indels show trends similar to those of substitutions. These findings suggest that indels are stable markers and that, along with substitution markers, they can play a valuable role in association studies and marker selection strategies.
| MATERIALS AND METHODS |
|---|
Sequencing and identification of substitutions
Diallelic indels and substitutions used in this analysis were identified in the variation discovery efforts of two projects: (i) the University of Washington–Fred Hutchinson Cancer Research Center's variation discovery efforts funded by the National Heart Lung and Blood Institute's Program in Genomic Applications (PGA) and (ii) the National Institute of Environmental Health Sciences' Environmental Genome Project (EGP). The PGA project is aimed at studying genes involved in inflammatory processes, whereas the EGP project is aimed at examining genes implicated in DNA repair and cell cycle pathways. A list of the 127 genes from the PGA project and 203 genes from the EGP project analyzed in this work is provided in Supplementary Material, Tables S2 and S3, respectively. The DNA variation data were deposited into the GenBank and dbSNP databases. Genotypes are available through dbSNP (http://www.ncbi.nlm.nih.gov/SNP) or at http://pga.gs.washington.edu and http://egp.gs.washington.edu. DNA samples for the two projects were obtained from Coriell Cell Repository (http://locus.umdnj.edu/ccr). PGA genes were sequenced across two populations: 24 individuals of AD selected from the African-American Human Variation Panel (HD50AA: individuals NA17101–NA17116 and NA17133–NA17140) and 23 individuals of ED from Centre d'Etude du Polymorphisme Humain (CEPH) reference panel DNAs (Coriell Cell Repository numbers: NA06990, NA07019, NA07348, NA07349, NA10830, NA10831, NA10842, NA10843, NA10842–NA10845, NA10848, NA10850–NA10854, NA10857, NA10858, NA10860, NA10861, NA12547, NA12548 and NA12560). The EGP genes were sequenced across 90 individuals representing the US population (European-American, African-American, Mexican-American, Native-American and Asian-American) from the PDR panel (17
approximately 2.5 kb upstream of the gene and approximately 1.5 kb downstream of the gene was sequenced. A complete description of the resequencing protocol is available at the respective project websites (see above). In brief, overlapping PCR primers were designed to cover the target region with an average amplicon size of approximately 980 bp and an average overlap between amplicons of approximately 190 bp. The PCR products were sequenced using dye primer and dye terminator chemistry on ABI 3700 and ABI 3730 instruments. Trained analysts assembled the sequence data and mapped it onto the reference genomic sequence using Phred (62
Identification of indels
The sequence data generated by the procedure were analyzed for presence of diallelic indels. The aligned traces were scanned for presence of a characteristic pattern produced by heterozygote indels. The mismatched bases in the two alleles of a heterozygote produce a complex signal on the trace which can be readily distinguished from the homozygotes by the presence of signals corresponding to both the alleles and an abrupt drop in the quality of the read. Indels in the sequence were identified by the presence of multiple heterozygous sites. Once identified, the length of the indel was inferred from the pattern of peaks by performing a pairwise alignment of bases corresponding to the two allelic sequences.
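The length inference described above can be illustrated with a toy model. The sketch below is not the procedure used by the analysts or by PolyPhred; it assumes an idealized trace in which each position downstream of a known breakpoint yields exactly the set of bases contributed by the two alleles, and it assumes the indel is a deletion relative to a known reference. The function names and the example sequence are hypothetical.

```python
from typing import List, Optional, Set


def superimpose(ref: str, breakpoint: int, shift: int) -> List[Set[str]]:
    """Simulate the base sets seen in an idealized heterozygous-deletion trace.

    Upstream of the breakpoint both alleles agree; downstream, the deleted
    allele is read `shift` bases ahead of the reference allele.
    """
    observed = []
    for i in range(len(ref) - shift):
        if i < breakpoint:
            observed.append({ref[i]})
        else:
            observed.append({ref[i], ref[i + shift]})
    return observed


def infer_indel_length(ref: str, observed: List[Set[str]],
                       breakpoint: int, max_len: int = 50) -> Optional[int]:
    """Return the smallest shift that explains every downstream position."""
    for shift in range(1, max_len + 1):
        positions = range(breakpoint, min(len(observed), len(ref) - shift))
        if len(positions) == 0:
            break  # no downstream bases left to compare
        if all(observed[i] == {ref[i], ref[i + shift]} for i in positions):
            return shift
    return None  # pattern not explained by a single deletion


if __name__ == "__main__":
    reference = "ACGTTGCAAGCTTAGGCTAACGT"
    trace = superimpose(reference, breakpoint=5, shift=3)
    print(infer_indel_length(reference, trace, breakpoint=5))
```

In repetitive sequence (for example a homopolymer run) several shifts can explain the same pattern; this sketch returns the smallest one, which is a simplification.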
Determination of the ancestral allele
A single chimpanzee DNA sample was resequenced across all of the genes. The chromatograms were analyzed using Phred for base-calling and quality determination. The resulting reads were aligned with the human consensus sequence using the program cross_match (http://www.phrap.org). The resulting alignments were inspected and edited for accuracy. In this way, a chimpanzee consensus sequence was generated for every gene. This sequence was then compared with the human consensus sequence to determine the ancestral state of the indel allele.
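As a rough illustration of the final comparison step, the sketch below polarizes a diallelic indel once the chimpanzee state at the site is known. It assumes the chimpanzee allele is taken as ancestral, and it represents each allele as the string of bases present at the indel site (an empty string for the short allele). The function name and return values are hypothetical and are not taken from the project's pipeline.

```python
def classify_indel(chimp_allele, human_allele_a, human_allele_b):
    """Polarize a diallelic indel using the chimpanzee sequence.

    Each allele is the string of bases present at the indel site
    ("" for the short allele). Returns (ancestral_allele, event), where
    event is "insertion" or "deletion" on the human lineage, or
    (None, "undetermined") when the chimpanzee matches neither allele.
    """
    if chimp_allele == human_allele_a:
        ancestral, derived = human_allele_a, human_allele_b
    elif chimp_allele == human_allele_b:
        ancestral, derived = human_allele_b, human_allele_a
    else:
        return None, "undetermined"
    # A derived allele longer than the ancestral one implies an insertion.
    event = "insertion" if len(derived) > len(ancestral) else "deletion"
    return ancestral, event


# Chimpanzee carries the short allele: the long human allele is an insertion.
ins_call = classify_indel("", "", "ACT")
# Chimpanzee carries the long allele: the short human allele is a deletion.
del_call = classify_indel("ACT", "", "ACT")
# Chimpanzee matches neither human allele: ancestral state left unresolved.
unk_call = classify_indel("GG", "", "ACT")
```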
Population genetic analysis
Because the number of segregating sites discovered in a sample is highly dependent on sample size, sequence diversity was summarized using two commonly used moment estimators: (1) Watterson's estimator (θ_W) (65), which is an estimate of the expected per-site nucleotide heterozygosity based on the number of segregating sites and the sample size, and (2) Tajima's estimator (π) (66), which computes the frequency with which any two random sequences in the sample differ at a site. For each population (AD, ED and PDR panel), values of θ_W and π were calculated using indels and substitutions separately. The test statistic Tajima's D (67) was calculated and used to compare the two summary measures and to examine characteristics of the allele frequency spectra.
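For readers who want to see how these three quantities relate, here is a minimal sketch using the standard textbook formulas for Watterson's estimator, Tajima's estimator and Tajima's D at diallelic sites. It is not the program used in the study. The function name `diversity_summary` and its arguments are hypothetical; `allele_counts` holds, for each segregating site, the number of sampled chromosomes carrying one of the two alleles.

```python
from math import sqrt


def diversity_summary(allele_counts, n, seq_len=1):
    """Return (Watterson's estimator, Tajima's estimator, Tajima's D).

    allele_counts -- count of one allele at each segregating diallelic site
    n             -- number of chromosomes sampled
    seq_len       -- number of sites surveyed; the two diversity estimates
                     are divided by it to give per-site values
    """
    s = len(allele_counts)                      # number of segregating sites
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))

    theta_w = s / a1                            # Watterson's estimator
    # Mean number of pairwise differences, summed over sites.
    pi = sum(2.0 * k * (n - k) / (n * (n - 1)) for k in allele_counts)

    # Variance constants for Tajima's D.
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * s + e2 * s * (s - 1)
    tajima_d = (pi - theta_w) / sqrt(variance) if variance > 0 else float("nan")

    return theta_w / seq_len, pi / seq_len, tajima_d


# Four chromosomes, two segregating sites (one singleton, one at 2/4).
theta_small, pi_small, d_small = diversity_summary([1, 2], n=4)
# Ten chromosomes, five singletons: an excess of rare variants.
theta_rare, pi_rare, d_rare = diversity_summary([1, 1, 1, 1, 1], n=10)
```

An excess of low-frequency variants pulls the pairwise estimate below Watterson's estimate and makes D negative, which is the kind of skew the allele frequency spectrum comparison is meant to detect.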
In order to compare the degree of association (LD) between marker pairs containing indels with that between pairs containing only substitutions, two descriptive statistical measures of association, |D'| and r² (11,68,69), were computed for the different populations. All of the above statistics were calculated using the computer programs provided by the Kruglyak Lab (http://www.fhcrc.org/labs/kruglyak). These programs use an EM algorithm-based approach to infer the haplotype frequencies for each pair of markers, which are required to calculate the LD measures.
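The pairwise calculation can be sketched as follows. This is a generic two-locus EM written for illustration, not the Kruglyak Lab programs. It assumes unphased genotypes coded as the number of copies (0, 1 or 2) of an arbitrarily chosen allele at each of two diallelic markers, with no missing data. The function names are hypothetical.

```python
def em_haplotype_freqs(genotypes, max_iter=200, tol=1e-12):
    """Estimate frequencies of haplotypes 00, 01, 10, 11 by EM.

    genotypes -- list of (g1, g2) pairs, each value 0, 1 or 2
    Only double heterozygotes (1, 1) have ambiguous phase.
    """
    base = [0.0, 0.0, 0.0, 0.0]   # haplotype counts from unambiguous genotypes
    n_double_het = 0
    for g1, g2 in genotypes:
        if g1 == 1 and g2 == 1:
            n_double_het += 1
        elif g1 == 1:             # heterozygous at marker 1 only
            b = g2 // 2
            base[0 * 2 + b] += 1
            base[1 * 2 + b] += 1
        elif g2 == 1:             # heterozygous at marker 2 only
            a = g1 // 2
            base[a * 2 + 0] += 1
            base[a * 2 + 1] += 1
        else:                     # homozygous at both markers
            base[(g1 // 2) * 2 + (g2 // 2)] += 2

    total = 2.0 * len(genotypes)
    p = [0.25, 0.25, 0.25, 0.25]
    for _ in range(max_iter):
        # E-step: probability that a double heterozygote is 00/11 (cis).
        cis, trans = p[0] * p[3], p[1] * p[2]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: update frequencies from expected haplotype counts.
        counts = [base[0] + n_double_het * w,
                  base[1] + n_double_het * (1 - w),
                  base[2] + n_double_het * (1 - w),
                  base[3] + n_double_het * w]
        new_p = [c / total for c in counts]
        converged = max(abs(x - y) for x, y in zip(p, new_p)) < tol
        p = new_p
        if converged:
            break
    return p


def ld_measures(genotypes):
    """Return (|D'|, r^2) for a pair of diallelic markers."""
    p00, p01, p10, p11 = em_haplotype_freqs(genotypes)
    pa = p10 + p11                # frequency of allele 1 at marker 1
    pb = p01 + p11                # frequency of allele 1 at marker 2
    denom = pa * (1 - pa) * pb * (1 - pb)
    if denom <= 0:
        raise ValueError("both markers must be polymorphic")
    d = p11 - pa * pb
    if d >= 0:
        d_max = min(pa * (1 - pb), (1 - pa) * pb)
    else:
        d_max = min(pa * pb, (1 - pa) * (1 - pb))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    return d_prime, d * d / denom


# Markers that always co-occur (complete association).
linked = ld_measures([(0, 0), (1, 1), (2, 2), (0, 0), (1, 1)])
# All four haplotypes equally frequent (no association).
unlinked = ld_measures([(0, 0), (2, 2), (0, 2), (2, 0)])
```

Because only double heterozygotes carry phase uncertainty, the E-step reduces to a single weight per iteration, which keeps the pairwise computation cheap even across many marker pairs.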
| SUPPLEMENTARY MATERIAL |
|---|
Supplementary Material is available at HMG Online.
| ACKNOWLEDGEMENTS |
|---|
We would like to thank the members of the laboratory of D.A.N. for their efforts in discovery and cataloging of the variation data: Q. Yi, D. Carrington, C. Hastings, E. Calhoun, J. Smith, T. Shaffer, M. Ozuna, S. Da Ponte, N. Rajkumar, M. Wong, P. Keyes, C. Poel, B. Borrayo, M. Montoya, E. Torskey, M. Wook-Chung, D. Nguyen, K. Sherwood, M. Daniels, C. Nguyen, B. Howie, P. Lee, R. Mackelprang, P. Robertson, W. Shackwitz, A. Sherwood, A. Olson, J.T. Jackson, T. Ritchie, B. Leithauser, J. Sloan, E. Toth, L. Witrak, S. Kuldanek and T. Armel. We would also like to thank J. Akey, C. Carlson, D. Crawford and R. Mackelprang for critical reading of this manuscript. This work was supported by grants from the National Heart Lung and Blood Institute PGA (U01 HL66682) (D.A.N. and M.J.R.) and the National Institute of Environmental Health Sciences (NO1 ES-15478) (D.A.N. and M.J.R.).
| REFERENCES |
|---|
- Patil, N., Berno, A.J., Hinds, D.A., Barrett, W.A., Doshi, J.M., Hacker, C.R., Kautzer, C.R., Lee, D.H., Marjoribanks, C., McDonough, D.P. et al. (2001) Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science, 294, 1719–1723.
- Dawson, E., Abecasis, G.R., Bumpstead, S., Chen, Y., Hunt, S., Beare, D.M., Pabial, J., Dibling, T., Tinsley, E., Kirby, S. et al. (2002) A first-generation linkage disequilibrium map of human chromosome 22. Nature, 418, 544–548.
- The International HapMap Consortium (2003) The International HapMap Project. Nature, 426, 789–796.
- Phillips, M.S., Lawrence, R., Sachidanandam, R., Morris, A.P., Balding, D.J., Donaldson, M.A., Studebaker, J.F., Ankener, W.M., Alfisi, S.V., Kuo, F.S. et al. (2003) Chromosome-wide distribution of haplotype blocks and the role of recombination hot spots. Nat. Genet., 33, 382–387.
- Ke, X., Hunt, S., Tapper, W., Lawrence, R., Stavrides, G., Ghori, J., Whittaker, P., Collins, A., Morris, A.P., Bentley, D. et al. (2004) The impact of SNP density on fine-scale patterns of linkage disequilibrium. Hum. Mol. Genet., 13, 577–588.
- Venter, J.C., Adams, M.D., Myers, E.W., Li, P.W., Mural, R.J., Sutton, G.G., Smith, H.O., Yandell, M., Evans, C.A., Holt, R.A. et al. (2001) The sequence of the human genome. Science, 291, 1304–1351.
- Sachidanandam, R., Weissman, D., Schmidt, S.C., Kakol, J.M., Stein, L.D., Marth, G., Sherry, S., Mullikin, J.C., Mortimore, B.J., Willey, D.L. et al. (2001) A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature, 409, 928–933.
- Stephens, J.C., Schneider, J.A., Tanguay, D.A., Choi, J., Acharya, T., Stanley, S.E., Jiang, R., Messer, C.J., Chew, A., Han, J.H. et al. (2001) Haplotype variation and linkage disequilibrium in 313 human genes. Science, 293, 489–493.
- Gabriel, S.B., Schaffner, S.F., Nguyen, H., Moore, J.M., Roy, J., Blumenstiel, B., Higgins, J., DeFelice, M., Lochner, A., Faggart, M. et al. (2002) The structure of haplotype blocks in the human genome. Science, 296, 2225–2229.
- Clark, A.G., Nielsen, R., Signorovitch, J., Matise, T.C., Glanowski, S., Heil, J., Winn-Deen, E.S., Holden, A.L. and Lai, E. (2003) Linkage disequilibrium and inference of ancestral recombination in 538 single-nucleotide polymorphism clusters across the human genome. Am. J. Hum. Genet., 73, 285–300.
- Pritchard, J.K. and Przeworski, M. (2001) Linkage disequilibrium in humans: models and data. Am. J. Hum. Genet., 69, 1–14.
- Weber, J.L., David, D., Heil, J., Fan, Y., Zhao, C. and Marth, G. (2002) Human diallelic insertion/deletion polymorphisms. Am. J. Hum. Genet., 71, 854–862.
- Picoult-Newberg, L., Ideker, T.E., Pohl, M.G., Taylor, S.L., Donaldson, M.A., Nickerson, D.A. and Boyce-Jacino, M. (1999) Mining SNPs from EST databases. Genome Res., 9, 167–174.
- Schmid, K.J., Sorensen, T.R., Stracke, R., Torjek, O., Altmann, T., Mitchell-Olds, T. and Weisshaar, B. (2003) Large-scale identification and analysis of genome-wide single-nucleotide polymorphisms for mapping in Arabidopsis thaliana. Genome Res., 13, 1250–1257.
- Taillon-Miller, P., Gu, Z., Li, Q., Hillier, L. and Kwok, P.Y. (1998) Overlapping genomic sequences: a treasure trove of single-nucleotide polymorphisms. Genome Res., 8, 748–754.
- Batley, J., Barker, G., O'Sullivan, H., Edwards, K.J. and Edwards, D. (2003) Mining for single nucleotide polymorphisms and insertions/deletions in maize expressed sequence tag data. Plant Physiol., 132, 84–91.
- Collins, F.S., Brooks, L.D. and Chakravarti, A. (1998) A DNA polymorphism discovery resource for research on human genetic variation. Genome Res., 8, 1229–1231.
- Nickerson, D.A., Tobe, V.O. and Taylor, S.L. (1997) PolyPhred: automating the detection and genotyping of single nucleotide substitutions using fluorescence-based resequencing. Nucleic Acids Res., 25, 2745–2751.
- Kwok, P.Y., Carlson, C., Yager, T.D., Ankener, W. and Nickerson, D.A. (1994) Comparative analysis of human DNA variations by fluorescence-based sequencing of PCR products. Genomics, 23, 138–144.
- Parker, L.T., Zakeri, H., Deng, Q., Spurgeon, S., Kwok, P.Y. and Nickerson, D.A. (1996) AmpliTaq DNA polymerase, FS dye-terminator sequencing: analysis of peak height patterns. Biotechniques, 21, 694–699.
- Miller, R.D., Taillon-Miller, P. and Kwok, P.Y. (2001) Regions of low single-nucleotide polymorphism incidence in human and orangutan Xq: deserts and recent coalescences. Genomics, 71, 78–88.
- Marth, G.T., Czabarka, E., Murvai, J. and Sherry, S.T. (2004) The allele frequency spectrum in genome-wide human variation data reveals signals of differential demographic history in three large world populations. Genetics, 166, 351–372.
- Bamshad, M. and Wooding, S.P. (2003) Signatures of natural selection in the human genome. Nat. Rev. Genet., 4, 99–111.
- Kruglyak, L. and Nickerson, D.A. (2001) Variation is the spice of life. Nat. Genet., 27, 234–236.
- Kimura, M. (1969) The rate of molecular evolution considered from the standpoint of population genetics. Proc. Natl Acad. Sci. USA, 63, 1181–1188.
- Petrov, D.A. (2002) Mutational equilibrium model of genome size evolution. Theor. Popul. Biol., 61, 531–544.
- Boissinot, S., Chevret, P. and Furano, A.V. (2000) L1 (LINE-1) retrotransposon evolution and amplification in recent human history. Mol. Biol. Evol., 17, 915–928.
- Batzer, M.A. and Deininger, P.L. (2002) Alu repeats and human genomic diversity. Nat. Rev. Genet., 3, 370–379.
- Wall, J.D. and Pritchard, J.K. (2003) Haplotype blocks and linkage disequilibrium in the human genome. Nat. Rev. Genet., 4, 587–597.
- Reich, D.E., Cargill, M., Bolk, S., Ireland, J., Sabeti, P.C., Richter, D.J., Lavery, T., Kouyoumjian, R., Farhadian, S.F., Ward, R. et al. (2001) Linkage disequilibrium in the human genome. Nature, 411, 199–204.
- Watterson, G.A. and Guess, H.A. (1977) Is the most frequent allele the oldest? Theor. Popul. Biol., 11, 141–160.
- Wilson, J.F. and Goldstein, D.B. (2000) Consistent long-range linkage disequilibrium generated by admixture in a Bantu-Semitic hybrid population. Am. J. Hum. Genet., 67, 926–935.
- Pritchard, J.K. and Rosenberg, N.A. (1999) Use of unlinked genetic markers to detect population stratification in association studies. Am. J. Hum. Genet., 65, 220–228.
- Ardlie, K.G., Kruglyak, L. and Seielstad, M. (2002) Patterns of linkage disequilibrium in the human genome. Nat. Rev. Genet., 3, 299–309.
- Crawford, D.C., Bhangale, T., Li, N., Hellenthal, G., Rieder, M.J., Nickerson, D.A. and Stephens, M. (2004) Evidence for substantial fine-scale variation in recombination rates across the human genome. Nat. Genet., 36, 700–706.
- Carlson, C.S., Eberle, M.A., Rieder, M.J., Yi, Q., Kruglyak, L. and Nickerson, D.A. (2004) Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. Am. J. Hum. Genet., 74, 106–120.
- Dawson, E., Chen, Y., Hunt, S., Smink, L.J., Hunt, A., Rice, K., Livingston, S., Bumpstead, S., Bruskiewich, R., Sham, P. et al. (2001) A SNP resource for human chromosome 22: extracting dense clusters of SNPs from the genomic sequence. Genome Res., 11, 170–178.
- Krawczak, M., Ball, E.V., Fenton, I., Stenson, P.D., Abeysinghe, S., Thomas, N. and Cooper, D.N. (2000) Human gene mutation database – a biomedical information and research resource. Hum. Mutat., 15, 45–51.
- Robledo, R., Beggs, W. and Bender, P. (2003) A simple and cost-effective method for rapid genotyping of insertion/deletion polymorphisms. Genomics, 82, 580–582.
- Syvanen, A.C. (2001) Accessing genetic variation: genotyping single nucleotide polymorphisms. Nat. Rev. Genet., 2, 930–942.
- Zhang, Z. and Gerstein, M. (2003) Patterns of nucleotide substitution, insertion and deletion in the human genome inferred from pseudogenes. Nucleic Acids Res., 31, 5338–5348.
- Robertson, H.M. (2000) The large srh family of chemoreceptor genes in Caenorhabditis nematodes reveals processes of genome evolution involving large duplications and deletions and intron gains and losses. Genome Res., 10, 192–203.
- Graur, D., Shuali, Y. and Li, W.H. (1989) Deletions in processed pseudogenes accumulate faster in rodents than in humans. J. Mol. Evol., 28, 279–285.
- Petrov, D.A. and Hartl, D.L. (1998) High rate of DNA loss in the Drosophila melanogaster and Drosophila virilis species groups. Mol. Biol. Evol., 15, 293–302.
- Sebat, J., Lakshmi, B., Troge, J., Alexander, J., Young, J., Lundin, P., Maner, S., Massa, H., Walker, M., Chi, M. et al. (2004) Large-scale copy number polymorphism in the human genome. Science, 305, 525–528.
- Armour, J.A., Anttinen, T., May, C.A., Vega, E.E., Sajantila, A., Kidd, J.R., Kidd, K.K., Bertranpetit, J., Paabo, S. and Jeffreys, A.J. (1996) Minisatellite diversity supports a recent African origin for modern humans. Nat. Genet., 13, 154–160.
- Harpending, H. and Rogers, A. (2000) Genetic perspectives on human origins and differentiation. Annu. Rev. Genomics Hum. Genet., 1, 361–385.
- Watkins, W.S., Ricker, C.E., Bamshad, M.J., Carroll, M.L., Nguyen, S.V., Batzer, M.A., Harpending, H.C., Rogers, A.R. and Jorde, L.B. (2001) Patterns of ancestral human diversity: an analysis of Alu-insertion and restriction-site polymorphisms. Am. J. Hum. Genet., 68, 738–752.
- Lin, S.C., Chung, M.Y., Huang, J.W., Shieh, T.M., Liu, C.J. and Chang, K.W. (2004) Correlation between functional genotypes in the matrix metalloproteinases-1 promoter and risk of oral squamous cell carcinomas. J. Oral. Pathol. Med., 33, 323–326.
- Karban, A.S., Okazaki, T., Panhuysen, C.I., Gallegos, T., Potter, J.J., Bailey-Wilson, J.E., Silverberg, M.S., Duerr, R.H., Cho, J.H., Gregersen, P.K. et al. (2004) Functional annotation of a novel NFKB1 promoter polymorphism that increases risk for ulcerative colitis. Hum. Mol. Genet., 13, 35–45.
- Salem, A.H., Kilroy, G.E., Watkins, W.S., Jorde, L.B. and Batzer, M.A. (2003) Recently integrated Alu elements and human genomic diversity. Mol. Biol. Evol., 20, 1349–1361.
- Kruglyak, L. (1999) Prospects for whole-genome linkage disequilibrium mapping of common disease genes. Nat. Genet., 22, 139–144.
- Kruglyak, L. (1997) The use of a genetic map of biallelic markers in linkage studies. Nat. Genet., 17, 21–24.
- Zondervan, K.T. and Cardon, L.R. (2004) The complex interplay among factors that influence allelic association. Nat. Rev. Genet., 5, 89–100.
- Xiong, M. and Jin, L. (1999) Comparison of the power and accuracy of biallelic and microsatellite markers in population-based gene-mapping methods. Am. J. Hum. Genet., 64, 629–640.
- Frisse, L., Hudson, R.R., Bartoszewicz, A., Wall, J.D., Donfack, J. and Di Rienzo, A. (2001) Gene conversion and different population histories may explain the contrast between polymorphism and linkage disequilibrium levels. Am. J. Hum. Genet., 69, 831–843.
- Tishkoff, S.A., Dietzsch, E., Speed, W., Pakstis, A.J., Kidd, J.R., Cheung, K., Bonne-Tamir, B., Santachiara-Benerecetti, A.S., Moral, P. and Krings, M. (1996) Global pat


