Iyer et al., Nature 2019 [https://www.nature.com/articles/s41586-019-1076-8]
(This page was regenerated with R Markdown on Apr 8 2024 and uses updated JavaScript libraries compared to the initial version.)
Note: The expanded searchable/sortable/filterable version of Supplementary Table 3 from the paper appears as Table 3 below.
The results below are based mostly on the files of “coding” variants from gnomAD genomes and exomes, version 2.0.2 (https://console.cloud.google.com/storage/browser/gnomad-public/release/2.0.2/).
These consist of all variants in the intervals used for ExAC (ftp://ftp.broadinstitute.org/pub/ExAC_release/release1/resources/).
Most of these intervals correspond to exons plus 50 flanking bases on each side, and they collectively cover 60 million bases, about 2% of the genome. Note that there are no variant calls for the Y chromosome, and these are not strictly all coding variants, as some are in introns, UTRs, miRNA, ncRNA.
The 1000 Genomes Project data were taken from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/. The vcf files there include precomputed allele frequencies for only five broad super-populations; the allele frequencies for the 26 more specific populations were computed from the per-individual genotypes in the vcf files, aggregated using the population assignments from the file integrated_call_samples_v3.20130502.ALL.panel .
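The per-population aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the code used for this page: it assumes the panel file has whitespace-separated columns sample, pop, super_pop and gender with a header line, and that genotypes for one site are available as strings such as "0|1". The function names and the sample IDs in the usage lines are hypothetical.

```python
from collections import defaultdict


def read_panel(lines):
    """Map sample ID -> population code from panel-style lines.

    Assumption: columns are sample, pop, super_pop, gender, and the
    header line starts with the word 'sample'.
    """
    sample_to_pop = {}
    for line in lines:
        fields = line.split()
        if not fields or fields[0] == "sample":
            continue  # skip blank lines and the header
        sample_to_pop[fields[0]] = fields[1]
    return sample_to_pop


def population_allele_frequencies(samples, genotypes, sample_to_pop, alt_index=1):
    """Return {population: allele frequency} for one ALT allele at one site."""
    alt_count = defaultdict(int)
    total = defaultdict(int)
    for sample, gt in zip(samples, genotypes):
        pop = sample_to_pop.get(sample)
        if pop is None:
            continue  # sample not listed in the panel
        for allele in gt.replace("/", "|").split("|"):
            if allele == ".":
                continue  # missing call contributes no chromosome
            total[pop] += 1
            if int(allele) == alt_index:
                alt_count[pop] += 1
    return {pop: alt_count[pop] / n for pop, n in total.items() if n > 0}


# Hypothetical three-sample example
panel = read_panel([
    "sample\tpop\tsuper_pop\tgender",
    "S1\tGBR\tEUR\tmale",
    "S2\tGBR\tEUR\tfemale",
    "S3\tYRI\tAFR\tfemale",
])
print(population_allele_frequencies(["S1", "S2", "S3"], ["0|1", "1|1", "0|0"], panel))
# prints {'GBR': 0.75, 'YRI': 0.0}
```

Counting alleles per chromosome rather than per sample means haploid calls (for example on the X chromosome in males) contribute a single allele to the denominator.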
The ClinVar annotations were taken from the file ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/archive_2.0/2018/clinvar_20180225.vcf.gz . Note that matching up variants from two different sources (e.g. ClinVar and gnomAD) can sometimes be tricky, particularly for indels and multi-allelic sites, since the same variant may have multiple representations. (The gnomAD vcf files themselves have ClinVar annotations included, as part of the Ensembl VEP output, but these seem to include many spurious, missing or out-of-date annotations; likewise for the ExAC and TGP allele frequencies included in the ClinVar vcf file.) Here the variants have been decomposed and normalized (trimmed and left-aligned) (vt 0.5772) for the purpose of matching them up (bcftools 1.9). Note, however, that the HGVS notation used at ClinVar may follow the right-aligned (3’-most position) convention, in which duplications are taken to occur immediately after the repeated sequence rather than immediately before it.
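To illustrate why normalization matters for matching, here is a minimal sketch of the standard trim-and-left-align procedure for a single biallelic variant. It is not the vt or bcftools implementation, and it does not cover decomposition of multi-allelic sites; the function name and the toy reference sequence are assumptions made for the example.

```python
def normalize_variant(ref_seq, pos, ref, alt):
    """Trim and left-align one biallelic variant.

    ref_seq: reference sequence of the contig (string)
    pos:     1-based position of the first base of `ref`
    Returns (pos, ref, alt) in normalized form.
    """
    if ref == alt:
        raise ValueError("REF and ALT are identical")
    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, pad both on the left with the
        # preceding reference base, which shifts the variant leftward.
        if not ref or not alt:
            if pos == 1:
                raise ValueError("cannot extend left of sequence start")
            pos -= 1
            base = ref_seq[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Drop shared leading bases, keeping at least one base per allele.
    while len(ref) >= 2 and len(alt) >= 2 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Toy reference: positions 1..9 are G G C A C A C A T
toy = "GGCACACAT"
# A right-aligned and a padded representation of the same "CA" duplication
print(normalize_variant(toy, 8, "A", "ACA"))      # prints (2, 'G', 'GCA')
print(normalize_variant(toy, 3, "CAC", "CACAC"))  # prints (2, 'G', 'GCA')
```

Both inputs describe the same edited sequence, and only after normalization do they compare equal on position, REF and ALT.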
The gnomAD genome files above contain a total of 4851138 distinct variant alleles, of which 145892 (~3%) are insertions. The gnomAD exome files above contain a total of 17009588 distinct variant alleles, of which 414576 (~2.4%) are insertions. Note that many of these variants are common to both the exomes and genomes, but in the tables below variants that occur in both are counted only once.
This table focuses on the insertions, and in particular the duplications. The second column (insertions) gives the counts of all the distinct insertion variant alleles, binned by the length of the insertion (length), with all variants of length at least 40 combined into one bin. Subsequent columns give the number of variants that satisfy additional criteria, as follows:
“Simple” duplications: We will refer to those duplications that satisfy conditions dup2 and dup2i as “simple” duplications. Although neither condition is strictly necessary for using a Microhomology-Mediated End-Joining (MMEJ) strategy for correction, they can reduce some potential difficulties with the nuclease cleaving the wild-type or already-corrected alleles as well as the targeted duplication-containing allele(s), or having multiple repair products with different numbers of repeats or sub-repeats collapsed. Note that these conditions exclude repeat-expansion disorders such as the polyQ/CAG expansion in Huntington’s disease, and the C9orf72 hexanucleotide expansion in FTD and/or ALS, in which shortening of the expansions to below critical thresholds may be more relevant than repair to a single precise wild-type length:
(The three variants above do not appear to be included in the ClinVar vcf at all. The ClinVar webpages assign them variant type “Microsatellite”, some of which are included in the vcf, so perhaps their absence is due to the lack of a single fixed-length ALT allele in their HGVS descriptions.)
Simple duplications were identified using a modified version of the function annotate_indels from the vt tool set; this and other custom code can be found here.
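As a rough illustration of the general idea only, the sketch below tests whether a VCF-style insertion is a tandem duplication of adjacent reference sequence and assigns it to a length bin like those in the table. It does not reproduce the modified annotate_indels function or the dup2 and dup2i conditions; the function names and the toy sequence are hypothetical.

```python
def inserted_sequence(ref, alt):
    """Return the inserted bases if ALT is REF plus extra bases, else None."""
    if len(alt) > len(ref) and alt.startswith(ref):
        return alt[len(ref):]
    return None


def is_tandem_duplication(ref_seq, pos, ref, alt):
    """True if the inserted bases copy the reference sequence directly
    after or directly before the insertion point (pos is 1-based)."""
    ins = inserted_sequence(ref, alt)
    if ins is None:
        return False
    end = pos - 1 + len(ref)  # 0-based index just past the REF allele
    n = len(ins)
    after = ref_seq[end:end + n]
    before = ref_seq[max(0, end - n):end]
    return ins == after or ins == before


def length_bin(n, cap=40):
    """Bin label for an insertion length, pooling lengths >= cap."""
    return str(n) if n < cap else "%d+" % cap


toy = "GGCACACAT"
print(is_tandem_duplication(toy, 2, "G", "GCA"))  # prints True
print(is_tandem_duplication(toy, 2, "G", "GTT"))  # prints False
print(length_bin(3), length_bin(57))              # prints 3 40+
```

Checking both flanks makes the test insensitive to whether the insertion was written in left-aligned or right-aligned form.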
length | insertions | dup | dup2 | dup2i | dup2iC | dup2iL | dup2iP |
---|---|---|---|---|---|---|---|
1 | 210230 | 179654 | 59169 | 59169 | 399 | 242 | 182 |
2 | 51418 | 29880 | 11919 | 7562 | 53 | 25 | 19 |
3 | 39579 | 23795 | 12892 | 11141 | 77 | 11 | 4 |
4 | 30704 | 18835 | 14615 | 13010 | 112 | 70 | 52 |
5 | 15142 | 6890 | 4754 | 4189 | 28 | 16 | 10 |
6 | 18971 | 11102 | 6125 | 5251 | 46 | 7 | 3 |
7 | 9634 | 3793 | 2976 | 2623 | 10 | 5 | 4 |
8 | 9123 | 3819 | 3038 | 2739 | 12 | 9 | 7 |
9 | 9818 | 5155 | 3979 | 3686 | 17 | 3 | 2 |
10 | 5756 | 1997 | 1683 | 1502 | 12 | 9 | 8 |
11 | 4326 | 1311 | 1236 | 1195 | 10 | 6 | 4 |
12 | 6249 | 3384 | 2957 | 2649 | 18 | 3 | 0 |
13 | 3207 | 1099 | 1068 | 1042 | 7 | 4 | 2 |
14 | 3068 | 1031 | 993 | 942 | 5 | 2 | 1 |
15 | 4307 | 2311 | 2190 | 2110 | 19 | 7 | 3 |
16 | 2813 | 1173 | 1128 | 1086 | 8 | 4 | 4 |
17 | 2438 | 1099 | 1069 | 1067 | 9 | 7 | 6 |
18 | 4316 | 2646 | 2552 | 2459 | 14 | 4 | 4 |
19 | 2065 | 1012 | 997 | 997 | 5 | 3 | 1 |
20 | 2148 | 1082 | 1045 | 1001 | 6 | 5 | 3 |
21 | 3463 | 2218 | 2141 | 2127 | 11 | 2 | 1 |
22 | 1687 | 818 | 806 | 799 | 1 | 0 | 0 |
23 | 1395 | 690 | 670 | 670 | 3 | 3 | 3 |
24 | 2272 | 1283 | 1244 | 1221 | 7 | 2 | 1 |
25 | 1149 | 485 | 477 | 471 | 1 | 1 | 1 |
26 | 1006 | 356 | 353 | 350 | 3 | 0 | 0 |
27 | 1373 | 653 | 635 | 631 | 6 | 1 | 0 |
28 | 878 | 314 | 308 | 304 | 3 | 1 | 1 |
29 | 751 | 239 | 233 | 233 | 1 | 1 | 0 |
30 | 1321 | 579 | 549 | 536 | 1 | 0 | 0 |
31 | 693 | 194 | 189 | 189 | 1 | 1 | 1 |
32 | 695 | 193 | 187 | 182 | 0 | 0 | 0 |
33 | 772 | 272 | 263 | 262 | 3 | 0 | 0 |
34 | 590 | 169 | 164 | 157 | 0 | 0 | 0 |
35 | 528 | 121 | 117 | 116 | 0 | 0 | 0 |
36 | 743 | 244 | 236 | 225 | 1 | 0 | 0 |
37 | 457 | 106 | 102 | 102 | 1 | 0 | 0 |
38 | 474 | 122 | 115 | 113 | 0 | 0 | 0 |
39 | 524 | 149 | 140 | 140 | 2 | 0 | 0 |
40+ | 12413 | 1818 | 1800 | 1756 | 7 | 1 | 1 |
Here are the totals from the table above:
length | insertions | dup | dup2 | dup2i | dup2iC | dup2iL | dup2iP |
---|---|---|---|---|---|---|---|
all | 468496 | 312091 | 147114 | 136004 | 919 | 455 | 328 |
Below is a barplot illustrating the stratification of the 468496 insertions from the table above into progressively finer subcategories. For simplicity the two levels restricting to dup2 and dup2i have been combined into a single level restricting to “simple” duplications, and the level dup2iL has been omitted between the restrictions to variants listed in ClinVar (dup2iC) and to variants listed as Pathogenic or Pathogenic/Likely_pathogenic in ClinVar (dup2iP).
Below is a more “ClinVar-centric” view of the insertions, beginning with the 5465 insertions annotated as “Pathogenic” or “Pathogenic/Likely_pathogenic” in ClinVar and stratifying them into progressively finer subcategories. The final level (those variants also observed at least once in gnomAD exome or genome “coding” regions) contains the same set of duplications as the final level above.
(Note that the two plots above differ from Extended Data Fig 10 in the manuscript in that they also include insertions of length 1. Note also that the counts of pathogenic insertions dip for lengths 3, 6, 9 …, as expected since multiples of three may result in in-frame rather than frame-shift mutations to proteins.)
Below are the PAM sequences that are scanned for, taken from the
table at https://www.addgene.org/crispr/guide/#pam-table and from
Hu et al (Nature 2018; doi:10.1038/nature26155) [see also http://blog.addgene.org/xcas9-engineering-a-crispr-variant-with-pam-flexibility].
In the column Expanded the IUPAC codes in the column
Pattern are expanded into a regular expression, with
multiple allowed bases enclosed by brackets. (Note that the R regular
expression scanning function gregexpr can return multiple
hits per scanned sequence, but will not return overlapping hits; this
uses a modified version, which will return the positions of all matches,
even if they overlap.) The column Legend shows the
characters used to indicate start positions of PAM sequences in sequence
schematics; upper-case characters indicate that this is the only PAM
sequence beginning at the position, and lower-case characters indicate
that there are additional PAM sequences beginning at the position (but
only the character for the first is shown). The column
Side indicates to which side of the guide RNA the PAM
sequence is located.
Legend | Species_and_Variant_of_Cas9 | Side | Pattern | Expanded | CleavageSite | Width |
---|---|---|---|---|---|---|
A/a | xCas9_NG | 3’ | NG | [ACGT]G | -3 from start | 2 |
B/b | xCas9_GAA | 3’ | GAA | GAA | -3 from start | 3 |
C/c | xCas9_GAT | 3’ | GAT | GAT | -3 from start | 3 |
D/d | SpCas9 | 3’ | NGG | [ACGT]GG | -3 from start | 3 |
E/e | SpCas9 VRER variant | 3’ | NGCG | [ACGT]GCG | -3 from start | 4 |
F/f | SpCas9 EQR variant | 3’ | NGAG | [ACGT]GAG | -3 from start | 4 |
G/g | SpCas9 VQR variant | 3’ | NGAN\|NGNG | [ACGT]GA[ACGT]\|[ACGT]G[ACGT]G | -3 from start | 4 |
H/h | SaCas9 | 3’ | NNGRRT | [ACGT][ACGT]G[AG][AG]T | -3 from start | 6 |
I/i | NMe1 | 3’ | NNNNGATT | [ACGT][ACGT][ACGT][ACGT]GATT | -3 from start | 8 |
J/j | CjeCas9 | 3’ | NNNNRYAC | [ACGT][ACGT][ACGT][ACGT][AG][CT]AC | -3 from start | 8 |
K/k | AsCpf1 and LbCpf1 | 5’ | TTTV | TTT[ACG] | approx +18 from end | 4 |
M/m | AsCpf1 and LbCpf1 RR variant | 5’ | TYCV | T[CT]C[ACG] | approx +18 from end | 4 |
N/n | AsCpf1 RVR variant | 5’ | TATV | TAT[ACG] | approx +18 from end | 4 |
O/o | FnCpf1 | 5’ | TTV | TT[ACG] | approx +18 from end | 3 |
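The expansion of IUPAC codes and the overlapping scan described above can be sketched in Python as follows. This is an illustration rather than the page's modified R function: it uses a zero-width lookahead to obtain overlapping hits, which is a different technique from modifying gregexpr, and it reports 0-based start positions. The function names are hypothetical.

```python
import re

# Standard IUPAC nucleotide codes expanded to bracketed character sets.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def expand_iupac(pattern):
    """Expand a PAM pattern such as 'NGG' or 'NGAN|NGNG' into a regex."""
    return "|".join(
        "".join(IUPAC[code] for code in alternative)
        for alternative in pattern.split("|")
    )


def find_all_matches(expanded, sequence):
    """Return 0-based start positions of every match, overlaps included."""
    # The lookahead consumes no characters, so the scan advances one
    # base at a time and reports each position where a match begins.
    lookahead = re.compile("(?=(?:%s))" % expanded)
    return [m.start() for m in lookahead.finditer(sequence)]


print(expand_iupac("NNGRRT"))                            # prints [ACGT][ACGT]G[AG][AG]T
print(find_all_matches(expand_iupac("NGG"), "AGGGG"))    # prints [0, 1, 2]
```

A plain non-overlapping scan of "AGGGG" for NGG would report only the hit at position 0, since the remaining "GG" is too short to match.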
Below is a table of the duplication variants from the final column (dup2iP) of the table above, which are annotated as “Pathogenic” or “Pathogenic/Likely_pathogenic” in ClinVar and are observed in the gnomAD exome (EX_ columns) or genome (GN_ columns) coding regions. Allele frequencies from the 1000 Genomes Project (TGP_ columns) are also included, but only a few pathogenic duplications were observed there, all of which were also observed in gnomAD.
The columns are as follows:
A lot of duplications, no doubt — perhaps around 98% of them — but it does not appear that any of these are annotated as “Pathogenic” in ClinVar. Certainly there are many variants listed in ClinVar that are not observed in either the gnomAD genomes or exomes, so are not accounted for in the table above, and this includes 2189 duplications that satisfy all the additional conditions for being in column dup2iP above. But 2183 of these are in these “coding” intervals, so if the variants had been observed at all in gnomAD they would have been reported in these vcfs. For the 6 that are not within these intervals, one can check for them in the full (not-just-coding-intervals) gnomAD and TGP vcfs — they are not observed there either. These variants are listed below; they are mainly variants in UTRs or in intronic regions >50 bases from the nearest exons (and hence not in the “coding” intervals list).
SEQ | POS | REF | ALT | INS_LEN | ALLELEID | GENEINFO | CLNSIG | CLNDN | OMIM_MOI |
---|---|---|---|---|---|---|---|---|---|
2 | 47656744 | G | GTGAGCCACTGCGCCCAGCA… | 454 | 95998 | MSH2:4436 | Pathogenic | Lynch syndrome | |
10 | 86018468 | A | AG | 1 | 431743 | RGR:5995 | Pathogenic | Retinal dystrophy | |
17 | 1665408 | G | GT | 1 | 50330 | SERPINF1:5176 | Pathogenic | Osteogenesis imperfecta, type VI | |
17 | 29554163 | G | GAGCTTATCAGGTTCTCCAT… | 337 | 213707 | NF1:4763 | Pathogenic | Neurofibromatosis, type 1 | AD |
17 | 29556697 | T | TGGGTACGAGTGTCTGCGTA… | 412 | 213709 | NF1:4763 | Pathogenic | Neurofibromatosis, type 1 | AD |
20 | 18038617 | A | ACCGGTTCCGGCGGCCGGGG… | 22 | 226677 | OVOL2:58495 | Pathogenic | Posterior polymorphous corneal dystrophy 1 | AD |
It wouldn’t be surprising to miss out on variants that are extremely rare in general, or even not-terribly-rare variants that are concentrated in populations without many samples: with only ~100 subjects per population in the TGP data one would expect to miss out on ~13% of alleles with frequency 0.01 in these populations. And a few other possibilities:
Subjects known to have severe pediatric disease were not included in the gnomAD dataset, so variants that cause these diseases may be under-represented, in particular those with dominant inheritance.
About 8% of the genome was masked during the gnomAD variant calling (e.g. some repetitive sequence), so any ClinVar variants that fall in these regions will not be reported in gnomAD. But it appears that only one of the ~13000 insertions from ClinVar falls in one of these masked regions: a benign variant in SHOX1, which is masked since it’s in a PAR on the Y chromosome.
gnomAD doesn’t report variants on the Y chromosome at all, whether in masked regions or not. But the only “Pathogenic” duplication on the Y chromosome in ClinVar is a single-base insertion, Y:2655380:C/CT in the gene SRY: https://www.ncbi.nlm.nih.gov/clinvar/variation/470195/
The longest insertion reported in the gnomAD coding regions has length 621, and there are 1431 insertions of length at least 100, but it’s possible that the detection sensitivity may decline for longer insertions. Below is a table of the 9 “Pathogenic” duplications with length at least 100 from ClinVar that satisfy the conditions of the column dup2iP above, none of which are observed in gnomAD.
SEQ | POS | REF | ALT | INS_LEN | ALLELEID | GENEINFO | CLNSIG | CLNDN | OMIM_MOI |
---|---|---|---|---|---|---|---|---|---|
2 | 47656744 | G | GTGAGCCACTGCGCCCAGCA… | 454 | 95998 | MSH2:4436 | Pathogenic | Lynch syndrome | |
2 | 145156601 | T | TTGGGAGCTAACGGCTTGGA… | 115 | 448705 | ZEB2:9839 | Pathogenic | Mowat-Wilson syndrome | AD |
2 | 200298058 | T | TACCTTGGGCCTGGGCCGCA… | 177 | 214752 | SATB2:23314 | Pathogenic | Chromosome 2q32-q33 deletion syndrome | AD |
3 | 138665088 | G | GTGCGCGGGCGGCGGCCGGA… | 124 | 354130 | FOXL2:668 | Pathogenic | Blepharophimosis, ptosis, and epicanthus inversus | AD |
7 | 150644477 | C | CCGGGCTGGAGAGGGGGATG… | 275 | 424894 | KCNH2:3757 | Pathogenic | Long QT syndrome 2 | AD |
17 | 29554163 | G | GAGCTTATCAGGTTCTCCAT… | 337 | 213707 | NF1:4763 | Pathogenic | Neurofibromatosis, type 1 | AD |
17 | 29556697 | T | TGGGTACGAGTGTCTGCGTA… | 412 | 213709 | NF1:4763 | Pathogenic | Neurofibromatosis, type 1 | AD |
17 | 41251791 | C | CCCAATTCAATGTAGACAGA… | 1058 | 262984 | BRCA1:672 | Pathogenic | Breast-ovarian cancer, familial 1 | AD/MF |
X | 25023006 | G | GGGGGCCATTGTGGAAAAGA… | 103 | 209011 | ARX:170302 | Pathogenic | Lissencephaly 2, X-linked | |
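The earlier estimate that ~13% of alleles with frequency 0.01 would go unobserved with ~100 subjects per population can be checked with a short calculation. This sketch assumes diploid subjects and independent sampling of chromosomes, so that the chance of never seeing the allele is (1 - f) raised to the number of chromosomes; the function name is hypothetical.

```python
def prob_allele_unobserved(freq, n_subjects, ploidy=2):
    """Probability that an allele of frequency `freq` is absent from
    every sampled chromosome, assuming independent draws."""
    return (1.0 - freq) ** (ploidy * n_subjects)


p = prob_allele_unobserved(0.01, 100)
print(round(p, 3))  # prints 0.134
```

With 200 chromosomes the result is 0.99^200, about 0.134, consistent with the ~13% quoted above.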
Below is a table of “simple” duplications that are annotated in ClinVar as “Pathogenic” or “Pathogenic/Likely pathogenic”, and are associated in ClinVar with a disease annotated as autosomal dominant (AD) in OMIM, but which do not appear in gnomAD: