Microduplications

Website accompanying “Precise therapeutic gene correction by a simple nuclease-induced double-strand break”

Iyer et al., Nature 2019 [https://www.nature.com/articles/s41586-019-1076-8]

(This page was regenerated with R Markdown on Apr 8 2024 and uses updated JavaScript libraries compared to initial version.)

Note: The expanded searchable/sortable/filterable version of Supplementary Table 3 from the paper appears as Table 3 below.

Links to tables that appear below

Table 1: counts of duplications in gnomAD coding regions
Table 2: PAM sequences scanned for in duplications and flanking sequence
Table 3: selected pathogenic duplications from ClinVar that are observed in gnomAD (searchable)
Table 4: selected pathogenic duplications from ClinVar that are outside of gnomAD “coding” regions
Table 5: selected pathogenic duplications of length at least 100 from ClinVar
Table 6: selected pathogenic duplications from ClinVar that are not observed in gnomAD (searchable)

Overview

The results below for the most part are based on the files of “coding” variants from gnomAD genomes and exomes, version 2.0.2 (https://console.cloud.google.com/storage/browser/gnomad-public/release/2.0.2/)

gnomad.genomes.r2.0.2.sites.coding_only.chr1-22.vcf
gnomad.genomes.r2.0.2.sites.coding_only.chrX.vcf
gnomad.exomes.r2.0.2.sites.vcf

These consist of all variants in the intervals used for ExAC (ftp://ftp.broadinstitute.org/pub/ExAC_release/release1/resources/)

exome_calling_regions.v1.interval_list

Most of these intervals correspond to exons plus 50 flanking bases on each side, and they collectively cover 60 million bases, about 2% of the genome. Note that there are no variant calls for the Y chromosome, and these are not strictly all coding variants, as some are in introns, UTRs, miRNA, ncRNA.

The 1000 Genomes Project data were taken from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/. The vcf files there include precomputed allele frequencies for only five broad super-populations; the allele frequencies for the 26 more-specific populations were computed from the per-individual genotypes in the vcf files, aggregated using the population assignments from the file integrated_call_samples_v3.20130502.ALL.panel.
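
As a rough illustration of the per-population aggregation described above, here is a minimal Python sketch (not the code used for this page) that tallies AC, AN and AF for one ALT allele from per-individual genotype strings. The function name, the sample IDs and the two dictionaries are hypothetical stand-ins for the genotype columns of a vcf and the sample-to-population assignments in the panel file.

```python
from collections import defaultdict

def population_allele_frequencies(genotypes, sample_to_pop, alt_index=1):
    """Return {population: (AC, AN, AF)} for one ALT allele at one site.

    genotypes: {sample_id: GT string such as "0|1", "1/1" or "0" (haploid)}
    sample_to_pop: {sample_id: population code}
    """
    ac = defaultdict(int)
    an = defaultdict(int)
    for sample, gt in genotypes.items():
        pop = sample_to_pop[sample]
        for allele in gt.replace("/", "|").split("|"):
            if allele == ".":          # missing call: counts toward neither AC nor AN
                continue
            an[pop] += 1
            if int(allele) == alt_index:
                ac[pop] += 1
    # Populations with no called alleles are simply left out.
    return {pop: (ac[pop], an[pop], ac[pop] / an[pop]) for pop in an}

# Toy data (hypothetical samples assigned to two of the 26 TGP populations)
toy_genotypes = {"S1": "0|1", "S2": "1|1", "S3": "0|0", "S4": "0|."}
toy_panel = {"S1": "PUR", "S2": "PUR", "S3": "GBR", "S4": "GBR"}
toy_af = population_allele_frequencies(toy_genotypes, toy_panel)
```

With so few alleles per population the resulting frequencies are of course very noisy, which is the same caveat that applies to the TGP_ columns below.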

The ClinVar annotations were taken from the file ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/archive_2.0/2018/clinvar_20180225.vcf.gz. Note that matching up variants from two different sources (e.g. ClinVar and gnomAD) can sometimes be tricky, particularly for indels and multi-allelic sites, since the same variant may have multiple representations. (The gnomAD vcf files themselves have ClinVar annotations included, as part of the Ensembl VEP output, but these seem to have many spurious, missing, or out-of-date annotations; likewise for the ExAC and TGP allele frequencies included in the ClinVar vcf file.) Here the variants have been decomposed and normalized (trimmed and left-aligned, using vt 0.5772) for the purpose of matching them up (using bcftools 1.9). Note, however, that the HGVS notation used at ClinVar may follow the right-aligned (3’-most position) convention, in which duplications are taken to occur immediately after the repeated sequence rather than immediately before it.
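
To make “trimmed and left-aligned” concrete, the following is a simplified Python sketch of the usual normalization procedure for a single bi-allelic variant. It is an illustration only, not the vt or bcftools implementation; the function name and the toy reference sequence are made up, and it simply gives up on variants that would need to be shifted past the start of the sequence.

```python
def normalize(ref_seq, pos, ref, alt):
    """Trim and left-align one variant; pos is 1-based on ref_seq."""
    if ref == alt:
        raise ValueError("REF and ALT are identical")
    changed = True
    while changed:
        changed = False
        # Drop a shared last base from both alleles.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both one base to the left.
        if not ref or not alt:
            if pos <= 1:
                raise ValueError("cannot extend past the start of the sequence")
            pos -= 1
            base = ref_seq[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Drop shared leading bases, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt

# Three representations of the same CA insertion in a toy CA repeat
toy_ref = "TTCACACAGG"
right_shifted = normalize(toy_ref, 8, "A", "ACA")
mid_repeat = normalize(toy_ref, 5, "C", "CAC")
padded = normalize(toy_ref, 2, "TC", "TCAC")
```

All three calls should return the same left-aligned form, (2, 'T', 'TCA'), which is what allows the same variant from two sources to be matched on SEQ, POS, REF and ALT.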

The gnomAD genome files above contain a total of 4851138 distinct variant alleles, of which 145892 (~3%) are insertions. The gnomAD exome files above contain a total of 17009588 distinct variant alleles, of which 414576 (~2.4%) are insertions. Note that many of these variants are common to both the exomes and genomes, but in the tables below variants that occur in both are counted only once.

Table 1 below focuses on the insertions, and in particular the duplications. The second column (insertions) gives the counts of all the distinct insertion variant alleles, binned by the length of the insertion (length), with all variants of length at least 40 combined into one bin. Subsequent columns give the number of variants that satisfy additional criteria, as follows:

dup: the insertion is an exact duplication of the immediately adjacent sequence in the GRCh37 reference genome (immediately 3’ with this normalization). Note that there may be polymorphisms in this adjacent sequence that affect whether an insertion is indeed a perfect duplication for any given individual.
dup2: the insertion does not add a repeat-unit to what is already a (two-or-more unit) tandem repeat in the reference genome. This eliminates e.g. the duplication of CCCGGG in RAX2, as the reference genome already has two immediately adjacent (3’) tandem copies of this (screengrab from https://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=rs549932754 below)

dup2i: the insertion satisfies the previous constraints and is not itself a perfect tandem repeat (e.g., for a duplicated six-mer, it is not of the form XXXXXX, XYXYXY or XYZXYZ). Note that even if a duplicated sequence is not itself a perfect tandem repeat it may contain internal tandem repeats – e.g. the AGGAGG in the duplicated AAGGAGGATC in NCF4 — so depending on where the Cas9 cleavage site is this may need to be considered, to prevent a shorter internal microduplication from being collapsed instead of the full duplication.
dup2iC: the variant satisfies the previous constraints and is also listed in ClinVar.
dup2iL: the variant satisfies the previous constraints and is reported in ClinVar as “Pathogenic”, “Pathogenic/Likely_pathogenic”, “Likely_pathogenic” or “Conflicting_interpretations_of_pathogenicity”
dup2iP: the variant satisfies the previous constraints and is reported in ClinVar as “Pathogenic” or “Pathogenic/Likely_pathogenic”
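
The first three criteria above (dup, dup2, dup2i) can be sketched in a few lines of Python. This is one possible reading of the definitions, not the modified vt code actually used: it assumes a left-normalized insertion whose ALT begins with REF, and it treats “already a tandem repeat in the reference” as two full copies of the inserted sequence immediately 3’ of the insertion point. The function names and toy sequences are made up for illustration.

```python
def is_tandem_repeat(s):
    """True if s is two or more copies of a shorter unit (XXXXXX, XYXYXY, XYZXYZ, ...)."""
    n = len(s)
    return any(n % d == 0 and s == s[:d] * (n // d) for d in range(1, n))

def classify_insertion(ref_seq, pos, ref, alt):
    """Return the dup / dup2 / dup2i flags for an insertion (pos is 1-based)."""
    if not (alt.startswith(ref) and len(alt) > len(ref)):
        raise ValueError("expected an insertion whose ALT begins with REF")
    ins = alt[len(ref):]
    downstream = ref_seq[pos + len(ref) - 1:]    # reference sequence 3' of REF
    dup = downstream.startswith(ins)
    dup2 = dup and not downstream.startswith(ins * 2)
    dup2i = dup2 and not is_tandem_repeat(ins)
    return {"dup": dup, "dup2": dup2, "dup2i": dup2i}

# Toy examples: CCCGGG next to two reference copies (as in the RAX2 case above),
# and a four-base duplication next to a single reference copy
repeat_case = classify_insertion("ACCCGGGCCCGGGT", 1, "A", "ACCCGGG")
simple_case = classify_insertion("GCAGTA", 1, "G", "GCAGT")
```

Under these assumptions the first example is a dup but fails dup2, while the second passes all three tests and would count as a “simple” duplication in the sense used below.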

“Simple” duplications: We will refer to those duplications that satisfy conditions dup2 and dup2i as “simple” duplications. Although neither condition is strictly necessary for using a Microhomology-Mediated End-Joining (MMEJ) strategy for correction, they can reduce some potential difficulties with the nuclease cleaving the wild-type or already-corrected alleles as well as the targeted duplication-containing allele(s), or having multiple repair products with different numbers of repeats or sub-repeats collapsed. Note that these conditions exclude repeat-expansion disorders such as the polyQ/CAG expansion in Huntington’s disease, and the C9orf72 hexanucleotide expansion in FTD and/or ALS, in which shortening of the expansions to below critical thresholds may be more relevant than repair to a single precise wild-type length:

(The three variants above do not appear to be included in the ClinVar vcf at all. The ClinVar webpages assign them variant type “Microsatellite”, some of which are included in the vcf, so perhaps their absence is due to the lack of a single fixed-length ALT allele in their HGVS descriptions.)

Simple duplications were identified using a modified version of the function annotate_indels from the vt tool set; this and other custom code can be found here.

Table 1: counts of duplications in gnomAD coding regions

length	insertions	dup	dup2	dup2i	dup2iC	dup2iL	dup2iP
1	210230	179654	59169	59169	399	242	182
2	51418	29880	11919	7562	53	25	19
3	39579	23795	12892	11141	77	11	4
4	30704	18835	14615	13010	112	70	52
5	15142	6890	4754	4189	28	16	10
6	18971	11102	6125	5251	46	7	3
7	9634	3793	2976	2623	10	5	4
8	9123	3819	3038	2739	12	9	7
9	9818	5155	3979	3686	17	3	2
10	5756	1997	1683	1502	12	9	8
11	4326	1311	1236	1195	10	6	4
12	6249	3384	2957	2649	18	3	0
13	3207	1099	1068	1042	7	4	2
14	3068	1031	993	942	5	2	1
15	4307	2311	2190	2110	19	7	3
16	2813	1173	1128	1086	8	4	4
17	2438	1099	1069	1067	9	7	6
18	4316	2646	2552	2459	14	4	4
19	2065	1012	997	997	5	3	1
20	2148	1082	1045	1001	6	5	3
21	3463	2218	2141	2127	11	2	1
22	1687	818	806	799	1	0	0
23	1395	690	670	670	3	3	3
24	2272	1283	1244	1221	7	2	1
25	1149	485	477	471	1	1	1
26	1006	356	353	350	3	0	0
27	1373	653	635	631	6	1	0
28	878	314	308	304	3	1	1
29	751	239	233	233	1	1	0
30	1321	579	549	536	1	0	0
31	693	194	189	189	1	1	1
32	695	193	187	182	0	0	0
33	772	272	263	262	3	0	0
34	590	169	164	157	0	0	0
35	528	121	117	116	0	0	0
36	743	244	236	225	1	0	0
37	457	106	102	102	1	0	0
38	474	122	115	113	0	0	0
39	524	149	140	140	2	0	0
40+	12413	1818	1800	1756	7	1	1

Here are the totals from the table above:

length	insertions	dup	dup2	dup2i	dup2iC	dup2iL	dup2iP
all	468496	312091	147114	136004	919	455	328

Below is a barplot illustrating the stratification of the 468496 insertions from the table above into progressively finer subcategories. For simplicity the two levels restricting to dup2 and dup2i have been combined into a single level restricting to “simple” duplications, and the level dup2iL has been omitted between the restrictions to variants listed in ClinVar (dup2iC) and to variants listed as Pathogenic or Pathogenic/Likely_pathogenic in ClinVar (dup2iP).

Below is a more “ClinVar-centric” view of the insertions, beginning with the 5465 insertions annotated as “Pathogenic” or “Pathogenic/Likely_pathogenic” in ClinVar and stratifying them into progressively finer subcategories. The final level (those variants also observed at least once in gnomAD exome or genome “coding” regions) contains the same set of duplications as the final level of the plot above.

(Note that the two plots above differ from Extended Data Fig 10 in the manuscript in that they also include insertions of length 1. Note also that the counts of pathogenic insertions dip for lengths 3, 6, 9 …, as expected since multiples of three may result in in-frame rather than frame-shift mutations to proteins.)
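
The dip at multiples of three can also be seen directly in the dup2iP column of Table 1 (the pathogenic duplications that are also observed in gnomAD), which is used here as a stand-in for the counts shown in the plots. The short Python sketch below copies the counts for lengths 1 to 13 from that column and lists the multiples of three whose count is lower than that of both neighbouring lengths; the helper name is made up.

```python
# dup2iP counts for insertion lengths 1-13, copied from Table 1
dup2iP = {1: 182, 2: 19, 3: 4, 4: 52, 5: 10, 6: 3, 7: 4,
          8: 7, 9: 2, 10: 8, 11: 4, 12: 0, 13: 2}

def local_dips(counts, step=3):
    """Lengths that are multiples of `step` with a count below both neighbours."""
    return [n for n in sorted(counts)
            if n % step == 0 and n - 1 in counts and n + 1 in counts
            and counts[n] < min(counts[n - 1], counts[n + 1])]

dips = local_dips(dup2iP)
print(dips)
```

For these thirteen lengths every multiple of three (3, 6, 9 and 12) is a local minimum, consistent with in-frame insertions being less often pathogenic than frame-shifting ones.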

PAM sequences

Below are the PAM sequences that are scanned for, taken from the table at https://www.addgene.org/crispr/guide/#pam-table and from Hu et al (Nature 2018; doi:10.1038/nature26155) [see also http://blog.addgene.org/xcas9-engineering-a-crispr-variant-with-pam-flexibility]. In the column Expanded the IUPAC codes in the column Pattern are expanded into a regular expression, with multiple allowed bases enclosed by brackets. (Note that the R regular expression scanning function gregexpr can return multiple hits per scanned sequence, but will not return overlapping hits; this uses a modified version, which will return the positions of all matches, even if they overlap.) The column Legend shows the characters used to indicate start positions of PAM sequences in sequence schematics; upper-case characters indicate that this is the only PAM sequence beginning at the position, and lower-case characters indicate that there are additional PAM sequences beginning at the position (but only the character for the first is shown). The column Side indicates to which side of the guide RNA the PAM sequence is located.
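
The two steps described above (expanding IUPAC codes into a regular expression, and finding all matches even when they overlap) can be sketched in Python as follows. This is not the modified gregexpr used for this page; it relies instead on the standard trick of wrapping the pattern in a zero-width lookahead, and the function names are made up for illustration. Only the IUPAC codes that occur in the Pattern column are included.

```python
import re

# IUPAC codes used in the Pattern column of the PAM table
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "V": "[ACG]", "N": "[ACGT]"}

def expand_iupac(pattern):
    """Expand IUPAC codes into a regular expression; '|' separates alternatives."""
    return "|".join("".join(IUPAC[c] for c in part) for part in pattern.split("|"))

def find_all_starts(pattern, seq):
    """0-based start positions of all matches, including overlapping ones."""
    lookahead = re.compile("(?=(?:%s))" % expand_iupac(pattern))
    return [m.start() for m in lookahead.finditer(seq)]

overlapping = find_all_starts("NGG", "AGGGG")
plain = [m.start() for m in re.finditer(expand_iupac("NGG"), "AGGGG")]
print(overlapping, plain)
```

In this toy sequence the lookahead version reports NGG starting at positions 0, 1 and 2, whereas a plain scan reports only position 0, since the later hits overlap the first one.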

Table 2: PAM sequences scanned for in duplications and flanking sequence

Legend	Species_and_Variant_of_Cas9	Side	Pattern	Expanded	CleavageSite	Width
A/a	xCas9_NG	3’	NG	[ACGT]G	-3 from start	2
B/b	xCas9_GAA	3’	GAA	GAA	-3 from start	3
C/c	xCas9_GAT	3’	GAT	GAT	-3 from start	3
D/d	SpCas9	3’	NGG	[ACGT]GG	-3 from start	3
E/e	SpCas9 VRER variant	3’	NGCG	[ACGT]GCG	-3 from start	4
F/f	SpCas9 EQR variant	3’	NGAG	[ACGT]GAG	-3 from start	4
G/g	SpCas9 VQR variant	3’	NGAN|NGNG	[ACGT]GA[ACGT]|[ACGT]G[ACGT]G	-3 from start	4
H/h	SaCas9	3’	NNGRRT	[ACGT][ACGT]G[AG][AG]T	-3 from start	6
I/i	NMe1	3’	NNNNGATT	[ACGT][ACGT][ACGT][ACGT]GATT	-3 from start	8
J/j	CjeCas9	3’	NNNNRYAC	[ACGT][ACGT][ACGT][ACGT][AG][CT]AC	-3 from start	8
K/k	AsCpf1 and LbCpf1	5’	TTTV	TTT[ACG]	approx +18 from end	4
M/m	AsCpf1 and LbCpf1 RR variant	5’	TYCV	T[CT]C[ACG]	approx +18 from end	4
N/n	AsCpf1 RVR variant	5’	TATV	TAT[ACG]	approx +18 from end	4
O/o	FnCpf1	5’	TTV	TT[ACG]	approx +18 from end	3

Notes on PAM sequences:

In the table above the rows for the AsCpf1 RR variant and the LbCpf1 RR variant are combined as their PAM sequences in the table at Addgene are identical.
The entry for SpCas9 D1135E variant with PAM sequence “3’ NGG (reduced NAG binding)” is omitted, since the NGG is identical to the usual SpCas9 PAM, although one could scan for N[GA]G instead to pick up the weaker NAG binding as well.
For SpCas9 VQR variant the pattern picks up either of the motifs in the Addgene PAM sequence “3’ NGAN or NGNG”.
For SaCas9 the Addgene PAM sequence is “3’ NNGRRT or NNGRR(N)”. Here just the first of these is used, as the second pattern would subsume it; according to the Addgene manual the efficiency is best for T, but other bases can be considered when evaluating off-target effects.
The predicted cleavage sites are 3 bases before the start of the PAM for Cas9, and 18 bases after the end of the PAM for Cpf1s. These Cpf1 cleavage sites may vary slightly, and the non-targeting strand is cleaved further from the PAM, leaving a 4-5 nt overhang (Zetsche et al, Cell 2015; doi:10.1016/j.cell.2015.09.038).
The letters L/l are not used in the Legend column to avoid confusion with I/i for some fonts.
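
As a small illustration of the cleavage-site rule in the notes above, the Python sketch below converts a PAM start position into a predicted cut position for a PAM found on the scanned (Watson) strand. The function name and the convention are assumptions made for this example: positions are 0-based string indices, and the returned index i means that the cut falls between seq[i-1] and seq[i]. The stagger of the Cpf1 cut on the other strand is ignored.

```python
def watson_cut_index(pam_start, pam_width, side):
    """Index i such that the predicted cut lies between seq[i-1] and seq[i].

    pam_start: 0-based start of the PAM on the scanned strand
    side: "3'" for Cas9-type PAMs, "5'" for Cpf1-type PAMs (column Side)
    """
    if side == "3'":
        return pam_start - 3                # 3 bases before the start of the PAM
    if side == "5'":
        return pam_start + pam_width + 18   # 18 bases after the end of the PAM
    raise ValueError("side must be \"3'\" or \"5'\"")

# Toy SpCas9 target: a 20-base protospacer followed by an NGG PAM
cas9_seq = "ACGTACGTACGTACGTACGT" + "TGG"
cas9_cut = watson_cut_index(20, 3, "3'")
left, right = cas9_seq[:cas9_cut], cas9_seq[cas9_cut:]

# Toy Cpf1 target: a TTTV PAM followed by 24 bases
cpf1_seq = "TTTA" + "ACGT" * 6
cpf1_cut = watson_cut_index(0, 4, "5'")
```

Labelling the bond rather than a base avoids the off-by-one bookkeeping that is needed when cut sites are drawn on top of bases in a sequence schematic.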

Selected duplications that are in gnomAD as well as ClinVar

Below is a table of the duplication variants from the final column (dup2iP) of the table above, which are annotated as “Pathogenic” or “Pathogenic/Likely_pathogenic” in ClinVar and are observed in the gnomAD exome (EX_ columns) or genome (GN_ columns) coding regions. Allele frequencies from the 1000 Genomes Project (TGP_ columns) are also included, but only a few pathogenic duplications were observed there, all of which were also observed in gnomAD.

The columns are as follows:

SEQ - Chromosome for variant
POS - Start position of REF allele for variant (trimmed and left-normalized)
REF - Reference allele; following VCF convention this has at least one nucleotide even for insertions
ALT - Alternate allele; note that for insertions the first nucleotide is the REF allele and is not itself part of the inserted sequence
INS_LEN - Length of inserted sequence (here length of ALT minus length of REF)
ALLELEID - From ClinVar: Allele ID of variant; clicking on link will load page from ClinVar
GENEINFO - From ClinVar: symbol and ID (separated by colon) of gene impacted by variant
MAX_AF_ANY - The maximum allele-frequency from any population in gnomAD or TGP (maximum of GN_AF_POPMAX, EX_AF_POPMAX, and TGP_AF_POPMAX); the corresponding allele count (AC) and total allele number (AN) can be found in the columns for each dataset. Note that the variance in these estimates can be high, particularly for TGP, which has smaller populations.
MAX_AF_WHICH - Indicates the dataset (GN, EX, or TGP) and population (three-letter code) that attains this maximum (see descriptions of other columns below for details).
CLNSIG - From ClinVar: clinical significance, with conflicts handled as described here: https://www.ncbi.nlm.nih.gov/clinvar/docs/clinsig/
CLNDN - From ClinVar: disease name(s); this sometimes (but not usually) includes information on mode-of-inheritance. Some variants annotated as pathogenic in ClinVar do not have any associated diseases listed in this column, but clicking on the ALLELEID entry will load the ClinVar webpage for the variant, which in some cases does have information on the associated disease in the “Assertion and evidence details” section.
CLNHGVS - From ClinVar: HGVS notation for effect on genome (https://varnomen.hgvs.org/)
CLNOMIM - From ClinVar: OMIM ID(s) associated with disease, excerpted from the CLNDISDB field in the ClinVar vcf
OMIM_MOI - From OMIM: modes-of-inheritance associated with IDs in CLNOMIM, taken from the “Phenotype” column of the table genemap2.txt from https://omim.org/downloads/ (registration required). Abbreviations: AD: Autosomal dominant; AR: Autosomal recessive; MF: Multifactorial; MF: Mitochondrial; XLD: X-linked dominant; DR: Digenic recessive. Note that the modes-of-inheritance listed refer to the disease with the given OMIM ID, and different variants or genes associated with the disease may have different modes.
RS - From ClinVar: dbSNP ID (rsID); different ALT alleles at the same position can have the same rsID, so this does not uniquely identify the variant. Also, this rsID may not agree with that in the VEP Existing_variation column (due e.g. to merged or changed IDs, or differences in normalization of the variant).
GNOMAD_ID - Combination of SEQ-POS-REF-ALT; clicking link loads gnomAD page for variant
IMPACT - From VEP: HIGH (e.g. nonsense, splice-site, frame-shift), MODERATE (e.g. missense, in-frame), LOW (e.g. synonymous), or MODIFIER predicted impact. See https://useast.ensembl.org/info/docs/tools/vep/vep_formats.html#output for details on this and other VEP fields. VEP version 85 was used, and the consequence to show is chosen with a preference for (HC > LC > no) LoF, then (HIGH > MODERATE > LOW > MODIFIER) IMPACT, then by distance to nearest gene (if none overlap), then with a preference for canonical transcripts, then in reverse alphabetical order by gene-symbol.
SYMBOL - From VEP: symbol for impacted gene.
HGVSc - From VEP: HGVS notation for effect on transcript (https://varnomen.hgvs.org/)
HGVSp - From VEP: HGVS notation for effect on protein (https://varnomen.hgvs.org/)
DISTANCE - From VEP: shortest distance from variant to transcript (blank when variant is within 8bp of exon)
Existing_variation - From VEP: rsID and other IDs; rsID may not agree with RS from ClinVar.
LoF - Predicted loss-of-function from the VEP LOFTEE plugin (HC/LC for High/Low confidence)
(EX|GN)_FILTER - FILTER field for variant in gnomAD exomes (EX) or genomes (GN); AC0 means no high-confidence ALT alleles were called
(EX|GN)_NUM_ALT - Number of distinct ALT alleles at this position in gnomAD exomes (EX) or genomes (GN); greater than one indicates multi-allelic sites, for which extra caution is needed when matching up variant IDs and allele frequencies with other databases.
(GN|EX|TGP)_(AC|AN|AF) - The columns *_AC and *_AN give the counts of the ALT allele (*_AC) and all alleles (*_AN) from reference datasets; their ratio gives the allele frequency (*_AF). Here the prefixes GN, EX, and TGP indicate that the data is from the gnomAD genomes, gnomAD exomes, or 1000 Genomes Project phase 3 data.
BOTH_* - Allele counts (AC), total allele numbers (AN) and allele frequencies (AF) for the gnomAD genomes and exomes combined. Note that when a variant is observed in the exomes but not the genomes these just reduce to the AC/AN/AF for the exomes; this is not really a combined AF, as it doesn’t account for the number of alleles with REF basecalls in the genomes (which could be anywhere from 0 to ~30,000; this info is not present in the vcf for unobserved variants but could perhaps be extracted from the coverage files). This is the same behavior as on the gnomAD website, and the same caveat applies to variants observed in the genomes but not the exomes.
*_POPMAX - The population with the highest allele frequency for this variant, and its AC/AN/AF. See http://gnomad.broadinstitute.org/faq for the meaning of the three-letter codes for populations in gnomAD and the total number of exomes and genomes included from each population. See http://www.internationalgenome.org/faq/which-populations-are-part-your-study for the three-letter codes for the 26 populations for the TGP data, most of which included approximately 100 samples (range of 61 to 113).
CONTEXT - Shows the following tracks, from top to bottom, lined-up so that sequence-positions correspond between tracks:
1. SEQ_DUP shows the duplicated sequence in upper-case (including the extra copy on the variant allele), and flanking sequence in lower case.
2. DUP_NUM labels the copies of the duplicated segment (1 for first copy, 2 for second); these positions are also color-coded yellow and cyan in the html table, but the colors are lost when exporting the table as a CSV or Excel file.
3. Cas9_Wa shows cleavage sites for Cas9 enzymes on the Watson strand, 3 bases left of PAM starts from PAM_Wa_* columns.
4. Cas9_Cr shows cleavage sites for Cas9 enzymes on the Crick strand, 4 bases right of PAM starts from PAM_Cr_* columns.
5. Cpf1_Wa shows approximate cleavage sites for Cpf1 enzymes on the Watson strand, 19 bases right of PAM ends (from the PAM starts in PAM_Wa_* columns and adjusted for motif widths).
6. Cpf1_Cr shows approximate cleavage site for Cpf1 enzymes on the Crick strand, 18 bases left of PAM ends (from the PAM starts in PAM_Cr_* and adjusted for motif widths).
- The +/-1 base difference in shifts between the Watson and Crick tracks is so that cleavage positions are to the immediate left of the indicated base in both cases (which wouldn’t be an issue if we were labelling the spaces between bases rather than the bases themselves).
- The cleavage sites are labeled according to the Legend column in the table of PAM sequences above, with an upper-case letter if it’s the only matching PAM sequence, and a lower-case letter if it’s the first of more than one matching PAM sequence.
- The Cpf1 cleavage sites are staggered on the two strands, leaving an overhang of 4-5 nt at the double-stranded break; this is not indicated in these schematics.
- Motifs are scanned for in flanking regions of size 50 and the CONTEXT column includes flanking regions of size 25, so cleavage sites should be shown even if the PAM site itself does not fall within the displayed sequence (as the distance between the cleavage site and the furthest position in the PAM site is no more than 25 bases).
- Here’s a more compact representation of the information in the CONTEXT column as an image, but embedding an image for each entry in the table bogs things down:
Cut_* - The cut site from the following PAM_* columns that has coordinate closest to zero (the position exactly between the two copies of the duplication), separately for the Watson (Cut_Wa) and Crick (Cut_Cr) strands. The PAM sequence corresponding to this cut is listed in parentheses (just the first one from the PAM table if there is more than one with this predicted cut position). Caution: for Cpf1s with staggered DSBs only the coordinate on the targeted strand is considered, and a Cas9 with a blunt DSB may be preferable even if it is further from the central position.
PAM_* - Predicted cut sites and start sites for the indicated PAM motif (suffix of column name) on the Watson (PAM_Wa_*) or Crick (PAM_Cr_*) strand. Coordinates are set so that, for cuts, zero indicates cleavage at the bond exactly between the two copies of the duplicated sequence and, for PAM starts, zero indicates the first base in the second copy of the duplication. Negative numbers are positions 5’ of this and positive numbers are positions 3’ of this, always on the Watson strand. For Cas9s, if -INS_LEN < coord < +INS_LEN then the predicted double-stranded break site is somewhere in the duplicated sequences, leaving INS_LEN - abs(coord) homologous bases on either side of the cut. For Cpf1s the situation is a bit more complicated due to the staggered cut.
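
The relation between the cut coordinate and the remaining homology for a blunt (Cas9) cut can be checked with a short Python sketch. The function names and the toy sequences are made up for illustration, and returning zero for cuts outside the duplicated sequence is an assumption of this sketch rather than something stated above.

```python
def homologous_bases(ins_len, coord):
    """Homologous bases on either side of a blunt cut at `coord` (0 = bond between the copies)."""
    if -ins_len < coord < ins_len:
        return ins_len - abs(coord)
    return 0   # cut lies at the edge of, or outside, the duplicated sequence

def split_at_cut(left_flank, unit, right_flank, coord):
    """Split the duplication-carrying allele at the cut; returns (left end, right end)."""
    allele = left_flank + unit + unit + right_flank
    cut = len(left_flank) + len(unit) + coord
    return allele[:cut], allele[cut:]

# Toy duplication of CAGT, cut one base 3' of the central bond
k = homologous_bases(4, 1)
left_end, right_end = split_at_cut("GG", "CAGT", "TT", 1)
print(k, left_end, right_end)
```

Here k is 3: the right-hand end begins with AGT, and the same three bases occur in the left-hand end just before its final base, so annealing them would collapse the two copies back to one.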

Notes on table of variants:

There are many fewer samples in the TGP (~2,500) than in the gnomAD genomes (~15,000) or exomes (~120,000), so many rare variants seen in the gnomAD datasets are not seen at all in the TGP. One advantage of the TGP data is that it categorizes subjects into narrower populations, so it may reveal alleles that have much greater frequency in specific populations. See for example ClinVar Allele 20316, a duplication in HPS1 associated with Hermansky-Pudlak syndrome, which has much higher frequency in Puerto Ricans than in other TGP populations or in the broader gnomAD populations; see screengrabs from https://www.ncbi.nlm.nih.gov/variation/tools/1000genomes/?gts=rs281865163 and http://gnomad.broadinstitute.org/variant/10-100183554-T-TGGGCCTCCCCTGCTGG below:

(Screengrabs: gnomAD exomes, gnomAD genomes)

There do not appear to be many other variants with the same flavor as that HPS1 example, though: only two other duplications in the table below appear in TGP at all, and although those are likewise enriched in specific populations, both are duplications of length just one.
Clicking on a column heading will sort by that column, and filters can be applied to individual columns (sliders, multiple-choice, or text-matching, as appropriate). E.g., one can set the slider for the INS_LEN column to restrict to duplications of length 4 to 20. Click on the circled-x in a filter box to cancel a filter. The main “Search” box applies to all columns; it is case-sensitive and allows regular expressions in JavaScript syntax — e.g. entering (TCAP|HPS1|HEXA|DOK7) in the search box would match any of those four strings, and entering (M|m)usc would match Muscle, muscle, Muscular, muscular,…
Clicking the “CSV” or “Excel” button will save the table as a comma-separated file or an xlsx file, both of which can be opened in Excel; you may want to change the CONTEXT column to a monospace font such as “Courier”, so that when you double-click on an entry the multiple rows from the CONTEXT field can line-up properly as in the screengrab below (which may have used a different PAM-list, so the details of the tracks may differ):

(The above relies on Excel preferentially wrapping the text on white-spaces; you may need to fiddle with how far this column is from the right-edge of the Excel window to keep it from introducing too few or too many line-breaks; you could also try replacing the semicolons by line-breaks in Excel, https://answers.microsoft.com/en-us/msoffice/forum/msoffice_excel-mso_other-mso_2007/how-can-i-replace-commas-with-a-line-feed-in-a/55390740-16a4-48c8-85d8-c00e5ded2ed2, but how this works may depend on your operating system, and with my version of Excel it also triggers undesirable within-cell word-wrapping.)

Table 3: selected pathogenic duplications from ClinVar that are observed in gnomAD

What are we missing out on by only looking at the “coding” regions (exome_calling_regions.v1.interval_list) in the gnomAD genomes?

A lot of duplications, no doubt — perhaps around 98% of them — but it does not appear that any of these are annotated as “Pathogenic” in ClinVar. Certainly there are many variants listed in ClinVar that are not observed in either the gnomAD genomes or exomes, so are not accounted for in the table above, and this includes 2189 duplications that satisfy all the additional conditions for being in column dup2iP above. But 2183 of these are in these “coding” intervals, so if the variants had been observed at all in gnomAD they would have been reported in these vcfs. For the 6 that are not within these intervals, one can check for them in the full (not-just-coding-intervals) gnomAD and TGP vcfs — they are not observed there either. These variants are listed below; they are mainly variants in UTRs or in intronic regions >50 bases from the nearest exons (and hence not in the “coding” intervals list).

Table 4: selected pathogenic duplications from ClinVar that are outside of gnomAD “coding” regions

SEQ	POS	REF	ALT	INS_LEN	ALLELEID	GENEINFO	CLNSIG	CLNDN	OMIM_MOI
2	47656744	G	GTGAGCCACTGCGCCCAGCA…	454	95998	MSH2:4436	Pathogenic	Lynch syndrome
10	86018468	A	AG	1	431743	RGR:5995	Pathogenic	Retinal dystrophy
17	1665408	G	GT	1	50330	SERPINF1:5176	Pathogenic	Osteogenesis imperfecta, type VI
17	29554163	G	GAGCTTATCAGGTTCTCCAT…	337	213707	NF1:4763	Pathogenic	Neurofibromatosis, type 1	AD
17	29556697	T	TGGGTACGAGTGTCTGCGTA…	412	213709	NF1:4763	Pathogenic	Neurofibromatosis, type 1	AD
20	18038617	A	ACCGGTTCCGGCGGCCGGGG…	22	226677	OVOL2:58495	Pathogenic	Posterior polymorphous corneal dystrophy 1	AD

What other pathogenic duplications in ClinVar might we be missing when looking at gnomAD/TGP for allele frequencies?

It wouldn’t be surprising to miss out on variants that are extremely rare in general, or even not-terribly-rare variants that are concentrated in populations without many samples: with only ~100 subjects per population in the TGP data one would expect to miss out on ~13% of alleles with frequency 0.01 in these populations. And a few other possibilities:

Subjects known to have severe pediatric disease were not included in the gnomAD dataset, so variants that cause these diseases may be under-represented, in particular those with dominant inheritance.
About 8% of the genome was masked during the gnomAD variant calling (e.g. some repetitive sequence), so any ClinVar variants that fall in these regions will not be reported in gnomAD. But it appears that only one of the ~13000 insertions from ClinVar falls in one of these masked regions: a benign variant in SHOX1, which is masked since it’s in a PAR on the Y chromosome.
gnomAD doesn’t report variants on the Y chromosome at all, whether in masked regions or not. But the only “Pathogenic” duplication on the Y chromosome in ClinVar is a single-base insertion, Y:2655380:C/CT in the gene SRY: https://www.ncbi.nlm.nih.gov/clinvar/variation/470195/
The longest insertion reported in the gnomAD coding regions has length 621, and there are 1431 insertions of length at least 100, but it’s possible that the detection sensitivity declines for longer insertions. Below is a table of the 9 “Pathogenic” duplications with length at least 100 from ClinVar that satisfy the conditions of the column dup2iP above, none of which are observed in gnomAD.
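
The estimate earlier in this section, that with ~100 subjects per population one would miss ~13% of alleles with frequency 0.01, can be reproduced with a one-line calculation. The Python sketch below assumes diploid subjects whose alleles are sampled independently, which is a simplification; the function name is made up.

```python
def prob_allele_unobserved(freq, n_subjects, ploidy=2):
    """Probability that an allele of the given frequency is absent from every sampled chromosome."""
    return (1 - freq) ** (ploidy * n_subjects)

p_missed = prob_allele_unobserved(0.01, 100)
print(round(p_missed, 3))
```

With 100 diploid subjects there are 200 sampled chromosomes, and 0.99 to the power 200 is about 0.134, in line with the ~13% quoted above.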

Table 5: selected pathogenic duplications of length at least 100 from ClinVar

SEQ	POS	REF	ALT	INS_LEN	ALLELEID	GENEINFO	CLNSIG	CLNDN	OMIM_MOI
2	47656744	G	GTGAGCCACTGCGCCCAGCA…	454	95998	MSH2:4436	Pathogenic	Lynch syndrome
2	145156601	T	TTGGGAGCTAACGGCTTGGA…	115	448705	ZEB2:9839	Pathogenic	Mowat-Wilson syndrome	AD
2	200298058	T	TACCTTGGGCCTGGGCCGCA…	177	214752	SATB2:23314	Pathogenic	Chromosome 2q32-q33 deletion syndrome	AD
3	138665088	G	GTGCGCGGGCGGCGGCCGGA…	124	354130	FOXL2:668	Pathogenic	Blepharophimosis, ptosis, and epicanthus inversus	AD
7	150644477	C	CCGGGCTGGAGAGGGGGATG…	275	424894	KCNH2:3757	Pathogenic	Long QT syndrome 2	AD
17	29554163	G	GAGCTTATCAGGTTCTCCAT…	337	213707	NF1:4763	Pathogenic	Neurofibromatosis, type 1	AD
17	29556697	T	TGGGTACGAGTGTCTGCGTA…	412	213709	NF1:4763	Pathogenic	Neurofibromatosis, type 1	AD
17	41251791	C	CCCAATTCAATGTAGACAGA…	1058	262984	BRCA1:672	Pathogenic	Breast-ovarian cancer, familial 1	AD/MF
X	25023006	G	GGGGGCCATTGTGGAAAAGA…	103	209011	ARX:170302	Pathogenic	Lissencephaly 2, X-linked

Below is a table of “simple” duplications that are annotated in ClinVar as “Pathogenic” or “Pathogenic/Likely pathogenic”, and are associated in ClinVar with a disease annotated as autosomal dominant (AD) in OMIM, but which do not appear in gnomAD: