Website accompanying “Precise therapeutic gene correction by a simple nuclease-induced double-strand break”

Note: The expanded searchable/sortable/filterable version of Supplementary Table 3 from the manuscript appears as Table 3 below.

Overview

The results below for the most part are based on the files of “coding” variants from gnomAD genomes and exomes, version 2.0.2 (https://console.cloud.google.com/storage/browser/gnomad-public/release/2.0.2/)

  • gnomad.genomes.r2.0.2.sites.coding_only.chr1-22.vcf
  • gnomad.genomes.r2.0.2.sites.coding_only.chrX.vcf
  • gnomad.exomes.r2.0.2.sites.vcf

These consist of all variants in the intervals used for ExAC (ftp://ftp.broadinstitute.org/pub/ExAC_release/release1/resources/)

  • exome_calling_regions.v1.interval_list

Most of these intervals correspond to exons plus 50 flanking bases on each side, and they collectively cover 60 million bases, about 2% of the genome. Note that there are no variant calls for the Y chromosome, and these are not strictly all coding variants, as some are in introns, UTRs, miRNA, ncRNA.

The 1000 Genomes Project data were taken from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/. The vcf files there include precomputed allele frequencies for only five broad super-populations; the allele frequencies for the 26 more-specific populations were computed from the per-individual genotypes in the vcf files, aggregated using the population assignments from the file integrated_call_samples_v3.20130502.ALL.panel.
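The per-population aggregation described above can be sketched as follows. This is a minimal illustration rather than the code actually used; the function name, the sample IDs and the genotypes are made up, and it assumes a single ALT allele coded as "1" in VCF-style GT strings.

```python
from collections import defaultdict

def population_allele_frequencies(genotypes, sample_to_pop):
    """Return {population: (AC, AN, AF)} for one ALT allele at one site.

    genotypes: {sample_id: GT string such as "0|1"}; "." marks a missing call.
    sample_to_pop: {sample_id: population code}, e.g. parsed from the panel file.
    """
    ac = defaultdict(int)  # ALT allele count per population
    an = defaultdict(int)  # called allele number per population
    for sample, gt in genotypes.items():
        pop = sample_to_pop[sample]
        for allele in gt.replace("/", "|").split("|"):
            if allele == ".":
                continue  # missing calls add to neither AC nor AN
            an[pop] += 1
            if allele == "1":
                ac[pop] += 1
    return {pop: (ac[pop], an[pop], ac[pop] / an[pop]) for pop in an}

# Toy data (hypothetical samples, not taken from the TGP files).
panel = {"S1": "PUR", "S2": "PUR", "S3": "GBR", "S4": "GBR"}
gts = {"S1": "0|1", "S2": "1|1", "S3": "0|0", "S4": "0|."}
freqs = population_allele_frequencies(gts, panel)
print(freqs)
```

Haploid calls (a GT with a single allele, as for males on chrX) pass through the same loop and contribute one allele to AN.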

The ClinVar annotations were taken from the file ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/archive_2.0/2018/clinvar_20180225.vcf.gz. Note that matching up variants from two different sources (e.g. ClinVar and gnomAD) can sometimes be tricky, particularly for indels and multi-allelic sites, since the same variant may have multiple representations. (The gnomAD vcf files themselves have ClinVar annotations included, as part of the Ensembl VEP output, but this seems to have many spurious or missing or out-of-date annotations; likewise for the ExAC and TGP allele frequencies included in the ClinVar vcf file.) Here the variants have been decomposed and normalized (trimmed and left-aligned; vt 0.5772) for the purpose of matching them up (bcftools 1.9), but the HGVS notation used at ClinVar may follow the right-aligned (3’-most position) convention, in which duplications are taken to occur immediately after the repeated sequence rather than immediately before the repeated sequence.
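To illustrate why the same insertion can have several representations, here is a minimal sketch of left-alignment for a pure insertion. It is not the vt implementation; the function names and the toy reference sequence are assumptions made for this example, and positions are 0-based counts of reference bases preceding the insertion point.

```python
def apply_insertion(ref_seq, pos, inserted):
    """Return the ALT haplotype with `inserted` placed after the first `pos` bases."""
    return ref_seq[:pos] + inserted + ref_seq[pos:]

def left_align_insertion(ref_seq, pos, inserted):
    """Shift an insertion as far 5' as possible without changing the ALT haplotype."""
    while pos > 0 and ref_seq[pos - 1] == inserted[-1]:
        # The base before the insertion point equals the last inserted base,
        # so rotating the inserted sequence and moving one base left is equivalent.
        inserted = inserted[-1] + inserted[:-1]
        pos -= 1
    return pos, inserted

ref = "TTCAGCAGAA"  # toy reference with two tandem copies of CAG
right_aligned = (8, "CAG")  # extra copy written after the repeat (3'-most)
left_aligned = left_align_insertion(ref, *right_aligned)
print(left_aligned)
```

Both representations yield the same ALT haplotype, which is why variants are normalized to a single convention before being matched across sources.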

The gnomAD genome files above contain a total of 4851138 distinct variant alleles, of which 145892 (~3%) are insertions. The gnomAD exome files above contain a total of 17009588 distinct variant alleles, of which 414576 (~2.4%) are insertions. Note that many of these variants are common to both the exomes and genomes, but in the tables below variants that occur in both are counted only once.

This table focuses on the insertions, and in particular the duplications. The second column (insertions) gives the counts of all the distinct insertion variant alleles, binned by the length of the insertion (length), with all variants of length at least 40 combined into one bin. Subsequent columns give the number of variants that satisfy additional criteria, as follows:

  • dup: the insertion is an exact duplication of the immediately adjacent sequence in the GRCh37 reference genome (immediately 3’ with this normalization). Note that there may be polymorphisms in this adjacent sequence that affect whether an insertion is indeed a perfect duplication for any given individual.
  • dup2: the insertion does not add a repeat-unit to what is already a (two-or-more unit) tandem repeat in the reference genome. This eliminates e.g. the duplication of CCCGGG in RAX2, as the reference genome already has two immediately adjacent (3’) tandem copies of this (screengrab from https://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=rs549932754 below)
 

 

  • dup2i: the insertion satisfies the previous constraints and is not itself a perfect tandem repeat (e.g., for a duplicated six-mer, it is not of the form XXXXXX, XYXYXY or XYZXYZ). Note that even if a duplicated sequence is not itself a perfect tandem repeat it may contain internal tandem repeats – e.g. the AGGAGG in the duplicated AAGGAGGATC in NCF4 — so depending on where the Cas9 cleavage site is this may need to be considered, to prevent a shorter internal microduplication from being collapsed instead of the full duplication.
  • dup2iC: the variant satisfies the previous constraints and is also listed in ClinVar.
  • dup2iL: the variant satisfies the previous constraints and is reported in ClinVar as “Pathogenic”, “Pathogenic/Likely_pathogenic”, “Likely_pathogenic” or “Conflicting_interpretations_of_pathogenicity”
  • dup2iP: the variant satisfies the previous constraints and is reported in ClinVar as “Pathogenic” or “Pathogenic/Likely_pathogenic”
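The first three criteria (dup, dup2, dup2i) can be sketched as below for a left-aligned insertion. This is an illustrative reading of the definitions above, not the modified annotate_indels code; the function names are hypothetical, the flanking bases in the toy references are invented, and reading "two-or-more unit tandem repeat" as "the reference starts with two copies of the inserted sequence at the insertion point" is an assumption.

```python
def is_perfect_tandem_repeat(seq):
    """True if seq consists of two or more copies of a shorter unit (e.g. XYXYXY)."""
    n = len(seq)
    return any(n % p == 0 and seq == seq[:p] * (n // p) for p in range(1, n))

def classify_insertion(ref_seq, pos, inserted):
    """Classify an insertion placed after the first `pos` bases of ref_seq."""
    n = len(inserted)
    # dup: inserted bases equal the reference sequence immediately 3' of the insertion point.
    dup = ref_seq[pos:pos + n] == inserted
    # dup2: the reference does not already hold two tandem copies at that point.
    dup2 = dup and ref_seq[pos:pos + 2 * n] != inserted * 2
    # dup2i: the duplicated sequence is not itself a perfect tandem repeat.
    dup2i = dup2 and not is_perfect_tandem_repeat(inserted)
    return {"dup": dup, "dup2": dup2, "dup2i": dup2i}

# CCCGGG already present twice in the (toy) reference: fails dup2.
print(classify_insertion("ACCCGGGCCCGGGTT", 1, "CCCGGG"))
# AAGGAGGATC present once and not a perfect tandem repeat: passes all three.
print(classify_insertion("TTAAGGAGGATCGG", 2, "AAGGAGGATC"))
# ACACAC present once but of the form XYXYXY: fails dup2i.
print(classify_insertion("GGACACACTT", 2, "ACACAC"))
```

Note that a single-base duplication can never be a perfect tandem repeat under this definition, which is consistent with dup2 and dup2i being equal for length 1 in Table 1.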

“Simple” duplications: We will refer to those duplications that satisfy conditions dup2 and dup2i as “simple” duplications. Although neither condition is strictly necessary for using a Microhomology-Mediated End-Joining (MMEJ) strategy for correction, they can reduce some potential difficulties with the nuclease cleaving the wild-type or already-corrected alleles as well as the targeted duplication-containing allele(s), or having multiple repair products with different numbers of repeats or sub-repeats collapsed. Note that these conditions exclude repeat-expansion disorders such as the polyQ/CAG expansion in Huntington’s disease, and the C9orf72 hexanucleotide expansion in FTD and/or ALS, in which shortening of the expansions to below critical thresholds may be more relevant than repair to a single precise wild-type length:

(The three variants above do not appear to be included in the ClinVar vcf at all. The ClinVar webpages assign them variant type “Microsatellite”, some of which are included in the vcf, so perhaps their absence is due to the lack of a single fixed-length ALT allele in their HGVS descriptions.)

Simple duplications were identified using a modified version of the function annotate_indels from the vt tool set; this and other custom code can be found here.

Table 1: counts of duplications in gnomAD coding regions

length insertions dup dup2 dup2i dup2iC dup2iL dup2iP
1 210230 179654 59169 59169 399 242 182
2 51418 29880 11919 7562 53 25 19
3 39579 23795 12892 11141 77 11 4
4 30704 18835 14615 13010 112 70 52
5 15142 6890 4754 4189 28 16 10
6 18971 11102 6125 5251 46 7 3
7 9634 3793 2976 2623 10 5 4
8 9123 3819 3038 2739 12 9 7
9 9818 5155 3979 3686 17 3 2
10 5756 1997 1683 1502 12 9 8
11 4326 1311 1236 1195 10 6 4
12 6249 3384 2957 2649 18 3 0
13 3207 1099 1068 1042 7 4 2
14 3068 1031 993 942 5 2 1
15 4307 2311 2190 2110 19 7 3
16 2813 1173 1128 1086 8 4 4
17 2438 1099 1069 1067 9 7 6
18 4316 2646 2552 2459 14 4 4
19 2065 1012 997 997 5 3 1
20 2148 1082 1045 1001 6 5 3
21 3463 2218 2141 2127 11 2 1
22 1687 818 806 799 1 0 0
23 1395 690 670 670 3 3 3
24 2272 1283 1244 1221 7 2 1
25 1149 485 477 471 1 1 1
26 1006 356 353 350 3 0 0
27 1373 653 635 631 6 1 0
28 878 314 308 304 3 1 1
29 751 239 233 233 1 1 0
30 1321 579 549 536 1 0 0
31 693 194 189 189 1 1 1
32 695 193 187 182 0 0 0
33 772 272 263 262 3 0 0
34 590 169 164 157 0 0 0
35 528 121 117 116 0 0 0
36 743 244 236 225 1 0 0
37 457 106 102 102 1 0 0
38 474 122 115 113 0 0 0
39 524 149 140 140 2 0 0
40+ 12413 1818 1800 1756 7 1 1

Here are the totals from the table above:

length insertions dup dup2 dup2i dup2iC dup2iL dup2iP
all 468496 312091 147114 136004 919 455 328

Below is a barplot illustrating the stratification of the 468496 insertions from the table above into progressively finer subcategories. For simplicity the two levels restricting to dup2 and dup2i have been combined into a single level restricting to “simple” duplications, and the level dup2iL has been omitted between the restrictions to variants listed in ClinVar (dup2iC) and to variants listed as Pathogenic or Pathogenic/Likely_pathogenic in ClinVar (dup2iP).

 

Below is a more “ClinVar-centric” view of the insertions, beginning with the 5465 insertions annotated as “Pathogenic” or “Pathogenic/Likely_pathogenic” in ClinVar, and stratifying them into progressively finer subcategories, arriving at the final level (those variants also observed at least once in gnomAD exome or genome “coding” regions) with the same set of duplications as at the final level above.

(Note that the two plots above differ from Extended Data Fig 10 in the manuscript in that they also include insertions of length 1. Note also that the counts of pathogenic insertions dip for lengths 3, 6, 9 …, as expected since multiples of three may result in in-frame rather than frame-shift mutations to proteins.)
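The dip at multiples of three can also be seen in the dup2iP column of Table 1. The short sketch below groups the counts for lengths 1 to 9 (copied from that table) by whether the insertion length is divisible by three; the helper name is hypothetical, and the frame interpretation applies only to insertions that fall within coding sequence.

```python
# dup2iP counts for insertion lengths 1-9, copied from Table 1.
dup2ip_by_length = {1: 182, 2: 19, 3: 4, 4: 52, 5: 10, 6: 3, 7: 4, 8: 7, 9: 2}

def frame_effect(ins_length):
    """An insertion whose length is a multiple of three preserves the reading frame."""
    return "in-frame" if ins_length % 3 == 0 else "frameshift"

totals = {"in-frame": 0, "frameshift": 0}
for length, count in dup2ip_by_length.items():
    totals[frame_effect(length)] += count
print(totals)
```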

PAM sequences

Below are the PAM sequences that are scanned for, taken from the table at https://www.addgene.org/crispr/guide/#pam-table and from Hu et al (Nature 2018; doi:10.1038/nature26155) [see also http://blog.addgene.org/xcas9-engineering-a-crispr-variant-with-pam-flexibility]. In the column Expanded the IUPAC codes in the column Pattern are expanded into a regular expression, with multiple allowed bases enclosed by brackets. (Note that the R regular expression scanning function gregexpr can return multiple hits per scanned sequence, but will not return overlapping hits; this uses a modified version, which will return the positions of all matches, even if they overlap.) The column Legend shows the characters used to indicate start positions of PAM sequences in sequence schematics; upper-case characters indicate that this is the only PAM sequence beginning at the position, and lower-case characters indicate that there are additional PAM sequences beginning at the position (but only the character for the first is shown). The column Side indicates to which side of the guide RNA the PAM sequence is located.
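The IUPAC expansion and the overlapping scan described above can be sketched in Python as follows (the page itself uses a modified R gregexpr, which is not reproduced here). The function names are hypothetical, only the IUPAC codes that occur in Table 2 are included, and the returned positions are 0-based. Overlapping hits are obtained by wrapping the pattern in a zero-width lookahead.

```python
import re

# IUPAC codes used in Table 2, expanded to bracketed character classes.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "V": "[ACG]", "N": "[ACGT]"}

def expand_iupac(pattern):
    """Expand a Pattern entry such as 'NGAN|NGNG' into a regular expression."""
    return "".join(ch if ch == "|" else IUPAC[ch] for ch in pattern)

def find_all_starts(pattern, seq):
    """Return the 0-based start of every match, including overlapping ones."""
    lookahead = re.compile("(?=(?:%s))" % expand_iupac(pattern))
    return [m.start() for m in lookahead.finditer(seq)]

print(expand_iupac("NNGRRT"))
print(find_all_starts("NGG", "AGGGGT"))
# A plain (non-overlapping) scan of the same sequence reports fewer hits.
print([m.start() for m in re.finditer(expand_iupac("NGG"), "AGGGGT")])
```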

Table 2: PAM sequences scanned for in duplications and flanking sequence

Legend Species_and_Variant_of_Cas9 Side Pattern Expanded CleavageSite Width
A/a xCas9_NG 3’ NG [ACGT]G -3 from start 2
B/b xCas9_GAA 3’ GAA GAA -3 from start 3
C/c xCas9_GAT 3’ GAT GAT -3 from start 3
D/d SpCas9 3’ NGG [ACGT]GG -3 from start 3
E/e SpCas9 VRER variant 3’ NGCG [ACGT]GCG -3 from start 4
F/f SpCas9 EQR variant 3’ NGAG [ACGT]GAG -3 from start 4
G/g SpCas9 VQR variant 3’ NGAN|NGNG [ACGT]GA[ACGT]|[ACGT]G[ACGT]G -3 from start 4
H/h SaCas9 3’ NNGRRT [ACGT][ACGT]G[AG][AG]T -3 from start 6
I/i NMe1 3’ NNNNGATT [ACGT][ACGT][ACGT][ACGT]GATT -3 from start 8
J/j CjeCas9 3’ NNNNRYAC [ACGT][ACGT][ACGT][ACGT][AG][CT]AC -3 from start 8
K/k AsCpf1 and LbCpf1 5’ TTTV TTT[ACG] approx +18 from end 4
M/m AsCpf1 and LbCpf1 RR variant 5’ TYCV T[CT]C[ACG] approx +18 from end 4
N/n AsCpf1 RVR variant 5’ TATV TAT[ACG] approx +18 from end 4
O/o FnCpf1 5’ TTV TT[ACG] approx +18 from end 3

Notes on PAM sequences:

  • In the table above the rows for the AsCpf1 RR variant and the LbCpf1 RR variant are combined as their PAM sequences in the table at Addgene are identical.
  • The entry for SpCas9 D1135E variant with PAM sequence “3’ NGG (reduced NAG binding)” is omitted, since the NGG is identical to the usual SpCas9 PAM, although one could scan for N[GA]G instead to pick up the weaker NAG binding as well.
  • For SpCas9 VQR variant the pattern picks up either of the motifs in the Addgene PAM sequence “3’ NGAN or NGNG”.
  • For SaCas9 the Addgene PAM sequence is “3’ NNGRRT or NNGRR(N)”; here just the first of these is used, as the second pattern would subsume the former. According to the Addgene manual the efficiency is best for T, but other bases can be considered when evaluating off-target effects.
  • The predicted cleavage sites are 3 bases before the start of the PAM for Cas9, and 18 bases after the end of the PAMs for Cpf1s; these Cpf1 cleavage sites may vary slightly, and the non-targeting strand is cleaved further from the PAM, leaving a 4-5nt overhang (Zetsche et al, Cell 2015; doi:10.1016/j.cell.2015.09.038)
  • The letters L/l are not used in the Legend column to avoid confusion with I/i for some fonts.

Selected duplications that are in gnomAD as well as ClinVar

Below is a table of the duplication variants from the final column (dup2iP) of the table above, which are annotated as “Pathogenic” or “Pathogenic/Likely_pathogenic” in ClinVar and are observed in the gnomAD exome (EX_ columns) or genome (GN_ columns) coding regions. Allele frequencies from the 1000 Genomes Project (TGP_ columns) are also included, but only a few pathogenic duplications were observed there, all of which were also observed in gnomAD.

The columns are as follows:

  • SEQ - Chromosome for variant
  • POS - Start position of REF allele for variant (trimmed and left-normalized)
  • REF - Reference allele; following VCF convention this has at least one nucleotide even for insertions
  • ALT - Alternate allele; note that for insertions the first nucleotide is the REF allele and is not itself part of the inserted sequence
  • INS_LENGTH - Length of inserted sequence (here length of ALT minus length of REF)
  • ALLELEID - From ClinVar: Allele ID of variant; clicking on link will load page from ClinVar
  • GENEINFO - From ClinVar: symbol and ID (separated by colon) of gene impacted by variant
  • MAX_AF_ANY - The maximum allele-frequency from any population in gnomAD or TGP (maximum of GN_AF_POPMAX, EX_AF_POPMAX, and TGP_AF_POPMAX); the corresponding allele count (AC) and total allele number (AN) can be found in the columns for each dataset. Note that the variance in these estimates can be high, particularly for TGP which has smaller populations.
  • MAX_AF_WHICH - Indicates the dataset (GN, EX, or TGP) and population (three-letter code) that attains this maximum (see descriptions of other columns below for details).
  • CLNSIG - From ClinVar: clinical significance, with conflicts handled as described here: https://www.ncbi.nlm.nih.gov/clinvar/docs/clinsig/
  • CLNDN - From ClinVar: disease name(s); this sometimes (but not usually) includes information on mode-of-inheritance. Some variants annotated as pathogenic in ClinVar do not have any associated diseases listed in this column, but clicking on the ALLELEID entry will load the ClinVar webpage for the variant, which in some cases does have information on the associated disease in the “Assertion and evidence details” section.
  • CLNHGVS - From ClinVar: HGVS notation for effect on genome (https://varnomen.hgvs.org/)
  • CLNOMIM - From ClinVar: OMIM ID(s) associated with disease, excerpted from the CLNDISDB field in the ClinVar vcf
  • OMIM_MOI - From OMIM: modes-of-inheritance associated with IDs in CLNOMIM, taken from the “Phenotype” column of the table genemap2.txt from https://omim.org/downloads/ (registration required). Abbreviations: AD: Autosomal dominant; AR: Autosomal recessive; MF: Multifactorial; MF: Mitochondrial; XLD: X-linked dominant; DR: Digenic recessive. Note that the modes-of-inheritance listed refer to the disease with the given OMIM ID, and different variants or genes associated with the disease may have different modes.
  • RS - From ClinVar: dbSNP ID (rsID); different ALT alleles at the same position can have the same rsID, so this does not uniquely identify the variant. Also, this rsID may not agree with that in the VEP Existing_variation column (due e.g. to merged or changed IDs, or differences in normalization of variant)
  • GNOMAD_ID - Combination of SEQ-POS-REF-ALT; clicking link loads gnomAD page for variant
  • IMPACT - From VEP: HIGH (e.g. nonsense, splice-site, frame-shift), MODERATE (e.g. missense, in-frame), LOW (e.g. synonymous), or MODIFIER predicted impact. See https://useast.ensembl.org/info/docs/tools/vep/vep_formats.html#output for details on this and other VEP fields. VEP version 85 was used, and the consequence to show is chosen with a preference for (HC > LC > no) LoF, then (HIGH > MODERATE > LOW > MODIFIER) IMPACT, then by distance to nearest gene (if none overlap), then with a preference for canonical transcripts, then in reverse alphabetical order by gene-symbol.
  • SYMBOL - From VEP: symbol for impacted gene.
  • HGVSc - From VEP: HGVS notation for effect on transcript (https://varnomen.hgvs.org/)
  • HGVSp - From VEP: HGVS notation for effect on protein (https://varnomen.hgvs.org/)
  • DISTANCE - From VEP: shortest distance from variant to transcript (blank when variant is within 8bp of exon)
  • Existing_variation - From VEP: rsID and other IDs; rsID may not agree with RS from ClinVar.
  • LoF - Predicted loss-of-function from the VEP LOFTEE plugin (HC/LC for High/Low confidence)
  • (EX|GN)_FILTER - FILTER field for variant in gnomAD exomes (EX) or genomes (GN); AC0 means no high-confidence ALT alleles were called
  • (EX|GN)_NUM_ALT - Number of distinct ALT alleles at this position in gnomAD exomes (EX) or genomes (GN); greater than one indicates multi-allelic sites, for which extra caution is needed when matching up variant IDs and allele frequencies with other databases.
  • (GN|EX|TGP)_(AC|AN|AF) - The columns *_AC and *_AN give the counts of the ALT allele (*_AC) and all alleles (*_AN) from reference datasets; their ratio gives the allele frequency (*_AF). Here the prefixes GN, EX, and TGP indicate that the data is from the gnomAD genomes, gnomAD exomes, or 1000 Genomes Project phase 3 data.
  • BOTH_* - Allele counts (AC), total allele numbers (AN) and allele frequencies (AF) for the gnomAD genomes and exomes combined. Note that when a variant is observed in the exomes but not the genomes these just reduce to the AC/AN/AF for the exomes; this is not really a combined AF, as it doesn’t account for the number of alleles with REF basecalls in the genomes (which could be anywhere from 0 to ~30,000; this info is not present in the vcf for unobserved variants but could perhaps be extracted from the coverage files). This is the same behavior as on the gnomAD website, and the same caveat applies to variants observed in the genomes but not the exomes.
  • *_POPMAX - The population with the highest allele frequency for this variant, and its AC/AN/AF. See http://gnomad.broadinstitute.org/faq for the meaning of the three-letter codes for populations in gnomAD and the total number of exomes and genomes included from each population. See http://www.internationalgenome.org/faq/which-populations-are-part-your-study for the three-letter codes for the 26 populations for the TGP data, most of which included approximately 100 samples (range of 61 to 113).
  • CONTEXT - Shows the following tracks, from top to bottom, lined-up so that sequence-positions correspond between tracks:
    1. SEQ_DUP shows the duplicated sequence in upper-case (including the extra copy on the variant allele), and flanking sequence in lower case.
    2. DUP_NUM labels the copies of the duplicated segment (1 for first copy, 2 for second); these positions are also color-coded yellow and cyan in the html table, but the colors are lost when exporting the table as a CSV or Excel file.
    3. Cas9_Wa shows cleavage sites for Cas9 enzymes on the Watson strand, 3 bases left of PAM starts from PAM_Wa_* columns.
    4. Cas9_Cr shows cleavage sites for Cas9 enzymes on the Crick strand, 4 bases right of PAM starts from PAM_Cr_* columns.
    5. Cpf1_Wa shows approximate cleavage sites for Cpf1 enzymes on the Watson strand, 19 bases right of PAM ends (from the PAM starts in PAM_Wa_* columns and adjusted for motif widths).
    6. Cpf1_Cr shows approximate cleavage site for Cpf1 enzymes on the Crick strand, 18 bases left of PAM ends (from the PAM starts in PAM_Cr_* and adjusted for motif widths).
    • The +/-1 base differences in shifts between Watson and Crick tracks are there so that cleavage positions are to the immediate left of the indicated base in both cases (which wouldn’t be an issue if we were labelling the spaces between bases rather than the bases themselves).
    • The cleavage sites are labeled according to the Legend column in the table of PAM sequences above, with an upper-case letter if it’s the only matching PAM sequence, and a lower-case letter if it’s the first of more than one matching PAM sequence.
    • The Cpf1 cleavage sites are staggered on the two strands, leaving an overhang of 4-5 nt at the double-stranded break, which is not indicated in these schematics.
    • Motifs are scanned for in flanking regions of size 50 and the CONTEXT column includes flanking regions of size 25, so cleavage sites should be shown even if the PAM site itself does not fall within the displayed sequence (as the distance between the cleavage site and the furthest position in the PAM site is no more than 25 bases).
    • Here’s a more compact representation of the information in the CONTEXT column as an image, but embedding an image for each entry in the table bogs things down:
  • Cut_* - The cut site from the following PAM_* columns that has coordinate closest to zero (position exactly between the two copies of the duplication), separately for the Watson (Cut_Wa) and Crick (Cut_Cr) strands. The PAM sequence corresponding to this cut is listed in parentheses (just the first one from the PAM table if there is more than one with this predicted cut position). Caution: for Cpf1s with staggered DSBs only the coordinate on the targeted strand is considered, and a Cas9 with a blunt DSB may be preferable even if it is further from the central position.
  • PAM_* - Predicted cut sites and start sites for the indicated PAM motif (suffix of column name) on the Watson (PAM_Wa_* ) or Crick (PAM_Cr_* ) strand. Coordinates are set so that zero indicates cleavage at the bond exactly between the two copies of the duplicated sequence for cuts, and zero indicates the first base in the second copy of the duplication for PAM starts. Negative numbers are positions 5’ of this and positive numbers are positions 3’ of this, always on the Watson strand. For Cas9s, if -INS_LEN < coord < +INS_LEN then the predicted double-stranded break site is somewhere in the duplicated sequences, leaving INS_LEN - abs(coord) homologous bases on either side of the cut. For Cpf1s the situation is a bit more complicated due to the staggered cut.
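The Cut_* selection and the homology rule for Cas9 cuts in the last two entries can be sketched as follows. The function names and the example coordinates are made up for illustration; returning None for cuts outside the duplicated sequences is a convention chosen here, since the text gives a formula only for cuts inside them, and the staggered Cpf1 case is not handled.

```python
def closest_cut(cut_coords):
    """Pick the cut coordinate closest to zero (the junction between the two copies)."""
    return min(cut_coords, key=abs) if cut_coords else None

def homologous_bases(coord, ins_len):
    """Homologous bases on either side of a blunt (Cas9) cut at `coord`.

    Returns None when the cut lies outside the duplicated sequences.
    """
    if -ins_len < coord < ins_len:
        return ins_len - abs(coord)
    return None

cuts = [-12, -4, 7]  # hypothetical cut coordinates for one strand
best = closest_cut(cuts)
print(best, homologous_bases(best, ins_len=10))
```

A cut exactly at the junction (coord 0) leaves the full INS_LEN bases of homology on each side, and the available homology shrinks by one base for every base the cut moves away from the junction.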

Notes on table of variants:

  • There are many fewer samples in the TGP (~2,500) than in the gnomAD genomes (~15,000) or exomes (~120,000), so many rare variants seen in the gnomAD datasets are not seen at all in the TGP. One advantage of the TGP data is that it categorizes subjects into narrower populations, so it may reveal alleles that have much greater frequency in specific populations. See for example ClinVar Allele 20316, a duplication in HPS1 associated with Hermansky-Pudlak syndrome, which has much higher frequency in Puerto Ricans than in other TGP populations or in the broader gnomAD populations; see screengrabs from https://www.ncbi.nlm.nih.gov/variation/tools/1000genomes/?gts=rs281865163 and http://gnomad.broadinstitute.org/variant/10-100183554-T-TGGGCCTCCCCTGCTGG below:
 

 

gnomAD exomes   gnomAD genomes

  • There do not appear to be many other variants with the same flavor as that HPS1 example, though: only two other duplications in the table below appear in TGP at all, and although those are likewise enriched in specific populations, both are duplications of length just one.
  • Clicking on a column heading will sort by that column, and filters can be applied to individual columns (sliders, multiple-choice, or text-matching, as appropriate). E.g., one can set the slider for the INS_LEN column to restrict to duplications of length 4 to 20. Click on the circled-x in a filter box to cancel a filter. The main “Search” box applies to all columns; it is case-sensitive and allows regular expressions in JavaScript syntax — e.g. entering (TCAP|HPS1|HEXA|DOK7) in the search box would match any of those four strings, and entering (M|m)usc would match Muscle, muscle, Muscular, muscular,…
  • Clicking the “CSV” or “Excel” button will save the table as a comma-separated file or an xlsx file, both of which can be opened in Excel; you may want to change the CONTEXT column to a monospace font such as “Courier”, so that when you double-click on an entry the multiple rows from the CONTEXT field can line-up properly as in the screengrab below (which may have used a different PAM-list, so the details of the tracks may differ):
 

 

 

Table 3: selected pathogenic duplications from ClinVar that are observed in gnomAD

 

   

What are we missing out on by only looking at the “coding” regions (exome_calling_regions.v1.interval_list) in the gnomAD genomes?

A lot of duplications, no doubt — perhaps around 98% of them — but it does not appear that any of these are annotated as “Pathogenic” in ClinVar. Certainly there are many variants listed in ClinVar that are not observed in either the gnomAD genomes or exomes, so are not accounted for in the table above, and this includes 2189 duplications that satisfy all the additional conditions for being in column dup2iP above. But 2183 of these are in these “coding” intervals, so if the variants had been observed at all in gnomAD they would have been reported in these vcfs. For the 6 that are not within these intervals, one can check for them in the full (not-just-coding-intervals) gnomAD and TGP vcfs — they are not observed there either. These variants are listed below; they are mainly variants in UTRs or in intronic regions >50 bases from the nearest exons (and hence not in the “coding” intervals list).

Table 4: selected pathogenic duplications from ClinVar that are outside of gnomAD “coding” regions

SEQ POS REF ALT INS_LEN ALLELEID GENEINFO CLNSIG CLNDN OMIM_MOI
2 47656744 G GTGAGCCACTGCGCCCAGCA… 454 95998 MSH2:4436 Pathogenic Lynch syndrome
10 86018468 A AG 1 431743 RGR:5995 Pathogenic Retinal dystrophy
17 1665408 G GT 1 50330 SERPINF1:5176 Pathogenic Osteogenesis imperfecta, type VI
17 29554163 G GAGCTTATCAGGTTCTCCAT… 337 213707 NF1:4763 Pathogenic Neurofibromatosis, type 1 AD
17 29556697 T TGGGTACGAGTGTCTGCGTA… 412 213709 NF1:4763 Pathogenic Neurofibromatosis, type 1 AD
20 18038617 A ACCGGTTCCGGCGGCCGGGG… 22 226677 OVOL2:58495 Pathogenic Posterior polymorphous corneal dystrophy 1 AD

What other pathogenic duplications in ClinVar might we be missing when looking at gnomAD/TGP for allele frequencies?

It wouldn’t be surprising to miss out on variants that are extremely rare in general, or even not-terribly-rare variants that are concentrated in populations without many samples: with only ~100 subjects per population in the TGP data one would expect to miss out on ~13% of alleles with frequency 0.01 in these populations. And a few other possibilities:

  • Subjects known to have severe pediatric disease were not included in the gnomAD dataset, so variants that cause these diseases may be under-represented, in particular those with dominant inheritance.

  • About 8% of the genome was masked during the gnomAD variant calling (e.g. some repetitive sequence), so any ClinVar variants that fall in these regions will not be reported in gnomAD. But it appears that only one of the ~13000 insertions from ClinVar falls in one of these masked regions: a benign variant in SHOX1, which is masked since it’s in a PAR on the Y chromosome.

  • gnomAD doesn’t report variants on the Y chromosome at all, whether in masked regions or not. But the only “Pathogenic” duplication on the Y chromosome in ClinVar is a single-base insertion, Y:2655380:C/CT in the gene SRY: https://www.ncbi.nlm.nih.gov/clinvar/variation/470195/

  • The longest insertion reported in the gnomAD coding regions has length 621, and there are 1431 insertions of length at least 100, but it’s possible that the detection sensitivity may decline for longer insertions. Below is a table of the 9 “Pathogenic” duplications with length at least 100 from ClinVar that satisfy the conditions of the column dup2iP above, none of which are observed in gnomAD.
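The ~13% figure quoted at the start of this section for alleles at frequency 0.01 in a TGP population of ~100 subjects can be reproduced with a short calculation. This sketch assumes diploid subjects and independent sampling of alleles, so that the chance of never drawing the allele is (1 - f)^(2N); the function name is hypothetical.

```python
def prob_allele_unobserved(allele_freq, n_subjects, ploidy=2):
    """Probability that an allele at the given frequency is absent from a sample."""
    n_alleles = ploidy * n_subjects
    return (1.0 - allele_freq) ** n_alleles

p_missed = prob_allele_unobserved(allele_freq=0.01, n_subjects=100)
print(round(p_missed, 3))
```

With the smallest TGP population size mentioned above (61 samples) the same calculation gives a larger chance of missing such an allele, since fewer alleles are sampled.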

 

Table 5: selected pathogenic duplications of length at least 100 from ClinVar

SEQ POS REF ALT INS_LEN ALLELEID GENEINFO CLNSIG CLNDN OMIM_MOI
2 47656744 G GTGAGCCACTGCGCCCAGCA… 454 95998 MSH2:4436 Pathogenic Lynch syndrome
2 145156601 T TTGGGAGCTAACGGCTTGGA… 115 448705 ZEB2:9839 Pathogenic Mowat-Wilson syndrome AD
2 200298058 T TACCTTGGGCCTGGGCCGCA… 177 214752 SATB2:23314 Pathogenic Chromosome 2q32-q33 deletion syndrome AD
3 138665088 G GTGCGCGGGCGGCGGCCGGA… 124 354130 FOXL2:668 Pathogenic Blepharophimosis, ptosis, and epicanthus inversus AD
7 150644477 C CCGGGCTGGAGAGGGGGATG… 275 424894 KCNH2:3757 Pathogenic Long QT syndrome 2 AD
17 29554163 G GAGCTTATCAGGTTCTCCAT… 337 213707 NF1:4763 Pathogenic Neurofibromatosis, type 1 AD
17 29556697 T TGGGTACGAGTGTCTGCGTA… 412 213709 NF1:4763 Pathogenic Neurofibromatosis, type 1 AD
17 41251791 C CCCAATTCAATGTAGACAGA… 1058 262984 BRCA1:672 Pathogenic Breast-ovarian cancer, familial 1 AD/MF
X 25023006 G GGGGGCCATTGTGGAAAAGA… 103 209011 ARX:170302 Pathogenic Lissencephaly 2, X-linked

 

Below is a table of “simple” duplications of length 2-40 that are annotated in ClinVar as “Pathogenic” or “Pathogenic/Likely pathogenic”, and are associated in ClinVar with a disease annotated as autosomal dominant (AD) in OMIM, but which do not appear in gnomAD:

Table 6: selected pathogenic duplications from ClinVar that are not observed in gnomAD

 
