---
title: # "Microduplications"
author: # Oliver King, oliver.king@umassmed.edu
date: # "04/25/2018, edited on 03/06/2019 to e.g. update URLs that have changed"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(width=200)
library(knitr)
# source("https://bioconductor.org/biocLite.R"); biocLite("Biostrings")
library(Biostrings)
# install.packages("DT")
library(DT)
DT:::DT2BSClass(c('compact', 'cell-border')) ## check
options(warn=-1) # suppresses warning about too-big-table.
```
## Website accompanying "Precise therapeutic gene correction by a simple nuclease-induced double-strand break"
*Note: The expanded searchable/sortable/filterable version of Supplementary Table 3 from the manuscript appears as Table 3 below.*
### Links to tables that appear below
* [Table 1: counts of duplications in gnomAD coding regions](#anch1)
* [Table 2: PAM sequences scanned for in duplications and flanking sequence](#anch2)
* [Table 3: **selected pathogenic duplications from ClinVar that are observed in gnomAD (searchable)**](#anch3)
* [Table 4: selected pathogenic duplications from ClinVar that are outside of gnomAD "coding" regions](#anch4)
* [Table 5: selected pathogenic duplications of length at least 100 from ClinVar](#anch5)
* [Table 6: selected pathogenic duplications from ClinVar that are not observed in gnomAD (searchable)](#anch6)
```{r load, include=F}
## from combine_annotations_v1.R
load("itab_GET_OTH.RData");
itab$TANDEM_N = as.numeric(itab$TANDEM_N)
itab$SHIFT_SYMM = as.numeric(itab$SHIFT_SYMM)
table(itab$EX_AF > itab$EX_AF_POPMAX)
table(!is.na(itab$EX_AF) & (itab$EX_AF > 0) & is.na(itab$EX_AF_POPMAX)) # 16 T
table(!is.na(itab$GN_AF) & (itab$GN_AF > 0) & is.na(itab$GN_AF_POPMAX)) # 22 T
table(itab$EX_AF > itab$EX_AF_POPMAX) # 0 T
table(itab$GN_AF > itab$GN_AF_POPMAX) # 2 T
table(is.na(itab$EX_AF) & !is.na(itab$EX_AF_POPMAX)) # 112 T
table(is.na(itab$GN_AF) & !is.na(itab$GN_AF_POPMAX)) # 36 T
check.me = which(!is.na(itab$GN_AF) & (itab$GN_AF > 0) & is.na(itab$GN_AF_POPMAX)) # 22 T
itab[check.me,] ## OTHER!
nflank = 50; ## for motif scanning
dflank = 25; ## for display
## from count_duplications_v1.R
load("counts_by_chrom_GE.RData")
countGE = count.list[[1]];
for (k in 2:length(count.list)) countGE = countGE + count.list[[k]]
## from vt and bvftools output
# bcftools view -H gnomad.genomes.r2.0.2.sites.coding_only.chr1-22.vtdn.insertions.hai.cv.vcf.gz | wc -l
# 142459
# bcftools view -H gnomad.genomes.r2.0.2.sites.coding_only.chrX.vtdn.insertions.hai.cv.vcf.gz | wc -l
# 3433
# bcftools view -H gnomad.exomes.r2.0.2.sites.vtdn.insertions.hai.cv.vcf.gz | wc -l
# 414576
sstats=list("gnA.all"=4743583, "gnX.all"=107555, "ex.all"=17009588,
"gnA.ins"=142459, "gnX.ins"=3433, "ex.ins"=414576)
sstats$gn.all = sstats$gnA.all + sstats$gnX.all;
sstats$gn.ins = sstats$gnA.ins + sstats$gnX.ins;
```
### Overview
The results below for the most part are based on the files of "coding" variants from gnomAD genomes and exomes, version 2.0.2 (https://console.cloud.google.com/storage/browser/gnomad-public/release/2.0.2/)
- gnomad.genomes.r2.0.2.sites.coding_only.chr1-22.vcf
- gnomad.genomes.r2.0.2.sites.coding_only.chrX.vcf
- gnomad.exomes.r2.0.2.sites.vcf
These consists of all variants in the intervals used for ExAC (ftp://ftp.broadinstitute.org/pub/ExAC_release/release1/resources/)
- exome_calling_regions.v1.interval_list
Most of these intervals correspond to exons plus 50 flanking bases on each side, and they collectively cover 60 million bases,
about 2% of the genome. Note that there are no variant calls for the Y chromosome, and these are not strictly all coding
variants, as some are in introns, UTRs, miRNA, ncRNA.
The 1000 Genome Project data was taken from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/.
The vcf files there include precomputed allele-frequencies for only five broad super-populations;
the allele-frequencies for 26 more-specific populations computed from the per-individuals genotypes in the vcf files,
aggregated using the population assignments from the file integrated_call_samples_v3.20130502.ALL.panel .
The ClinVar annotations were taken from the file ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/archive_2.0/2018/clinvar_20180225.vcf.gz
Note that matching up variants from two different sources (e.g. ClinVar and gnomAD) can sometimes be tricky,
particularly for indels and multi-allelic sites, since the same variant may have multiple representations.
(The gnomAD vcf files themselves have ClinVar annotations included, as part of the Ensembl VEP
output, but this seems to have many spurious or missing or out-of-date annotations; likewise for
the ExAC and TGP allele frequecies included in the ClinVar vcf file.) Here the variants have been
decomposed and normalized (trimmed and left-aligned) ([vt](https://github.com/atks/vt) 0.5772) for the
purpose of matching them up ([bcftools](https://github.com/samtools/bcftools) 1.9) but the HGNC notation used at
ClinVar may follow the right-aligned (3'-most position) convention, in which duplications are taken to occur immediately
after the repeated sequence rather than immediately before the repeated sequence.
The gnomAD _genome_ files above contain a total of `r sprintf("%d", sstats$gn.all)` distinct variant alleles,
of which `r sprintf("%d", sstats$gn.ins)` (~`r round(100*sstats$gn.ins/sstats$gn.all,1)`%) are insertions.
The gnomAD _exome_ files above contain a total of `r sprintf("%d", sstats$ex.all)` distinct variant alleles,
of which `r sprintf("%d", sstats$ex.ins)` (~`r round(100*sstats$ex.ins/sstats$ex.all,1)`%) are insertions.
Note that many of these variants are common to both the exomes and genomes, but in the tables below variants
that occur in both are counted only once.
This table focuses on the insertions, and in particular the duplications. The second column (**insertions**)
gives the counts of all the distinct insertion variant alleles, binned by the length of the
insertion (**length**), with all variants of length at least 40 combined into one bin. Subsequent columns give
the number of variants that satify additional criteria, as follows:
* **dup**: the insertion is an exact duplication of the immediately adjacent sequence in the GRCh37 reference genome
(immediately 3' with this normalization). Note that there may be polymorphisms in this adjacent sequqnce that affect
whether an insertion is indeed a perfect duplication for any given individual.
* **dup2**: the insertion does not add a repeat-unit to what is already a (two-or-more unit) tandem repeat in the
reference genome. This eliminates e.g. the duplication of CCCGGG in RAX2, as the reference genome already has two
immediately adjacent (3') tandem copies of this (screengrab from https://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=rs549932754 below)
![ ](RAX2_screengrab.png)
* **dup2i**: the insertion satisfies the previous constraints and is not itself a perfect tandem repeat (e.g., for a
duplicated six-mer, it is not of the form XXXXXX, XYXYXY or XYZXYZ). Note that even if a duplicated sequence is not
itself a perfect tandem repeat it may contain internal tandem repeats -- e.g. the AGGAGG in the duplicated AAGGAGGATC
in _NCF4_ --- so depending on where the Cas9 cleavage site is this may need to be considered, to prevent a shorter
internal microduplication from being collapsed instead of the full duplication.
* **dup2iC**: the variant satifies the previous constraints and is also listed in ClinVar.
* **dup2iL**: the variant satifies the previous constraints and is reported in Clinvar as "Pathogenic", "Pathogenic/Likely_pathogenic", "Likely_pathogenic" or "Conflicting_interpretations_of_pathogenicity"
* **dup2iP**: the variant satifies the previous constraints and is reported in Clinvar as "Pathogenic" or "Pathogenic/Likely_pathogenic"
"Simple" duplications: We will refer to those duplications that satisfy conditions **dup2** and **dup2i**
as "simple" duplications. Although neither condition is strictly necessary for using a
Microhomology-Mediated End-Joining (MMEJ) strategy for correction, they can reduce some potential
difficulties with the nuclease cleaving the wild-type or already-corrected alleles as well as the
targeted duplication-containing allele(s), or having multiple repair products with different
numbers of repeats or sub-repeats collapsed. Note that these conditions exclude repeat-expansion
disorders such as the polyQ/CAG expansion in Huntington's disease, and the _C9orf72_ hexanucleotide
expansion in FTD and/or ALS, in which shortening of the expansions to below critical thresholds
may be more relevant than repair to a single precise wild-type length:
- https://www.ncbi.nlm.nih.gov/clinvar/variation/31915/
- https://www.ncbi.nlm.nih.gov/clinvar/variation/31151/
- https://www.ncbi.nlm.nih.gov/clinvar/variation/183034/
(The three variants above do not appear to be included in the ClinVar vcf at all --- the ClinVar
webpages assign them variant type "Microsatellite", some of which are included in the vcf, so perhaps
their absense is due to the lack of a single fixed-length ALT allele in their HGVS descriptions.)
Simple duplications were identified using a modified version of the function `annotate_indels` from the
[vt](https://github.com/atks/vt) tool set; this and other custom code can be found [here](./code/).
#### Table 1: counts of duplications in gnomAD coding regions {#anch1}
```{r table, echo=F}
dd = data.frame(length=c(1:39, "40+"), countGE[1:40,])
dd[40,-1] = colSums(countGE[-c(1:39),])
kable(dd, row.names=F)
```
Here are the totals from the table above:
```{r total, echo=F}
ee = dd
ee[,1]=as.character(ee[,1])
cc = colSums(ee[,-1])
ee[1,-1] = cc;
ee[1,1]= "all";
kable(ee[1,, drop=F], row.names=F)
check.me.in.callset = which(itab$IN_CALLSET==1 &
grepl("Path", itab$CLNSIG) &
itab$SHIFT_SYMM==0 &
itab$TANDEM_N==1 &
is.na(itab$GN_AN) &
is.na(itab$EX_AN))
check.me.off.callset = which(itab$IN_CALLSET==0 &
grepl("Path", itab$CLNSIG) &
itab$SHIFT_SYMM==0 &
itab$TANDEM_N==1 &
is.na(itab$GN_AN) &
is.na(itab$EX_AN))
check.me.long = which(itab$INS_LEN >=100 &
grepl("Path", itab$CLNSIG) &
itab$SHIFT_SYMM==0 &
itab$TANDEM_N==1)
# table(itab$INS_LEN[check.me0])
```
```{r bar, echo=F, fig.width=9, fig.height=4}
## some precomputed counts
load("bpmats.RData")
```
Below is a barplot illustrating the stratification of the `r format(sum(tab.mat2$insertion), scientific=F)` insertions
from the table above into progressively finer subcategories. For simplicity the two levels restricting
to **dup2** and **dup2i** have been combined into a single levels restricting to "simple" duplications, and the
level **dup2iL** has been omitted between the restrictions to variants listed in ClinVar (**dup2iC**) and to
variants listed as Pathogenic or Pathogenic/Likely_pathogenic in ClinVar (**dup2iP**).
````{r bar2, echo=F, fig.width=9, fig.height=6}
mxi = nrow(tab.diff);
par(mar=c(4,5,1.5,1), mgp=c(2.8,0.5,0), tcl=-0.25)
bp = barplot(data.matrix(t(tab.diff2[,6:2])), ylim=1.1*c(0, tab.mat2[1,2]), las=2,
col=rev(c("grey", "red", "orange", "cyan", "green")), border=NA,
names.arg=c(1:(mxi-1), paste0(mxi,"+")), cex.names=0.8, cex.axis=0.8,
xlab="", ylab="Number of variants", xlim=c(-1, 46.3+1), xaxs="i")
mtext("Insertions from gnomAD coding regions", line=0, font=2, cex=1.2)
title(xlab="Length of microduplication (bp)", mgp=c(2,0.5,0))
legend("topright", legend=rev(c("pathogenic in ClinVar", "listed in ClinVar",
"\"simple\" duplication",
"duplication", "insertion")),
border=NA, bty="n", fill=c("grey", "red", "orange","cyan","green"), cex=1)
```
Below is a more "ClinVar-centric" view of the insertions, beginning with the `r sum(tab.mat$insertion)`
insertions annotated as "Pathogenic" or "Pathohgenic/Likely_pathogenic" in ClinVar, and stratifying them
into progressively finer subcategories, arriving at the final level --- those variants also observed at
least once in gnomAD exome or genome "coding" regions --- with the same set of duplications at the final
level above.
````{r bar3, echo=F, fig.width=9, fig.height=6}
par(mar=c(4,5,1.5,1), mgp=c(2.8,0.5,0), tcl=-0.25)
bp = barplot(data.matrix(t(tab.diff[,5:2])), ylim=1.1*c(0, tab.mat[1,2]), las=2,
col=rev(c("grey", "red", "orange", "green")), border=NA,
names.arg=c(1:(mxi-1), paste0(mxi,"+")), cex.names=0.8, cex.axis=0.8,
xlab="", ylab="Number of variants", xlim=c(-1, 46.3+1), xaxs="i")
mtext("Pathogenic insertions from ClinVar", line=0, font=2, cex=1.2)
title(xlab="Length of microduplication (bp)", mgp=c(2,0.5,0))
legend("topright", legend=rev(c("observed in gnomAD", "\"simple\" duplication",
"duplication", "insertion")),
border=NA, bty="n", fill=c("grey", "red", "orange", "green"), cex=1)
```
(Note that the two plots above differ from Extended Data Fig 10 in the manuscript in that they also include
insertions of length 1. Note also that the counts of pathogenic insertions dip for lengths 3, 6, 9 ...,
as expected since multiples of three may result in in-frame rather than frame-shift mutations to proteins.)
```{r pam0, echo=F}
## TODO: include table
PAM.tab = read.table("~/Dropbox/microduplications/Cas9_and_Cpf1_cleavage_information_v2.txt",
sep="\t", header=T, quote="", stringsAsFactors=F)
colnames(PAM.tab)=gsub(".","_", colnames(PAM.tab), fixed=T)
# IUPEC from https://www.bioinformatics.org/sms/iupac.html
expandIUPEC = function(x) {
x=gsub("R", "[AG]", x);
x=gsub("Y", "[CT]", x);
x=gsub("S", "[CG]", x);
x=gsub("W", "[AT]", x);
x=gsub("K", "[GT]", x);
x=gsub("M", "[AC]", x);
x=gsub("B", "[CGT]", x);
x=gsub("D", "[AGT]", x);
x=gsub("H", "[ACT]", x);
x=gsub("V", "[ACG]", x);
x=gsub("N", "[ACGT]",x);
x
}
# PAMrx = expandIUPEC(PAM.list);
# names(PAMrx)=names(PAM.list)
```
### PAM sequences
Below are the PAM sequences that are scanned for, taken from the table at https://www.addgene.org/crispr/guide/#pam-table and
from Hu et al (Nature 2018; doi:10.1038/nature26155) [see also http://blog.addgene.org/xcas9-engineering-a-crispr-variant-with-pam-flexibility].
In the column **Expanded** the IUPAC codes in the column **Pattern** are expanded into a regular expression, with
multiple allowed bases enclosed by brackets. (Note that the R regular expression scanning function `gregexpr`
can return multiple hits per scanned sequence, but will not return overlapping hits; this uses a modified version,
which will return the positions of all matches, even if they overlap.) The column **Legend** shows the characters
used to indicate start positions of PAM sequences in sequence schematics; upper-case characters indicate that
this is the only PAM sequence beginning at the position, and lower-case characters indicate that there are
additional PAM sequences beginning at the position (but only the character for the first is shown). The
column **Side** indicates to which side of the guide RNA the PAM sequence is located.
#### Table 2: PAM sequences scanned for in duplications and flanking sequence {#anch2}
```{r pam, echo=F}
# skip L/l to avoid confusion with I/i
pam.df = data.frame(Legend=paste(LETTERS[-12][1:nrow(PAM.tab)],letters[-12][1:nrow(PAM.tab)], sep="/"),
Species_and_Variant_of_Cas9 = PAM.tab$Species_and_Variant_of_Cas9,
Side=ifelse(grepl("3'", PAM.tab$PAM_Sequence, fixed=T), "3'", "5'"),
Pattern=sub(" or ", "|", sub("3' |5' ", "", PAM.tab$PAM_Sequence)))
pam.df$Expanded = expandIUPEC(pam.df$Pattern)
pam.df$CleavageSite=ifelse(grepl("Cpf1", pam.df$Species_and_Variant_of_Cas9), "approx +18 from end", "-3 from start")
pam.df$Width = nchar(sub("\\|.*$", "", pam.df$Pattern)) ## NOTE: assumes all have same width for patterns with |
kable(pam.df, row.names=F)
write.table(pam.df, "pam_table.txt", sep="\t", quote=F, row.names=F)
```
```{r dummy}
```
#### Notes on PAM sequences:
* In the table above the rows for the AsCpf1 RR variant and the LbCpf1 RR variant are combined
as their PAM sequences in the table at Addgene are identical.
* The entry for SpCas9 D1135E variant with PAM sequence "3' NGG (reduced NAG binding)" is omitted, since the NGG is
identical to the usual SpCas9 PAM, although one could scan for N[GA]G instead to pick up the the weaker NAG binding as well.
* For SpCas9 VQR variant the pattern picks up either of the motifs in the Addgene PAM sequence "3' NGAN or NGNG".
* For SaCas9 the Addgene PAM sequence is "3' NNGRRT or NNGRR(N)" --- here just the first of these is used, as the
second pattern would subsume the former; according the Addgene manual the efficiency is best for T but other bases
can be consisdered when evaluating off-target effects.
* The predicted cleavage sites are 3 bases before the start of the PAM for Cas9, and 18 bases after the end of the
PAMs for Cpf1s --- these Cpf1 cleavage sites may vary slightly, and the non-targeting strand is cleaved further
from the PAM, leaving a 4-5nt overhang (Zetche et al, Cell 2015; doi:10.1016/j.cell.2015.09.038)
* The letters L/l are not used in the Legend column to avoid confusion with I/i for some fonts.
### Selected duplications that are in gnomAD as well as ClinVar
Below is a table of the duplication variants from the final column (**dup2iP**) of the table above, which are
annotated as "Pathogenic" or "Pathogenic/Likely_pathogenic" in ClinVar and are observed in the gnomAD exome
(**EX_** columns) or genome (**GN_** columns) coding regions. Allele frequencies from the 1000 Genome Project
(**TGP_** columns) are also included, but only a few pathogenic duplications were oberved there, all of
which were also observed in gnomAD.
The columns are as follows:
* **SEQ** - Chromosome for variant
* **POS** - Start position of **REF** allele for variant (trimmed and left-normalized)
* **REF** - Reference allele; following VCF convention this has at least one nucleotide even for insertions
* **ALT** - Alternate allele; note that for insertions the first nucleotide is the **REF** allele and is not itself part of the inserted sequence
* **INS\_LENGTH** - Length of inserted sequence (here length of **ALT** minus length of **REF**)
* **ALLELEID** - From ClinVar: Allele ID of variant; clicking on link will load page from ClinVar
* **GENEINFO** - From ClinVar: symbol and ID (separated by colon) of gene impacted by variant
* **MAX\_AF\_ANY** - The maximum allele-frequency from any population in gnomAD or TGP (maximum of **GN\_AF\_POPMAX**, **EX\_AF\_POPMAX**, and **TGP\_AF\_POPMAX**); the corresponding allele count (AC) and total allele number (AN) can be found in the columns for each dataset. Note that the variance in these estimates can be high, particulary for TGP which has smaller populations.
* **MAX\_AF\_WHICH** - Indicates the dataset (GN, EX, or TGP) and population (three-letter code) that attains this maximum (see descriptions of other columns below for details).
* **CLNSIG** - From ClinVar: clinical significance, with conflicts handles are described here: https://www.ncbi.nlm.nih.gov/clinvar/docs/clinsig/
* **CLNDN** - From ClinVar: disease name(s); this sometimes (but not usually) includes information on mode-of-inheritance. Some variants annotated as pathogenic in ClinVar do not have any associated diseases listed in this columnn, but clicking on the **ALLELEID** entry will load the ClinVar webpage for the variant, which is some cases does have information on the associated disease in the "Assertion and evidence details" section.
* **CLNHGVS** - From ClinVar: HGVS notation for effect on genome (https://varnomen.hgvs.org/)
* **CLNOMIM** - From Clinvar: OMIM ID(s) associated with disease, excerpted from the CLNDISDB field in the ClinVar vcf
* **OMOM\_MOI** - From OMIM: modes-of-inheritance associated with IDs in **CLNOMIM**, taken from the "Phenotype" column of the table genemap2.txt from https://omim.org/downloads/ (registration required). Abbreviations: AD: Autosomal; AR: Autosomal recessive; MF: Multifactorial; MF: Mitochondrial, XLD: X-linked dominant; DR: Digenic recessive. Note that the modes-of-inheritance listed refer to the disease with the given OMIM ID, and different variants or genes associated with the disease may have different modes.
* **RS** - From ClinVar: dbSNP ID (rsID); different ALT alleles at the same postion can have the same rsID, so this does not uniquely identify the variant. Also, this rsID may not agree with that in the VEP **Existing\_variation** column (due e.g. to merged or changed IDs, or differences in normalization of variant)
* **GNOMAD\_ID** - Combination of **SEQ**-**POS**-**REF**-**ALT**; clicking link loads gnomAD page for variant
* **IMPACT** - From VEP: HIGH (e.g. nonsense, splice-site, frame-shift), MODERATE (e.g. missense, in-frame), LOW (e.g. synonymous), or MODIFIER predicted impact. See https://useast.ensembl.org/info/docs/tools/vep/vep_formats.html#output for details on this and other VEP fields. VEP version 85 was used, and the consequence to show is chosen with a preference for (HC > LC > no) LoF, then (HIGH > MODERATE > LOW > MODIFIER) IMPACT, then by distance to nearest gene (in none overlap), then with a preference for canonical transcripts, then in reverse alphabetical order by gene-symbol.
* **SYMBOL** - From VEP: symbol for impacted gene.
* **HGVSc** - From VEP: HGVS notation for effect on transcript (https://varnomen.hgvs.org/)
* **HGVSp** - From VEP: HGVS notation for effect on protein (https://varnomen.hgvs.org/)
* **DISTANCE** - From VEP: shortest distance from variant to transcript (blank when variant is within 8bp of exon)
* **Existing\_variation** - From VEP: rsID and other IDs; rsID may not agree with **RS** from ClinVar.
* **LoF** - Predicted loss-of-function from the VEP LOFTEE plugin (HC/LC for High/Low confidence)
* **\(EX|GN)\_FILTER ** - FILTER field for variant in gnomAD exomes (**EX**) or genomes (**GN**); AC0 means no high-confidence ALT alleles were called
* **\(EX|GN)\_NUM_ALT ** - Number of distinct ALT alleles at this position in gnomAD exomes (**EX**) or genomes (**GN**); greater than one indicates multi-allelic sites, for which extra caution is needed when matching up variant IDs and allele frequencies with other databases.
* **\(GN|EX|TGP)\_(AC|AN|AF) ** -- The columns **\*\_AC** and **\*\_AN** give the counts of the ALT allele (**\*\_AC**) and all alleles (**\*\_AN**) from reference datasets; their ratio gives the allele frequency (**\*\_AF**). Here the prefixes **GN**, **EX**, and **TGP** indicate that the data is from the gnomAD genomes, gnomAD exomes, or 1000 Genome Project phase 3 data.
* **BOTH_* **, - allele counts (**AC**), total allele numbers (**AN**) and allele frequencies (**AF**) for the gnomAD genomes and exomes combined. Note that when a variant is observed in the exomes but not genomes these just reduce to the AC/AN/AF for the exomes; this is not really a combined AF as it doesn't account for the number of alleles with REF basecalls in the genomes (which could be anywhere from 0 to ~30,000 --- this info is not present in the vcf for unobserved variants but could perhaps be extracted from the coverage files. This is the same behavior as on the gnomAD website, and the same caveat applies to variants observed in the genomes but not the exomes.
* **\*\_POPMAX** - The population with the highest allele frequency for this variant, and its **AC/AN/AF**. See http://gnomad.broadinstitute.org/faq for the meaning of the three-letter codes for populations in gnomAD and the total number of exomes and genomes included from each population. See http://www.internationalgenome.org/faq/which-populations-are-part-your-study for the three-letter codes for the 26 populations for the TGP data, most of which included approximately 100 samples (range of 61 to 113).
* **CONTEXT** - Shows the following tracks, from top to bottom, lined-up so that sequence-positions correspond between tracks:
1. SEQ_DUP shows the duplicated sequence in upper-case (including the extra copy on the variant allele), and flanking sequence in lower case.
2. DUP_NUM labels the copies of the duplicated segment (1 for first copy, 2 for second); these positions are also color-coded yellow and cyan in the html table, but the colors are lost when exporting the table as a CSV or Excel file.
3. Cas9_Wa shows cleavage sites for Cas9 enzymes on the Watson strand, 3 bases left of PAM starts from **PAM_Wa_* ** columns.
4. Cas9_Cr shows cleavage sites for Cas9 enzymes on the Crick strand, 4 bases right of PAM starts from **PAM_Cr_* ** columns.
5. Cpf1_Wa shows approximate cleavage sites for Cpf1 enzymes on the Watson strand, 19 bases right of PAM ends (from the PAM starts in **PAM_Wa_* ** columns and adjusted for motif widths).
6. Cpf1_Cr shows approximate cleavage site for Cpf1 enzymes on the Crick strand, 18 bases left of PAM ends (from the PAM starts in **PAM_Cr_* ** and adjusted for motif widths).
- The +/-1 base differences in shifts between Watson and Crick tracks is so that cleavage positions are to the immediate left of the indicated base in both cases (which wouldn't be an issue if we were labelling the spaces between bases rather than the bases themselves).
- The cleavage sites are labeled according to the **Legend** column in the table of PAM sequences above, with an upper-case letter if it's the only matching PAM sequence, and a lower-case letter if it's the first of more-then-one matching PAM sequence.
- The Cpf1 cleavage sites are staggered on the two strands, leaving an overhang of 4-5 the double-stranded break, not indicated in these schematics
- Motifs are scanned for in flanking regions of size `r nflank` and the **CONTEXT** column includes flanking regions of size `r dflank`, so cleavage sites should be shown even if the PAM site itself does not fall within the displayed sequence (as the distance between the cleavage site and the furthest position in the PAM site is no more than 25 bases).
- Here's a more compact representation of the information in the **CONTEXT** column as an image, but embedding an image for each entry in the table bogs things down:
![](temp_pam_TCAP.png){width=650px}
* **Cut_* ** - The cut site from the following **PAM_* ** columns that has coordinate closest to zero (position exactly between the two copies of the duplication), separately for the Watson (**Cut_Wa**) and Crick ("**Cut_Cr**") strands. The PAM sequence corresponding to this cut is listed in parentheses (just the first one from the PAM table if there is more one with this predicted cut position). Caution: for Cpf1s with staggered DSBs only the coordinate on the targeted strand in considered, and it may be that a Cas9 with a blunt DSB may be preferable even if further from the central position.
* **PAM_* ** - Predicted cut sites and start sites for the indicated PAM motif (suffix of column name) on the Watson (**PAM_Wa_* **) or Crick (**PAM_Cr_* **) strand. Coordinates are set zero indicates cleavage at the bond exactly between the two copies of the duplicated sequences for cuts, and zero inicates the first base in the second copy of the duplication for PAM starts. Negatives numbers are positions 5' of this and positive numbers are positions 3' of this, always on the Watson strand. For Cas9s, if -**INS_LEN** < coord < +**INS_LEN** then the predicted double-stranded break site is somewhere in the duplicated sequences, leaving **INS_LEN** - abs(coord) homologous bases on either side of the cut. For Cpf1s the situation is a bit more complicated due to the staggered cut.
#### Notes on table of variants:
* There are many fewer samples in the TGP (~2,500) than in the gnomAD genomes (~15,000) or exomes (~120,000), so many rare
variants seen in the gnomAD datasets are not seen at all in the TGP --- one advantage of the TGP data is that it
categorizes subjects into narrower populations, so may reveal alleles that have much greater frequency in specfic
populations. See for example ClinVar Allele 20316 --- a duplication in *HPS1* associated with Hermansky-Pudlak
syndrome, and which has much higher frequency in Puerto Ricans than in other TGP populations or in the broader
gnomAD populations; see screengrabs from https://www.ncbi.nlm.nih.gov/variation/tools/1000genomes/?gts=rs281865163 and
http://gnomad.broadinstitute.org/variant/10-100183554-T-TGGGCCTCCCCTGCTGG below:
![ ](TGP_HPS1_screengrab.png){width=650px}
![gnomAD exomes](HPS1_gnomad_exomes.png){width=350px} ![gnomAD genomes](HPS1_gnomad_genomes.png){width=350px}
* There do not appear to be many other variants with the same flavor as that *HPS1* example, though: only two other
duplications in the table below appear in TGP at all, and although those are likewise enriched in specific populations,
both are duplications of length just one.
* Clicking on a column heading will sort by that column, and filters can be applied to individual columns (sliders,
multiple-choice, or text-matching, as appropriate). E.g., one can set the slider for the **INS_LEN** column to
restrict to duplications of length 4 to 20. Click on the circled-x in a filter box to cancel a filter. The main
"Search" box applies to all columns; it is case-sensitive and allows regular expressions in JavaScript syntax ---
e.g. entering **(TCAP|HPS1|HEXA|DOK7)** in the search box would match any of those four strings, and
entering **(M|m)usc** would match Muscle, muscle, Muscular, muscular,...
* Clicking the "CSV" or "Excel" button will save the table as a comma-separated file of an xlsx file,
both of which can be opened in Excel; you may want to change the **CONTEXT** column to a monospace font
such as "Courier", so that when you double-click on an entry the multiple rows from the **CONTEXT** field
can line-up properly as in the screengrab below (which may have used a different PAM-list, so the
details of the tracks may differ):
![ ](excel_context_screengrab.png)
* (The above relies on Excel preferentially wrapping the text on white-spaces; you may need to
fiddle with how far this column is from the right-edge of the Excel window to keep it from introducing
too few or too many line-breaks; you could also try replacing the semicolons by line-breaks in Excel, https://answers.microsoft.com/en-us/msoffice/forum/msoffice_excel-mso_other-mso_2007/how-can-i-replace-commas-with-a-line-feed-in-a/55390740-16a4-48c8-85d8-c00e5ded2ed2, but how this works may depend on your operating system, and with my version of
Excel it also triggers undesireable within-cell word-wrapping)
#### Table 3: selected pathogenic duplications from ClinVar that are observed in gnomAD {#anch3}
```{r summary, echo=F}
show.row.old = which((itab$TANDEM_N==1) &
(itab$SHIFT_SYMM==0) &
grepl("Path", itab$CLNSIG) &
(!is.na(itab$GN_AN) | !is.na(itab$EX_AN)))
show.row.test = which((itab$TANDEM_N>=1) &
(itab$SHIFT_SYMM>=0) &
grepl("Path", itab$CLNSIG) &
(!is.na(itab$GN_AN) | !is.na(itab$EX_AN)))
show.row.2 = which((itab$TANDEM_N == 1) &
(itab$SHIFT_SYMM == 0) &
grepl("Path", itab$CLNSIG) &
(itab$INS_LEN >= 2) &
(itab$INS_LEN <= 40) &
((itab$GN_AC > 0) | (itab$EX_AC > 0)))
show.row = which((itab$TANDEM_N == 1) &
(itab$SHIFT_SYMM == 0) &
grepl("Path", itab$CLNSIG) &
(itab$INS_LEN >= 1) &
(itab$INS_LEN <= 1000) &
((itab$GN_AC > 0) | (itab$EX_AC > 0)))
show.row = show.row[order(nchar(itab$ALT[show.row]))]
drop.col = c("CLNVC", "FLANKSEQ", "EX_THIS_ALT", "GN_THIS_ALT","AF_ESP","AF_EXAC","AF_TGP","IN_CALLSET","MASKED",
"GN_AC_OTH", "GN_AN_OTH", "GN_AF_OTH", "EX_AC_OTH", "EX_AN_OTH", "EX_AF_OTH")
csq.cols = c("IMPACT","SYMBOL","Consequence", "HGVSc", "HGVSp", "DISTANCE", "Existing_variation", "LoF")
res = cbind(itab[show.row, ], csq.tab[show.row, csq.cols])
res$CLNDN = gsub("_", " ", gsub("|", " | ", gsub(";", "; ", res$CLNDN), fixed=T), fixed=T)
res$CLNSIG = gsub("/", " / ", res$CLNSIG, fixed=T)
res$GENEINFO = gsub("|", " | ", res$GENEINFO, fixed=T)
res$CLNOMIM = gsub("|", " | ", res$CLNOMIM, fixed=T)
res$EX_NUM_ALT = pmax(res$EX_NUM_ALT, 0)
res$GN_NUM_ALT = pmax(res$GN_NUM_ALT, 0)
res$Consequence = gsub("&", " & ", res$Consequence, fixed=T)
res$Existing_variation = gsub("&", " & ", res$Existing_variation, fixed=T)
# res$ID[!grepl("rs", res$ID)]="";
# res$AF = sprintf("%0.5f", res$AF)
flank.split = strsplit(res$FLANKSEQ, "\\[|\\]") # UP, REF, DN
# table(nchar(sapply(flank.split, "[[", 1)))
## all 60
# table(nchar(sapply(flank.split, "[[", 2)))
# table(nchar(sapply(flank.split, "[[", 3)) - 2*(nchar(res$ALT)-1))
## all 61
res$CONTEXT = mapply(FUN=function(x,y) paste0(tolower(substr(x[1],
nchar(x[1])-nflank+1+nchar(x[2]),
nchar(x[1]))),
tolower(x[2]),
toupper(substr(y, 2, nchar(y))),
toupper(substr(x[3], 1, nchar(y)-1)),
tolower(substr(x[3], nchar(y), nchar(y) + nflank - 1))),
x=flank.split, y=res$ALT)
context_split = mapply(FUN=function(x,y) paste0(tolower(substr(x[1],
nchar(x[1])-nflank+1+nchar(x[2]),
nchar(x[1]))),
tolower(x[2]), "[",
toupper(substr(y, 2, nchar(y))), "][",
toupper(substr(x[3], 1, nchar(y)-1)), "]",
tolower(substr(x[3], nchar(y), nchar(y) + nflank - 1))),
x=flank.split, y=res$ALT)
ok.DUP.mask = mapply(FUN=function(x,y) paste0(gsub(".", ".", tolower(substr(x[1],
nchar(x[1])-nflank+1+nchar(x[2]),
nchar(x[1])))),
gsub(".", ".", tolower(x[2])),
gsub(".", "1", toupper(substr(y, 2, nchar(y)))),
gsub(".", "2", toupper(substr(x[3], 1, nchar(y)-1))),
gsub(".", ".", tolower(substr(x[3], nchar(y), nchar(y) + nflank -1)))),
x=flank.split, y=res$ALT)
res = res[, !is.element(colnames(res), drop.col)]
## fixed width, allows overlap
okgrep = function(pattern, seq, w, perl=T, ignore.case=T) {
which(sapply(1:nchar(seq), FUN=function(k) grepl(pattern, substr(seq, k, k + w-1), perl=perl, ignore.case=ignore.case)))
}
okmask = function(pattern, seq, w, perl=T, ignore.case=T) {
mask = sapply(1:nchar(seq), FUN=function(k) grepl(pattern, substr(seq, k, k + w-1), perl=perl, ignore.case=ignore.case));
paste(ifelse(mask, ">", "."), collapse="")
}
okmaskrc = function(pattern, seq, w, perl=T, ignore.case=T) {
seq = as.character(reverseComplement(DNAString(seq)))
mask = sapply(1:nchar(seq), FUN=function(k) grepl(pattern, substr(seq, k, k + w-1), perl=perl, ignore.case=ignore.case));
paste(rev(ifelse(mask, "<", ".")), collapse="")
}
# okgrep("AA", "BAAACCCCCCAA", w=2)
CONTEXT.rc = as.character(reverseComplement(DNAStringSet(res$CONTEXT)))
all.PAM.starts = list() ## start of PAM -- context coords
all.PAM.starts.rc = list() ## start of PAM -- context coords
all.PAM.cuts = list() ## cleavage site -- dup coords
all.PAM.cuts.rc = list() ## cleavage site -- dup coords
## TODO: should read offset right from table so +18/-3 can vary
PAM.offset = ifelse(pam.df$Side=="5'", (pam.df$Width-1) + 18 + 1, -3);
### 54321PAM12345678901234567890
### .><..PAM.................><.
### ..a..s.e..................b.
### okay, w = 3 here, so e = (s+w-1), dist from s = dist from e + (w-1);
### for a, no shift; for b, +1 to put on 3' side of cut; opposiye on Cr strand
### .><.................map..><.
### ..b.................e.s...a.
## per seq
# shiftzero = nflank + res$INS_LEN;
res$CUT_Wa = ""
res$CUT_Cr = ""
for (j in 1:nrow(pam.df)) {
PAM = pam.df$Pattern[j];
w = pam.df$Width[j];
PAMreg = pam.df$Expanded[j];
ok.PAM.starts = sapply(res$CONTEXT, FUN=function(x) okgrep(PAMreg,x,w=w))
ok.PAM.starts.rc = mapply(CONTEXT.rc, FUN=function(x) nchar(x) + 1 - okgrep(PAMreg,x,w=w))
## first base of PAM on repective strand
all.PAM.starts[[j]] = ok.PAM.starts;
all.PAM.starts.rc[[j]] = ok.PAM.starts.rc;
myoffset = PAM.offset[j];
# print(myoffset);
### change coords to: 0 = 1st base after insertion. Also shift by offset
ok.PAM.cuts = mapply(x=ok.PAM.starts, y=res$INS_LEN,
FUN=function(x,y) x + myoffset - nflank - y - 1)
ok.PAM.cuts.rc = mapply(x=ok.PAM.starts.rc, y=res$INS_LEN,
FUN=function(x,y) rev(x - myoffset - nflank - y - 1 + 1))
all.PAM.cuts[[j]] = ok.PAM.cuts;
all.PAM.cuts.rc[[j]] = ok.PAM.cuts.rc;
### change coords to: 0=1st base after insertion
ok.PAM.starts.flat = mapply(x=ok.PAM.starts, y=res$INS_LEN,
FUN=function(x,y) paste(c(x-nflank-y-1), collapse=","))
ok.PAM.starts.flat.rc = mapply(x=ok.PAM.starts.rc, y=res$INS_LEN,
FUN=function(x,y) paste(c(rev(x-nflank-y-1)), collapse=","))
### change coords to: 0=1st base after isertion
ok.PAM.cuts.flat = sapply(ok.PAM.cuts, paste, collapse=",")
ok.PAM.cuts.flat.rc = sapply(ok.PAM.cuts.rc, paste, collapse=",")
res[,paste0("PAM_Wa_", PAM)] = paste0("start={", ok.PAM.starts.flat, "}; cut={", ok.PAM.cuts.flat,"}" )
res[,paste0("PAM_Cr_", PAM)] = paste0("start={", ok.PAM.starts.flat.rc, "}; cut={", ok.PAM.cuts.flat.rc, "}")
}
cut.mat = matrix(NA, nrow(res), nrow(pam.df))
for (j in 1:nrow(cut.mat))
for (k in 1:ncol(cut.mat))
if (length(all.PAM.cuts[[k]][[j]]) > 0) {
perm = order(abs(all.PAM.cuts[[k]][[j]]))
cut.mat[j,k] = (all.PAM.cuts[[k]][[j]])[perm[1]]
}
cut.mat.rc = matrix(NA, nrow(res), nrow(pam.df))
for (j in 1:nrow(cut.mat.rc))
for (k in 1:ncol(cut.mat.rc))
if (length(all.PAM.cuts.rc[[k]][[j]]) > 0) {
perm = order(abs(all.PAM.cuts.rc[[k]][[j]]))
cut.mat.rc[j,k] = (all.PAM.cuts.rc[[k]][[j]])[perm[1]]
}
closest.fw.which = apply(cut.mat, 1, FUN=function(x) which.min(abs(x)))
closest.rc.which = apply(cut.mat.rc, 1, FUN=function(x) which.min(abs(x)))
closest.fw = cut.mat[cbind(1:nrow(cut.mat),closest.fw.which)]
closest.rc = cut.mat.rc[cbind(1:nrow(cut.mat.rc),closest.rc.which)]
res$CUT_Wa = paste0(closest.fw, " (", pam.df$Pattern[closest.fw.which], ")")
res$CUT_Cr = paste0(closest.rc, " (", pam.df$Pattern[closest.rc.which], ")")
## for Cas9, -3
ok.PAM.masks = rep("", nrow(res))
ok.PAM.masks.rc = rep("", nrow(res))
# For Cpf1, +19
ok.PAM.masks2 = rep("", nrow(res))
ok.PAM.masks2.rc = rep("", nrow(res))
for (k in 1:nrow(res)) {
mask = rep(".", nchar(res$CONTEXT[k]))
for (j in 1:nrow(pam.df)) {
if (pam.df$Side[j]=="3'") {
offset = PAM.offset[j];
pos = all.PAM.starts[[j]][[k]];
cpos = pos + offset;
pos = cpos[which(cpos > 0 & cpos <= length(mask))]
if (length(pos) > 0) mask[pos] = ifelse(mask[pos]==".", LETTERS[j], tolower(mask[pos]))
}
}
ok.PAM.masks[k] = paste(mask, collapse="");
}
for (k in 1:nrow(res)) {
mask = rep(".", nchar(res$CONTEXT[k]))
for (j in 1:nrow(pam.df)) {
if (pam.df$Side[j]=="3'") {
offset = PAM.offset[j];
pos = all.PAM.starts.rc[[j]][[k]];
cpos = pos - offset + 1; # + 1 on RC
pos = cpos[which(cpos > 0 & cpos <= length(mask))]
if (length(pos) > 0) mask[pos] = ifelse(mask[pos]==".", LETTERS[j], tolower(mask[pos]))
}
}
ok.PAM.masks.rc[k] = paste(mask, collapse="");
}
####
for (k in 1:nrow(res)) {
mask = rep(".", nchar(res$CONTEXT[k]))
for (j in 1:nrow(pam.df)) {
if (pam.df$Side[j]=="5'") {
offset = PAM.offset[j];
pos = all.PAM.starts[[j]][[k]];
cpos = pos + offset; #
pos = cpos[which(cpos > 0 & cpos <= length(mask))]
if (length(pos) > 0) mask[pos] = ifelse(mask[pos]==".", LETTERS[j], tolower(mask[pos]))
}
}
ok.PAM.masks2[k] = paste(mask, collapse="");
}
####
for (k in 1:nrow(res)) {
mask = rep(".", nchar(res$CONTEXT[k]))
for (j in 1:nrow(pam.df)) {
if (pam.df$Side[j]=="5'") {
offset = PAM.offset[j];
pos = all.PAM.starts.rc[[j]][[k]];
cpos = pos - offset + 1; # + 1 on RC
pos = cpos[which(cpos > 0 & cpos <= length(mask))]
if (length(pos) > 0) mask[pos] = ifelse(mask[pos]==".", LETTERS[j], tolower(mask[pos]))
}
}
ok.PAM.masks2.rc[k] = paste(mask, collapse="");
}
ok.PAM.flat = rep("", nrow(res));
for (k in 1:nrow(res)) {
foo = strsplit(toupper(res$CONTEXT[k]), "")[[1]]
m1 = strsplit(ok.PAM.masks[k], "")[[1]]
m2 = strsplit(ok.PAM.masks.rc[k], "")[[1]]
m3 = strsplit(ok.PAM.masks2[k], "")[[1]]
m4 = strsplit(ok.PAM.masks2.rc[k], "")[[1]]
cutsite = which(m1!="." | m2!="." | m3!="." | m4!=".")
if (length(cutsite)>0) foo[cutsite] = tolower(foo[cutsite])
ok.PAM.flat[k] = paste(foo, collapse="")
}
ok.PAM.flat2 = ok.PAM.flat;
for (k in 1:nrow(res)) {
flat = ok.PAM.flat[k]
ok.PAM.flat2[k] = paste0(substr(flat, 1, nflank),
"[",
substr(flat, nflank+1, nflank+res$INS_LEN[k]),
"|",
substr(flat, nflank+res$INS_LEN[k]+1,nflank+res$INS_LEN[k]+res$INS_LEN[k]),
"]",
substr(flat, nchar(flat)-nflank+1, nchar(flat)))
}
## TODO: Use general-purpose CRISPR finder instead? e.g
# http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0181943
## make hyperlinks?
# OMIM ID: https://www.omim.org/entry/209900
# ClinVar alleleID: http://www.ncbi.nlm.nih.gov/clinvar?term=398446[alleleid]
# gnomAD: http://gnomad.broadinstitute.org/variant/1-63339963-C-CA
chompMe = function(string, n) {
s1 = substr(string, n+1, nchar(string)); ## trim n from start
s2 = substr(s1, 1, nchar(s1)-n); ## trim n from end
s2
}
chompN = nflank - dflank;
simple.CONTEXT = res$CONTEXT;
## "
## mask=strsplit("..........11112222..........","")[[1]]
## seq="abcdefghijWXYXWXYZabcdefghij"
### need to apply elementwise!
seq2HTML = function(seqvec, maskvec) {
hseqvec = rep("", length(seqvec));
for (j in seq_along(seqvec)) {
seq = seqvec[j]
mask = strsplit(maskvec[j],"")[[1]] ### .... 1 2 ....
## ugly!
mask = c("white","yellow", "cyan","green")[as.numeric(sub(".", "0", mask, fixed=T)) + 1]; ## 1,2,3
crle = rle(mask);
ends = cumsum(crle$lengths);
starts = c(1, ends+1)[seq_along(crle$length)]; ## drop last one
hseq = "";
for (k in seq_along(crle$length)) {
hseq = paste0(hseq, "", substr(seq, starts[k], ends[k]), "")
}
## reset to white?
hseqvec[j] = hseq
}
hseqvec;
}
# seq2HTML(seq, mask)
kable.CONTEXT = paste(paste0("`", chompMe(ok.DUP.mask,chompN), " DUP_SEGMENT_BY_NUM", "`"),
paste0("`", chompMe(simple.CONTEXT,chompN), " SEQN_INCLUDING_DUP", "`"),
paste0("`", chompMe(ok.PAM.masks,chompN), " Cas9_CLEAVE_POS_Wa", "`"),
paste0("`", chompMe(ok.PAM.masks.rc,chompN), " Cas9_CLEAVE_POS_Cr", "`"),
paste0("`", chompMe(ok.PAM.masks2,chompN), " Cpf1_CLEAVE_POS_Wa", "`"),
paste0("`", chompMe(ok.PAM.masks2.rc,chompN), " Cpf1_CLEAVE_POS_Cr", "`"), sep=" ")
excel.CONTEXT = paste(paste0("", chompMe(simple.CONTEXT,chompN), "=SEQ_DUP;", ""),
paste0("", chompMe(ok.DUP.mask,chompN), "=DUP_NUM;", ""),
paste0("", chompMe(ok.PAM.masks,chompN), "=Cas9_Wa;", ""),
paste0("", chompMe(ok.PAM.masks.rc,chompN), "=Cas9_Cr;", ""),
paste0("", chompMe(ok.PAM.masks2,chompN), "=Cpf1_Wa;", ""),
paste0("", chompMe(ok.PAM.masks2.rc,chompN), "=Cpf1_Cr;", ""), sep=" ")
colorMask = chompMe(ok.DUP.mask,chompN);
dt.CONTEXT = paste(paste0("", seq2HTML(chompMe(simple.CONTEXT,chompN), colorMask), " SEQ_DUP;", ""),
paste0("", seq2HTML(chompMe(ok.DUP.mask,chompN), colorMask), " DUP_NUM;", ""),
paste0("", seq2HTML(chompMe(ok.PAM.masks,chompN), colorMask), " Cas9_Wa;", ""),
paste0("", seq2HTML(chompMe(ok.PAM.masks.rc,chompN), colorMask), " Cas9_Cr;", ""),
paste0("", seq2HTML(chompMe(ok.PAM.masks2,chompN), colorMask), " Cpf1_Wa;", ""),
paste0("", seq2HTML(chompMe(ok.PAM.masks2.rc,chompN), colorMask), " Cpf1_Cr;", ""),
# paste0("", seq2HTML(chompMe(ok.PAM.flat,chompN), colorMask), " AllWaCr;", ""),
# paste0("", seq2HTML(chompMe(ok.PAM.flat2,chompN), colorMask), " WaCr;", ""),
sep="\n")
dt.CONTEXT = paste0("", dt.CONTEXT, "
")
res$CONTEXT = kable.CONTEXT
# this works, but is that just because longest seq isn't > 2x as long as shortest?
## makes png showing cleavage sites.
# seq="abddefghijABCDABCDabcdefghij"; mask="..........11112222.........."; trim=5; fname="temp_01.png"
pam2PNG = function(seq, mask, cut1, cut1rc, cut2, cut2rc, trim=chompN, fname="temp_01.png"){
nch = nchar(seq) - 2*trim;
cmask = chompMe(mask, trim);
mycols = c(as.numeric(sub(".", "0", strsplit(cmask, "")[[1]], fixed=T))+1)
mycols = mycols + (mycols>2) ## skip green
cutAt = c(1,5,12)+0.5;
scut1 = strsplit(chompMe(cut1, trim),"")[[1]];
scut1rc = strsplit(chompMe(cut1rc, trim),"")[[1]];
scut2 = strsplit(chompMe(cut2, trim),"")[[1]];
scut2rc = strsplit(chompMe(cut2rc, trim),"")[[1]];
cutAt1 = which(scut1!=".")
cutAt1rc = which(scut1rc!=".")
cutAt2 = which(scut2!=".")
cutAt2rc = which(scut2rc!=".")
png(fname, width=40*nch+40, height=230, pointsize=72)
par(mfrow=c(1,1), mar=c(0,0,0,0))
par(family="mono")
plot(1, type="n", xlim=c(0, nch+1), ylim=c(-4.3,2), xaxs="i", yaxs="i", axes=F, ylab=F, xlab=F)
text(seq(1,nch), rep(0,nch), strsplit(chompMe(seq, trim),"")[[1]], pos=3, col=mycols, offset=0.1)
par(family="sans")
if (length(cutAt2rc > 0)) {
arrows(x0=cutAt2rc-0.5, y0=-3.8, x1=cutAt2rc-0.5, y1=0, code=2, lwd=4, length=0.2, col="yellow")
points(x=cutAt2rc-0.5, y=rep(-3.8, length(cutAt2rc)), col="yellow", pch=16, cex=1.0)
text(x=cutAt2rc-0.5, y=rep(-3.8, length(cutAt2rc-0.5)), scut2rc[cutAt2rc], cex=0.5)
}
if (length(cutAt2> 0)) {
arrows(x0=cutAt2-0.5, y0=-3.0, x1=cutAt2-0.5, y1=0, code=2, lwd=4, length=0.2, col="cyan")
points(x=cutAt2-0.5, y=rep(-3.0, length(cutAt2)), col="cyan", pch=16, cex=1.0)
text(x=cutAt2-0.5, y=rep(-3.0, length(cutAt2-0.5)), scut2[cutAt2], cex=0.5)
}
if (length(cutAt1rc > 0)) {
arrows(x0=cutAt1rc-0.5, y0=-2.2, x1=cutAt1rc-0.5, y1=0, code=2, lwd=4, length=0.2, col="yellow")
points(x=cutAt1rc-0.5, y=rep(-2.2, length(cutAt1rc)), col="yellow", pch=15, cex=0.8)
text(x=cutAt1rc-0.5, y=rep(-2.2, length(cutAt1rc-0.5)), scut1rc[cutAt1rc], cex=0.5)
}
if (length(cutAt1 > 0)) {
arrows(x0=cutAt1-0.5, y0=-1.4, x1=cutAt1-0.5, y1=0, code=2, lwd=4, length=0.2, col="cyan")
points(x=cutAt1-0.5, y=rep(-1.4, length(cutAt1)), col="cyan", pch=15, cex=0.8)
text(x=cutAt1-0.5, y=rep(-1.4, length(cutAt1-0.5)), scut1[cutAt1], cex=0.5)
}
dev.off()
}
##########################
dres = res;
dres$CONTEXT = dt.CONTEXT
# From https://stackoverflow.com/questions/47804593/adding-an-image-to-a-datatable-in-r
img_uri = function(x) { sprintf('', knitr::image_uri(x)) }
## add width etc attributes?
make_all_images = F;
make_test_images = F;
if (make_all_images) {
for (k in nrow(res)) {
fname = paste0("temp_pam",".png"); # can over-write after embedding
pam2PNG(seq=simple.CONTEXT[k], mask=ok.DUP.mask[k],
cut1 = ok.PAM.masks[k],
cut1rc =ok.PAM.masks.rc[k],
cut2 = ok.PAM.masks2[k],
cut2rc = ok.PAM.masks2.rc[k],
trim=chompN, fname=fname)
dres$CONTEXT[k] = paste0(img_uri(fname),"
",dres$CONTEXT[k])
}
}
if (make_test_images) {
test_rows = which(is.element(res$SYMBOL, c("HPS1", "TCAP", "HEXA", "DOK7")))
for (k in test_rows) {
fname = paste0("temp_pam_",k,".png"); # can over-write after embedding
pam2PNG(seq=simple.CONTEXT[k], mask=ok.DUP.mask[k],
cut1 = ok.PAM.masks[k],
cut1rc =ok.PAM.masks.rc[k],
cut2 = ok.PAM.masks2[k],
cut2rc = ok.PAM.masks2.rc[k],
trim=chompN, fname=fname)
# dres$CONTEXT[k] = paste0(img_uri(fname),"
",dres$CONTEXT[k])
}
}
AFS.cols = which(grepl("_AF_POPMAX", colnames(dres)) | colnames(dres)=="EX_AF" | colnames(dres)=="GN_AF")
## don't want integer(0) when all are NA/NaN
ok.which.max = function(x) {if (any(!is.na(x))) {return(which.max(x))} else {return(NA)};}
wmax = apply(dres[,AFS.cols], 1, ok.which.max)
dres$MAX_AF_ANY = 0
dres$MAX_AF_WHICH = (colnames(dres)[AFS.cols])[wmax]
for (k in 1:nrow(dres)) {
if (!is.na(wmax[k])) dres$MAX_AF_ANY[k] = dres[k,AFS.cols[wmax[k]]];
if (!is.na(dres$MAX_AF_WHICH[k])) {
if (grepl("POPMAX", dres$MAX_AF_WHICH[k])) {
# ugly!
dres$MAX_AF_WHICH[k] = sub("POPMAX", dres[k, sub("AF_", "", colnames(dres)[AFS.cols[wmax[k]]])], dres$MAX_AF_WHICH[k])
} else if (dres$MAX_AF_WHICH[k]=="EX_AF") {
dres$MAX_AF_WHICH[k]="EX_ALL"
} else if (dres$MAX_AF_WHICH[k]=="GN_AF") {
dres$MAX_AF_WHICH[k]="GN_ALL"
}
}
}
dres$MAX_AF_WHICH = sub("_AF", "", dres$MAX_AF_WHICH)
dres = dres[,c(1:7, ncol(dres)-1, ncol(dres), seq(8,ncol(dres)-2))]
slim = dres[,c("GNOMAD_ID", "INS_LEN", "ALLELEID", "GENEINFO", "MAX_AF_ANY",
"MAX_AF_WHICH", "GN_AF", "EX_AF", "CLNDN", "CUT_Wa", "CUT_Cr", "CLNMEDGEN")]
slim$MAX_AF = sprintf("%0.7f", pmax(slim$GN_AF, slim$EX_AF, na.rm=T))
slim$CUT_SITES = chompMe(ok.PAM.flat2,nflank-5)
colnames(slim)=sub("GNOMAD_ID", "VARIANT", colnames(slim))
colnames(slim)=sub("MAX_AF_ANY", "MAX_AF", colnames(slim))
colnames(slim)=sub("MAX_AF_WHICH", "MAX_POP", colnames(slim))
slim$CLNDN = sub(" \\|.*$", "", slim$CLNDN)
slim$CLNDN = gsub("; ", ",", slim$CLNDN) ## changes for parsing?
# slim$CLNOMIM = sub(" \\|.*$", "", slim$CLNOMIM)
slim$GENEINFO = sub(" .*$", "", slim$GENEINFO)
# grep(",[^ ]", slim$CLNDN) # none, okay
slim2 = slim[,c("VARIANT", "INS_LEN", "ALLELEID", "GENEINFO", "CLNDN","CLNMEDGEN","MAX_AF", "CUT_SITES")];
# gets changed durring subsetting!
slim2$MAX_AF = sprintf("%0.7f", pmax(slim$GN_AF, slim$EX_AF, na.rm=T))
find_acronym = F
if (find_acronym) {
MGCONSO = read.csv("MGCONSO.csv.gz", colClasses="character")
# table(MGCONSO$SAB)
# chunk = MGCONSO[MGCONSO$CUI=="C3150411",]
# MGCHANGE = read.table("MGCONSO.csv.gz", colClasses="character")
MGCONSO[which(MGCONSO$STR =="Persistent hyperinsulinemic hypoglycemia of infancy"),]
MGCONSO[MGCONSO$CUI=="C2931832",] ## why not this?
# GTR genetic testing registry
slim2$MGACR = ""
for (k in 1:nrow(slim)) {
mgids = strsplit(slim$CLNMEDGEN[k], "\\|")[[1]]
# print(mgids)
filled = F;
for (j in seq_along(mgids)) {
if (filled) next
dex1 = which(MGCONSO$CUI==mgids[j])
if (length(dex1)==0) cat("??? -->", mgids[j], " --> ", slim$CLNMEDGEN[k], "-->", slim$CLNDN[k], "\n")
dex2 = which(MGCONSO$CUI==mgids[j] & MGCONSO$SAB=="GTR")
if (length(dex2) > 0) {
chunk = MGCONSO[dex2,]
PTdex = which(chunk$TTY=="PT")
ACRdex = which(chunk$TTY=="ACR")
#cat(slim$CLNDN[k], "-->", chunk$STR[ACRdex], " --> ", chunk$STR[PTdex], "\n")
if (is.element(slim$CLNDN[k], chunk$STR) & length(ACRdex)>0) {
cat(chunk$STR[ACRdex[1]], " --> ", chunk$STR[PTdex[1]], "\n")
slim2$MGACR[k] = chunk$STR[ACRdex[1]];
filled = T;
}
}
}
}
slim2$MGACR[which(slim2$CLNMEDGEN=="C1257959")]="PPHI"
slim2[slim2$MGACR=="",-c(1:4,8)]
MGCONSO[MGCONSO$CUI=="CN001491",]
MGCONSO[MGCONSO$CUI=="C0949658",]
MGCONSO[MGCONSO$CUI=="CN517202",]
slim2$CLNMEDGEN = NULL
slim2 = slim2[,c(1,2,3,4,5,8,6,7)]
}
slim2=cbind("Sequence_ID"=paste0("B", 1:nrow(slim2)), slim2)
write.table(slim2, "slim_table_of_duplications_all.xls", sep="\t", row.names=F, quote=F)
# Serial number Chromosome for variant POS INS_LEN ALLELEID GENEINFO BOTH_AF Clinical significance
# Clinical designation LoF Context PAM_FW_NGG PAM_RC_NGG PAM_FW_NGCG PAM_RC_NGCG PAM_FW_NGAG PAM_RC_NGAG
# PAM_FW_NGAN|NGNG PAM_RC_NGAN|NGNG PAM_FW_NNGRRT PAM_RC_NNGRRT PAM_FW_NNNNGATT PAM_RC_NNNNGATT
# PAM_FW_NNNNRYAC PAM_RC_NNNNRYAC PAM_FW_TTTV PAM_RC_TTTV PAM_FW_TYCV PAM_RC_TYCV PAM_FW_TATV PAM_RC_TATV
# PAM_FW_TTV PAM_RC_TTV
AF.cols = grep("_AF$|_AF_ANY|_AF_POPMAX", colnames(dres))
dres[, AF.cols] = round(dres[, AF.cols], 7)
dres[,"SEQ"] = factor(dres[,"SEQ"], levels=c(1:22,"X","Y","MY"))
dres$ALLELEID = paste0("", res$ALLELEID, "")
dres$GNOMAD_ID = paste0("", res$GNOMAD_ID, "")
## can't filter TGP_AN column which has values only NA and 5008, but can sort it.
for (k in 4:ncol(dres)) { ## skip REF and ALT
if ((class(dres[,k])=="character") && length(unique(dres[,k])) < 11) dres[,k] = as.factor(dres[,k])
}
## use callback to initialize filters? do some color-coding?
## datatable(iris2, extensions = 'ColReorder', options = list(colReorder = TRUE))
DT::datatable(dres, rownames=F, escape=!is.element(colnames(dres),c("ALLELEID","CONTEXT","GNOMAD_ID","IMAGE")),
filter='top',
#class = 'cell-border stripe',
extensions = c('Buttons','FixedHeader'),
options=list(pageLength=20,
lengthMenu = c(20, 50, 200, 500),
fixedHeader = TRUE,
dom='Bfrtip',
buttons=c('csv', 'excel'),
# fixedColumns=list(leftColumns=5),
search=list(regex=TRUE,
caseInsensitive=FALSE)))
make_supp_table = F;
if (make_supp_table) {
# Description of vcf fields from ClinVar:
# https://www.ncbi.nlm.nih.gov/snp/docs/products/vcf/redesign/
suppDex = which(dres$INS_LEN > 1);
dres2 = dres[suppDex, is.element(colnames(dres),
c("SEQ","POS","CLNSIG","CLNDN","INS_LEN","ALLELEID","GENEINFO","BOTH_AF","LoF", "CONTEXT"))
| grepl("^PAM", colnames(dres))]
dres2 = cbind("Serial number"=1:nrow(dres2), dres2);
colnames(dres2)=sub("^SEQ$", "Chromosome", colnames(dres2))
colnames(dres2)=sub("^POS$", "Position", colnames(dres2))
colnames(dres2)=sub("^CLNSIG$", "Clinical Significance", colnames(dres2))
colnames(dres2)=sub("^CLNDN$", "Disease name", colnames(dres2))
# search="(TCAP|HPS1|HEXA|DOK7)")))
dres2$CONTEXT = chompMe(context_split[suppDex],chompN) ## add bar in between them?
# dres2$CONTEXT = gsub("[", "", gsub("]","", dres2$CONTEXT, fixed=T), fixed=T);
## change pams to just have splits
PAM.col = grep("^PAM", colnames(dres2))
for (pc in PAM.col) {
dres2[,pc] = sub("\\}", "", sub("^.*cut=\\{","", dres2[,pc]))
}
for (pc in PAM.col) {
for (j in 1:nrow(dres2)) {
pos = as.numeric(strsplit(dres2[j,pc],",")[[1]])
if (length(pos) > 0) pos = pos[abs(pos) < dres2$INS_LEN[j]] ## +-1 on one end? Maybe not, since 0 is right inbetween
dres2[j,pc] = paste0("{",paste(pos, collapse=","),"}") ## comma makes excel behave oddly?
}
}
DT::datatable(dres2, rownames=F, escape=!is.element(colnames(dres2),c("ALLELEID","CONTEXT","GNOMAD_ID","IMAGE")),
filter='top',
#class = 'cell-border stripe',
extensions = c('Buttons','FixedHeader'),
options=list(pageLength=20,
lengthMenu = c(20, 50, 200, 500),
fixedHeader = TRUE,
dom='Bfrtip',
buttons=c('csv', 'excel'),
# fixedColumns=list(leftColumns=5),
search=list(regex=TRUE,
caseInsensitive=FALSE)))
}
wres = res;
## change CONTEXT columns
wres$CONTEXT = excel.CONTEXT
if (F) {
tab1 = read.table("~/Dropbox/microduplications/consensed_table_of_duplications_v2.txt",
sep="\t", header=T, quote="", stringsAsFactors=F)
tab2 = read.table("~/Dropbox/microduplications/slim_table_of_duplications_AD.xls",
sep="\t", header=T, quote="",
stringsAsFactors=F)
tab3 = read.table("~/Dropbox/microduplications/slim_table_of_duplications_ALL.xls",
sep="\t", header=T, quote="",
stringsAsFactors=F)
length(unique(c(tab3$CLNDN, gsub("\"", "", tab1$CLNDN))))
length(unique(tab1$CLNDN))
length(unique(tab2$CLNDN))
length(unique(tab3$CLNDN))
length(unique(c(tab3$CLNDN, tab2$CLNDN)))
}
```
### What are we missing out on by only looking at the "coding" regions (exome_calling_regions.v1.interval_list) in the gnomAD genomes?
A lot of duplications, no doubt --- perhaps around 98% of them --- but it does not appear that any of these are
annotated as "Pathogenic" in ClinVar. Certainly there are many variants listed in ClinVar that are not observed
in either the gnomAD genomes or exomes, so are not accounted for in the table above, and this
includes `r length(check.me.in.callset) + length(check.me.off.callset)` duplications that satisfy all the
additional conditions for being in column **dup2iP** above. But `r length(check.me.in.callset)` of these are
in these "coding" intervals, so if the variants had been observed at all in gnomAD they would have been reported
in these vcfs. For the `r length(check.me.off.callset)` that are _not_ within these intervals, one can check
for them in the full (not-just-coding-intervals) gnomAD and TGP vcfs --- they are not observed there either.
These variants are listed below; they are mainly variants in UTRs or in intronic regions >50 bases from the
nearest exons (and hence not in the "coding" intervals list).
#### Table 4: selected pathogenic duplications from ClinVar that are outside of gnomAD "coding" regions {#anch4}
```{r missing, echo=F}
slim.cols = c(1:7,12:13,18)
mytab = itab[check.me.off.callset, slim.cols];
mytab$CLNDN = gsub("|", " | ", gsub(";",",",gsub("_", " ", mytab$CLNDN)), fixed=T)
mytab$ALT = paste0(substr(mytab$ALT, 1, 20), ifelse(nchar(mytab$ALT) > 20, "...", ""))
kable(mytab, row.names=F)
```
### What other pathogenic duplicates in ClinVar might we be missing when looking at gnomAD/TGP for allele frequencies?
It wouldn't be surprising to miss out on variants that are extremely rare in general, or even not-terribly-rare
variants that are concentrated in populations without many samples: with only ~100 subjects per population in
the TGP data one would expect to miss out on ~13% of alleles with frequency 0.01 in these populations. And a
few other possibilities:
* Subjects known to have severe pediatric disease were not included in the gnomAD dataset, so variants that
cause these diseases may be under-represented, in particular those with dominant inheritance.
* About 8% of the genome was masked during the gnomAD variant calling (e.g. some repetitive sequence), so any
ClinVar variants the fall in these regions will not be reported in gnomAD. But it appears that only one of
the ~13000 insertions from ClinVar falls in one of these masked region --- a benign variant in SHOX1 which
is masked since it's in a PAR on the Y chromosome.
* gnomAD doesn't report variants on the Y chromosome at all, whether in masked regions or not. But the
only "Pathogenic" duplication on the Y chromosome in ClinVar is a single-base insertion, Y:2655380:C/CT
in the gene SRY: https://www.ncbi.nlm.nih.gov/clinvar/variation/470195/
* The longest insertion reported in the gnomAD coding regions has length `r max(which(countGE$insertions > 0))`,
and there are `r sum(countGE$insertions[-c(1:99)])` insertions of length at least 100, but it's possible that
the detection sensitivity may decline for longer insertions. Below is a table of the `r length(check.me.long)`
"Pathogenic"" duplications with length at least 100 from ClinVar that satisfy the conditions of the
column **dup2iP** above, none of which are observed in gnomAD.
#### Table 5: selected pathogenic duplications of length at least 100 from ClinVar {#anch5}
```{r long, echo=F}
mytab = itab[check.me.long, slim.cols];
mytab$CLNDN = gsub("|", " | ", gsub(";",",",gsub("_", " ", mytab$CLNDN)), fixed=T)
mytab$ALT = paste0(substr(mytab$ALT, 1, 20), ifelse(nchar(mytab$ALT) > 20, "...", ""))
kable(mytab, row.names=F)
```
Below is a table of "simple" duplications of length 2-40 that are annotated in ClinVar as "Pathogenic" or "Pathogenic/Likely pathogenic", and are associated in ClinVar with a disease annotated as autosomal dominant (AD) in OMIM, but which do not appear in gnomAD:
#### Table 6: selected pathogenic duplications from ClinVar that are not observed in gnomAD {#anch6}
```{r no_af, echo=F}
## TODO: get moi from medgen?
check.me.no.af = which((itab$TANDEM_N == 1) &
(itab$SHIFT_SYMM == 0) &
grepl("Path", itab$CLNSIG) &
# (itab$CLNDN!="not_specified") & ## not necessary for AD?
# (itab$CLNDN!="not_provided") & ## not necessary for AD?
grepl("^AD$", itab$OMIM_MOI) &
(itab$INS_LEN >= 2) &
(itab$INS_LEN <= 40) &
(is.na(itab$GN_AC) & is.na(itab$EX_AC)))
## 255: between 2 and 40
## 265 with no max len, 770 ignoring AD
check.me.no.af = check.me.no.af[order(nchar(itab$ALT[check.me.no.af]))]
mytab = itab[check.me.no.af, slim.cols];
mytab$CLNDN = gsub("|", " | ", gsub(";",",",gsub("_", " ", mytab$CLNDN)), fixed=T)
mytab$ALLELEID = paste0("", mytab$ALLELEID, "")
DT::datatable(mytab, rownames=F, escape=!is.element(colnames(mytab),c("ALLELEID","CONTEXT","GNOMAD_ID","IMAGE")),
filter='top',
#class = 'cell-border stripe',
extensions = c('Buttons','FixedHeader'),
options=list(pageLength=20,
lengthMenu = c(20, 50, 200, 500),
fixedHeader = TRUE,
dom='Bfrtip',
buttons=c('csv', 'excel'),
# fixedColumns=list(leftColumns=5),
search=list(regex=TRUE,
caseInsensitive=FALSE)))
```
```{r session_info, echo=F}
writeLines(text=capture.output(sessionInfo()), con=paste0("session_info", format(Sys.time(), "%b_%d_%Y_%H_%M_%S"), ".txt"))
```
### End