The SomaticSignatures package was published in 2015 in Bioinformatics, a journal specializing in bioinformatics. It is designed to analyze tumor single-nucleotide variant (SNV) data to help identify the mechanisms behind tumor occurrence, development, and evolution. This article introduces how to analyze SNV data to obtain the characteristic mutational signatures of tumors.
Data Preparation
The required input is SNV data with the fields “Sample”, “chr”, “start”, “ref”, and “alt”, which correspond to the sample, chromosome, variant position, reference base, and alternate base, respectively. I processed the downloaded TCGA data with Python.
Raw data:
import pandas as pd
df = pd.read_csv("TCGA.CRC.mutect.maf.csv")
df.head()
"""
Hugo_Symbol Entrez_Gene_Id ... MC3_Overlap GDC_Validation_Status
0 UBE2J2 118424 ... True Unknown
1 RPL22 6146 ... True Unknown
2 TNFRSF9 3604 ... True Unknown
3 EXOSC10 5394 ... False Unknown
4 PTCHD2 57540 ... True Unknown
"""
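# First keep only the SNP (single-nucleotide) records
df = df[df.Variant_Type == 'SNP']
# Select the sample, chromosome, position, and allele columns by position;
# .copy() avoids pandas' SettingWithCopyWarning on the assignments below
newdf = df.iloc[:, [33, 4, 5, 11, 12]].copy()
newdf.columns = ["Sample", "chr", "start", "ref", "alt"]
# Keep only the first base of ref
newdf["ref"] = newdf["ref"].str[:1]
# Remove rows where ref and alt are the same to avoid errors later
newdf = newdf[newdf.ref != newdf.alt]
# Save data for R analysis
newdf.to_csv("TCGA.CRC.mutect.maf.filtered.csv", index=False)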
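Before moving to R, it can help to sanity-check the filtered table. The following is a minimal sketch using only the Python standard library. It assumes the five-column layout described above (Sample, chr, start, ref, alt); the helper name check_snv_rows and the toy rows are hypothetical and only serve as an illustration.

```python
import csv
import io

VALID_BASES = set("ACGT")

def check_snv_rows(handle):
    """Split rows of a Sample/chr/start/ref/alt CSV into usable rows and problems."""
    good, problems = [], []
    # The header is line 1, so the first data row is line 2.
    for line_no, row in enumerate(csv.DictReader(handle), start=2):
        ref, alt = row["ref"], row["alt"]
        if ref not in VALID_BASES or alt not in VALID_BASES:
            problems.append((line_no, "ref/alt is not a single A/C/G/T base"))
        elif ref == alt:
            problems.append((line_no, "ref equals alt"))
        elif not row["start"].isdigit():
            problems.append((line_no, "start is not a non-negative integer"))
        else:
            good.append(row)
    return good, problems

# Toy input standing in for the filtered CSV (all values are made up).
toy_csv = io.StringIO(
    "Sample,chr,start,ref,alt\n"
    "S1,chr1,12345,C,T\n"
    "S1,chr2,67890,G,G\n"
    "S2,chrX,555,A,-\n"
)
good, problems = check_snv_rows(toy_csv)
print(len(good), "usable rows;", problems)
```

For the real file, the same function could be called on a handle opened with open("TCGA.CRC.mutect.maf.filtered.csv", newline="").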
Signature Selection
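We use R for the analysis because the SomaticSignatures package is required. First, install the necessary packages. SomaticSignatures, SomaticCancerAlterations, and the BSgenome data package are distributed through Bioconductor, so they are installed with BiocManager rather than plain install.packages:
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install(c("SomaticSignatures","SomaticCancerAlterations","BSgenome.Hsapiens.UCSC.hg38","data.table","ggplot2"))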
Import packages:
suppressPackageStartupMessages(library(SomaticSignatures))
suppressPackageStartupMessages(library(SomaticCancerAlterations))
suppressPackageStartupMessages(library(BSgenome.Hsapiens.UCSC.hg38))
suppressPackageStartupMessages(library(data.table))
suppressPackageStartupMessages(library(ggplot2))
Read data and convert it into the format needed for mutationContext:
df = fread('TCGA.CRC.mutect.maf.filtered.csv', data.table = FALSE)
# Columns: "Sample", "chr", "start", "ref", "alt"
alls=as.character(unique(df$Sample))
df$study=df$Sample
sca_vr = VRanges(
  seqnames = df$chr,
  # An SNV covers a single base, so start and end are the same position
  ranges = IRanges(start = df$start, end = df$start),
  ref = df$ref,
  alt = df$alt,
  sampleNames = as.character(df$Sample),
  study = as.character(df$study))
Run mutationContext, plot signatures, and check variance explained:
sca_motifs = mutationContext(sca_vr, BSgenome.Hsapiens.UCSC.hg38)
head(sca_motifs)
# For each sample, calculate the proportion distribution of 96 mutation possibilities
escc_sca_mm = motifMatrix(sca_motifs, group = "study", normalize = TRUE)
dim( escc_sca_mm )
table(colSums(escc_sca_mm))
head(escc_sca_mm[,1:4])
n_sigs = 5:15
gof_nmf = assessNumberSignatures(escc_sca_mm , n_sigs, nReplicates = 5)
save(gof_nmf,file = 'gof_nmf.Rdata')
load(file = 'gof_nmf.Rdata')
# This assessNumberSignatures step is very time-consuming.
pdf("plotNumberSignatures.pdf",width=18, height=7)
plotNumberSignatures(gof_nmf)
dev.off()
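The "96 mutation possibilities" mentioned in the code comments come from 6 substitution types (the reference base is always reported as a pyrimidine, C or T) multiplied by 4 possible 5' neighbours and 4 possible 3' neighbours. The sketch below illustrates that counting in Python and is not the package's own implementation. The label format ("CA G.T" for C>A flanked by G and T) is an assumption modelled on how motif rows are usually written, and the function names and toy variants are hypothetical.

```python
from collections import Counter
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def all_motifs():
    """Enumerate the 96 substitution-plus-context labels."""
    motifs = []
    for ref in "CT":
        for alt in "ACGT":
            if alt == ref:
                continue
            for five, three in product("ACGT", repeat=2):
                motifs.append(f"{ref}{alt} {five}.{three}")
    return motifs

def snv_motif(trinucleotide, alt):
    """Label one SNV given the reference trinucleotide centred on it."""
    five, ref, three = trinucleotide
    if ref in "AG":
        # Purine reference: report the event on the opposite strand,
        # i.e. reverse-complement the context and complement the alleles.
        five, ref, three, alt = (COMPLEMENT[three], COMPLEMENT[ref],
                                 COMPLEMENT[five], COMPLEMENT[alt])
    return f"{ref}{alt} {five}.{three}"

def motif_frequencies(snvs):
    """Proportion of each of the 96 motifs among one sample's SNVs."""
    counts = Counter(snv_motif(tri, alt) for tri, alt in snvs)
    total = sum(counts.values())
    return {motif: counts[motif] / total for motif in all_motifs()}

# Toy variants for one sample: (reference trinucleotide, alternate base).
toy_snvs = [("AGC", "T"), ("TCG", "T"), ("TCG", "T"), ("ACA", "G")]
freqs = motif_frequencies(toy_snvs)
print(len(freqs), freqs["CT T.G"])
```

Each sample's proportions sum to 1, which matches what the table(colSums(...)) check in the R code is looking at for the normalized matrix.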
Results show:
When the number of signatures reaches 13, it explains about 99% of the variance.
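To make "variance explained" concrete, here is a small sketch assuming the common definition 1 - RSS / sum(V^2), where V is the motif matrix and the fitted matrix is the product of the signature matrix W and the contribution matrix H. Whether the package computes it in exactly this way is an assumption here. The matrices are toy values, the function names are hypothetical, and the rank-1 case simply drops one signature instead of refitting.

```python
def matmul(W, H):
    """Multiply two matrices given as lists of rows."""
    return [[sum(w * h for w, h in zip(row, col)) for col in zip(*H)] for row in W]

def explained_variance(V, W, H):
    """1 - RSS / sum(V^2) for the reconstruction W x H of V."""
    fitted = matmul(W, H)
    rss = sum((v - f) ** 2
              for v_row, f_row in zip(V, fitted)
              for v, f in zip(v_row, f_row))
    total = sum(v ** 2 for v_row in V for v in v_row)
    return 1.0 - rss / total

def smallest_k_reaching(evar_by_k, threshold):
    """Smallest number of signatures whose explained variance meets the threshold."""
    for k in sorted(evar_by_k):
        if evar_by_k[k] >= threshold:
            return k
    return None

# Toy data: V is built exactly from 2 signatures.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 motifs x 2 signatures
H = [[0.2, 0.8], [0.5, 0.5]]              # 2 signatures x 2 samples
V = matmul(W, H)

evar_by_k = {
    1: explained_variance(V, [[row[0]] for row in W], [H[0]]),  # first signature only
    2: explained_variance(V, W, H),                             # exact reconstruction
}
print(evar_by_k)
print(smallest_k_reaching(evar_by_k, 0.99))
```

The same selection rule, taking the smallest number of signatures that passes a chosen threshold, is what reading the plotNumberSignatures output amounts to.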
Signature Plotting
Finally, we extract somatic mutation signatures based on the previous selection:
sigs_nmf = identifySignatures(escc_sca_mm, 11, nmfDecomposition)
save(escc_sca_mm, sigs_nmf, file = 'escc_denovo_results.Rdata')
load(file = 'escc_denovo_results.Rdata')
str(sigs_nmf)
library(ggplot2)
pdf("maf.pdf",width=18, height=7)
plotSignatureMap(sigs_nmf) + ggtitle("Somatic Signatures: NMF - Heatmap")
plotSignatures(sigs_nmf, normalize = TRUE) +
  ggtitle("Somatic Signatures: NMF - Barchart") +
  facet_grid(signature ~ alteration, scales = "free_y")
dev.off()
Final results:
Summary
This article demonstrates how to derive SNV signatures de novo from tumor data (either your own patient data or public TCGA data), rather than relying on the 30 tumor mutation signatures from the COSMIC database. This can help specialists in specific tumor types find more detailed and meaningful results.