Using SomaticSignatures to Identify Mutation Signatures from Mutation Data

The SomaticSignatures package was published in 2015 in the journal Bioinformatics. It is designed to analyze tumor single-nucleotide variant (SNV) data and extract mutational signatures, which can shed light on the mechanisms of tumor occurrence, development, and evolution. This article shows how to go from SNV data to the mutational signatures characteristic of a set of tumors.

Data Preparation

The required input is SNV data with the fields “Sample”, “chr”, “start”, “ref”, and “alt”, which correspond to the sample, chromosome, variant position, reference base, and alternate base respectively. I processed the downloaded TCGA data with Python.

Raw data:

import pandas as pd

df = pd.read_csv("TCGA.CRC.mutect.maf.csv")
df.head()
"""
  Hugo_Symbol  Entrez_Gene_Id  ... MC3_Overlap GDC_Validation_Status
0      UBE2J2          118424  ...        True               Unknown
1       RPL22            6146  ...        True               Unknown
2     TNFRSF9            3604  ...        True               Unknown
3     EXOSC10            5394  ...       False               Unknown
4      PTCHD2           57540  ...        True               Unknown
"""
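# First keep only single-nucleotide variants (labeled 'SNP' in the Variant_Type column)
df = df[df.Variant_Type == 'SNP']
# Select the sample, chromosome, start position, ref, and alt columns by position
# (these indices depend on the column layout of this particular file)
newdf = df.iloc[:, [33, 4, 5, 11, 12]].copy()
newdf.columns = ["Sample", "chr", "start", "ref", "alt"]
# Keep only the first base of ref
newdf["ref"] = newdf["ref"].str[:1]
# Remove rows where ref and alt are the same to avoid errors later
newdf = newdf[newdf.ref != newdf.alt]
# Save data for R analysis
newdf.to_csv("TCGA.CRC.mutect.maf.filtered.csv", index=False)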
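
Selecting columns by position depends on the exact layout of the input file. As a minimal sketch, the same filtering step can be written with column names, assuming the file uses the standard GDC MAF headers (Tumor_Sample_Barcode, Chromosome, Start_Position, Reference_Allele, Tumor_Seq_Allele2). The helper name maf_to_snv_table is hypothetical, and the result may differ from the positional selection above if those positions map to other columns in your file:

```python
import pandas as pd

# Assumed GDC MAF column names, mapped to the names used in this article.
MAF_COLUMNS = {
    "Tumor_Sample_Barcode": "Sample",
    "Chromosome": "chr",
    "Start_Position": "start",
    "Reference_Allele": "ref",
    "Tumor_Seq_Allele2": "alt",
}

def maf_to_snv_table(maf: pd.DataFrame) -> pd.DataFrame:
    """Keep single-base substitutions and return Sample/chr/start/ref/alt."""
    # MAF files label single-nucleotide variants as 'SNP' in Variant_Type.
    snv = maf.loc[maf["Variant_Type"] == "SNP", list(MAF_COLUMNS)]
    snv = snv.rename(columns=MAF_COLUMNS)
    # Keep rows where ref and alt are single, distinct A/C/G/T bases.
    bases = {"A", "C", "G", "T"}
    valid = (
        snv["ref"].isin(bases)
        & snv["alt"].isin(bases)
        & (snv["ref"] != snv["alt"])
    )
    return snv[valid].reset_index(drop=True)
```

Calling maf_to_snv_table(df) on the raw table and saving the result with to_csv would then replace the positional selection.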

Signature Selection

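We use R for the analysis because the SomaticSignatures package is required. SomaticSignatures, SomaticCancerAlterations, and the BSgenome reference are Bioconductor packages, so they are installed with BiocManager rather than install.packages. First, install the necessary packages:

install.packages(c("BiocManager","data.table","ggplot2"))
BiocManager::install(c("SomaticSignatures","SomaticCancerAlterations","BSgenome.Hsapiens.UCSC.hg38"))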

Import packages:

suppressPackageStartupMessages(library(SomaticSignatures))
suppressPackageStartupMessages(library(SomaticCancerAlterations))
suppressPackageStartupMessages(library(BSgenome.Hsapiens.UCSC.hg38))
suppressPackageStartupMessages(library(data.table))
suppressPackageStartupMessages(library(ggplot2))

Read data and convert it into the format needed for mutationContext:

df = fread('TCGA.CRC.mutect.maf.filtered.csv', data.table = FALSE)
# Columns: "Sample", "chr", "start", "ref", "alt"
alls = as.character(unique(df$Sample))
# Use the sample as the grouping variable ("study") for motifMatrix
df$study = df$Sample

sca_vr = VRanges(
  seqnames = df$chr,
  # An SNV covers a single base, so start and end are the same position
  ranges = IRanges(start = df$start, end = df$start),
  ref = df$ref,
  alt = df$alt,
  sampleNames = as.character(df$Sample),
  study = as.character(df$study))

Run mutationContext, build the motif matrix, and assess how much variance each candidate number of signatures explains:

# Extract the sequence context of each SNV from the reference genome
sca_motifs = mutationContext(sca_vr, BSgenome.Hsapiens.UCSC.hg38)
head(sca_motifs)
# For each sample, calculate the relative frequency of the 96 mutation types
escc_sca_mm = motifMatrix(sca_motifs, group = "study", normalize = TRUE)
dim(escc_sca_mm)
table(colSums(escc_sca_mm))
head(escc_sca_mm[, 1:4])
# Assess 5 to 15 signatures. This assessNumberSignatures step is very time-consuming,
# so the result is saved and can be reloaded later without recomputing it.
n_sigs = 5:15
gof_nmf = assessNumberSignatures(escc_sca_mm, n_sigs, nReplicates = 5)
save(gof_nmf, file = 'gof_nmf.Rdata')
load(file = 'gof_nmf.Rdata')
pdf("plotNumberSignatures.pdf", width = 18, height = 7)
plotNumberSignatures(gof_nmf)
dev.off()
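
To make the 96 mutation types counted per sample more concrete, here is a small Python sketch that enumerates them: 6 substitution classes, written with a pyrimidine (C or T) reference base, times 4 possible 5' neighbors and 4 possible 3' neighbors. The function names are hypothetical, and the label format ("CA A.A") is only meant to resemble the row names of the motif matrix; it is an illustration, not the package's implementation:

```python
from itertools import product

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")
# Six substitution classes, all expressed with a pyrimidine reference base.
SUBSTITUTIONS = ["CA", "CG", "CT", "TA", "TC", "TG"]

def mutation_types():
    """Return all 96 labels: substitution plus 5' and 3' neighboring base."""
    return [
        f"{sub} {five}.{three}"
        for sub in SUBSTITUTIONS
        for five, three in product(BASES, repeat=2)
    ]

def classify(ref, alt, five, three):
    """Map an SNV and its flanking bases to one of the 96 types.

    Variants with a purine reference (A or G) are converted to the
    opposite strand, which also swaps and complements the neighbors.
    """
    if ref in "AG":
        ref, alt = ref.translate(COMPLEMENT), alt.translate(COMPLEMENT)
        five, three = three.translate(COMPLEMENT), five.translate(COMPLEMENT)
    return f"{ref}{alt} {five}.{three}"
```

For example, a G>A change flanked by T (5') and C (3') is the same event as C>T flanked by G and A on the opposite strand, so classify("G", "A", "T", "C") gives the label "CT G.A".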

The results show that once the number of signatures reaches 13, about 99% of the variance is explained.
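
To illustrate what "variance explained" means for a given number of signatures, here is a minimal NumPy sketch that factorizes a motif matrix M (mutation types by samples) as W times H with basic multiplicative-update NMF, then computes 1 - RSS / sum(M^2). This definition of explained variance is an assumption chosen for illustration, the data are synthetic, and the code is not the implementation used by assessNumberSignatures:

```python
import numpy as np

def nmf(M, rank, n_iter=500, seed=0, eps=1e-12):
    """Factorize M ~ W @ H with multiplicative updates (Frobenius loss)."""
    rng = np.random.default_rng(seed)
    n, m = M.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ M) / (W.T @ W @ H + eps)
        W *= (M @ H.T) / (W @ H @ H.T + eps)
    return W, H

def explained_variance(M, W, H):
    """1 - RSS / sum(M^2); values near 1 mean W @ H reproduces M well."""
    rss = np.sum((M - W @ H) ** 2)
    return 1.0 - rss / np.sum(M ** 2)

# Synthetic example: 20 samples built from 3 hypothetical signatures.
rng = np.random.default_rng(1)
true_W = rng.dirichlet(np.full(96, 0.1), size=3).T   # 96 x 3 signatures
true_H = rng.dirichlet(np.ones(3), size=20).T        # 3 x 20 contributions
M = true_W @ true_H                                  # columns sum to 1

evar_by_rank = {k: explained_variance(M, *nmf(M, k)) for k in (1, 2, 3, 4)}
for k, ev in evar_by_rank.items():
    print(k, round(float(ev), 4))
```

The number of signatures is usually chosen where this curve flattens out, since adding further signatures beyond that point mostly fits noise.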

Signature Plotting

Finally, we extract somatic mutation signatures based on the previous selection:

sigs_nmf = identifySignatures(escc_sca_mm, 11, nmfDecomposition)
save(escc_sca_mm, sigs_nmf, file = 'escc_denovo_results.Rdata')

load(file = 'escc_denovo_results.Rdata')
str(sigs_nmf)
library(ggplot2)
pdf("maf.pdf", width = 18, height = 7)
# Heatmap of the signatures across the 96 mutation types
plotSignatureMap(sigs_nmf) + ggtitle("Somatic Signatures: NMF - Heatmap")
# Barchart of each signature, one row per signature and one column per alteration
plotSignatures(sigs_nmf, normalize = TRUE) +
  ggtitle("Somatic Signatures: NMF - Barchart") +
  facet_grid(signature ~ alteration, scales = "free_y")
dev.off()

The final plots (the signature heatmap and barchart) are written to maf.pdf.

Summary

This article demonstrates how to extract de novo mutational signatures from tumor SNV data (either your own patient data or public TCGA data), rather than relying on the 30 tumor mutation signatures from the COSMIC database. This can help experts in specific tumor types find more detailed and meaningful results.
