Bioinformatics at the Center for Cancer Genomics

Contact the Center for Cancer Genomics

CCG Associate Director: Aaron Thorner
DCBC Biopsy Collections: Karla Helvie
Bioinformatics: Tony Anselmo

For general questions: 
617-582-7253 or



The Bioinformatics group consists of data analysts, software engineers, and computational biologists who have developed analytical pipelines to manage, store, annotate, and report on data produced by the Illumina sequencing platforms. We employ a combination of vendor, third-party, and in-house tools and databases to provide data-quality metrics, integrated candidate reports, and relevant biological and clinical context for experimental platform data.

The Bioinformatics team can provide help in:

  • Experimental design
  • Custom bait set design
  • DNA sequence analysis including:
    • Variant detection (SNV, Indel, CNV, and structural variants) and annotation
    • SNP-based sample fingerprinting
  • 10X Genomics data analysis:
    • Single-cell and single-nuclei (sc/sn) Gene expression (GEx)
    • Flex GEx (Fixed RNA Profiling)
    • Multiome (scATAC + GEx)
    • V(D)J T-Cell/B-Cell + GEx
    • Cell cluster identification/annotation
    • HLA typing for 5' GEx
  • Bulk RNA sequence analysis including:
    • Fusion analysis
    • Differential expression
    • Alternative transcript expression
  • Sample QC evaluation and troubleshooting
  • Customized analyses based on your project-specific needs
  • Specialized analysis methods for:
    • PDX models
    • Cell-free DNA
    • Viral DNA detection in tumor samples

We also develop new tools and strategies in a research setting that are then translated to the clinic. Our latest developments include BreaKmer, for detecting structural rearrangements, and RobustCNV, for detecting copy number changes. In addition, the team is developing methods to analyze samples derived from PDX models, cell-free DNA, and single-cell sequencing.

In concert with developing new and updated offerings, the Bioinformatics group has begun porting much of its analytics pipeline infrastructure to the cloud (currently Google Cloud Platform) to increase data processing throughput and to accommodate third parties and collaborators that work primarily in the cloud.

sc/sn Gene Expression (10X Genomics)



10x Genomics offers advanced single-cell and single-nuclei (sc/sn) workflows that enable detailed analysis of individual cells to unravel complex biological systems. Utilizing barcoded gel beads and unique molecular identifiers (UMIs), the 10X platform can capture and barcode RNA from thousands of individual cells in parallel. This allows for precise quantification and profiling of gene expression on a cell-by-cell basis. The resulting data provides insights into cell heterogeneity, identifies rare cell types, and offers a higher-resolution understanding of biological processes compared to traditional bulk RNA sequencing.


Using 10X Cell Ranger software, sc/sn RNA sequencing reads are demultiplexed and aligned to the relevant transcriptome (pre-built references for which are available from 10X) to create cell barcode x gene expression matrix tables. These tables can be used for downstream analysis with open-source sc/sn tools such as Seurat or Scanpy. Cell Ranger also outputs the sc/sn gene expression data in a proprietary Loupe file format, which can be explored with the 10X Loupe Browser.
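To make the idea of a cell barcode x gene matrix concrete, the following minimal Python sketch collapses read records into UMI counts. It illustrates the counting principle only and is not Cell Ranger's implementation, which also performs alignment, barcode and UMI error correction, and cell calling. The function name `build_count_matrix` and the toy barcodes, UMIs, and gene assignments are hypothetical.

```python
from collections import defaultdict


def build_count_matrix(records):
    """Collapse (cell_barcode, umi, gene) read records into UMI counts.

    Reads that share a barcode, UMI, and gene are treated as copies of
    one original molecule and are counted once.
    """
    molecules = defaultdict(set)  # (barcode, gene) -> distinct UMIs seen
    for barcode, umi, gene in records:
        molecules[(barcode, gene)].add(umi)

    matrix = defaultdict(dict)  # barcode -> {gene: UMI count}
    for (barcode, gene), umis in molecules.items():
        matrix[barcode][gene] = len(umis)
    return {barcode: dict(genes) for barcode, genes in matrix.items()}


# Toy input (hypothetical values, for illustration only)
reads = [
    ("AAAC", "UMI1", "CD3E"),
    ("AAAC", "UMI1", "CD3E"),  # duplicate read of the same molecule
    ("AAAC", "UMI2", "CD3E"),
    ("AAAC", "UMI3", "MS4A1"),
    ("TTTG", "UMI1", "MS4A1"),
]
counts = build_count_matrix(reads)
```

Only non-zero entries are stored, which mirrors the sparse representation typically used for single-cell matrices.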

Upon request, CCG Bioinformatics can provide a comprehensive array of data analyses, including quality control, differential expression, cell annotation, HLA typing, and more.

scATAC (10X Genomics)


In addition to sc/sn RNA-seq, 10x Genomics has also developed a single-cell ATAC-seq (Assay for Transposase-Accessible Chromatin) workflow. Unlike traditional bulk ATAC-seq methods, which provide an average view of chromatin accessibility across many cells, the 10x single-cell ATAC-seq technique profiles the regulatory landscape of each cell. This is achieved by combining a transposase, which inserts sequencing adaptors into open chromatin regions, with unique gel bead barcodes that tag the DNA of individual cells. After sequencing, the data is demultiplexed using these barcodes so that chromatin accessibility data is attributed to its originating cell. This technology is a powerful tool for dissecting the complexities of development, tissue heterogeneity, and disease progression at single-cell resolution.


Similar to sc/sn gene expression analysis, 10X Cell Ranger can be used to demultiplex and align scATAC sequencing reads to the relevant genome reference. ATAC peaks called de novo or taken from previously curated datasets can then be used to generate sparse cell barcode x peak read density matrices. A popular option for further downstream analysis of the scATAC data is the open-source software Signac. CCG Bioinformatics can also work with laboratories to provide bespoke data analysis solutions based on specific project goals.
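The cell barcode x peak matrix described above can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the procedure used by the 10X software or Signac: fragments and peaks are given as half-open intervals, every overlapping fragment counts once per peak, and the function name `count_fragments_in_peaks` and all coordinates are hypothetical.

```python
from collections import Counter


def count_fragments_in_peaks(fragments, peaks):
    """Count fragments overlapping each peak, per cell barcode.

    fragments: iterable of (chrom, start, end, barcode)
    peaks:     list of (chrom, start, end)
    Intervals are half-open: [start, end).
    Returns a sparse mapping {(barcode, peak_index): count}.
    """
    counts = Counter()
    for chrom, f_start, f_end, barcode in fragments:
        for idx, (p_chrom, p_start, p_end) in enumerate(peaks):
            # Two half-open intervals overlap when each starts before the other ends.
            if chrom == p_chrom and f_start < p_end and p_start < f_end:
                counts[(barcode, idx)] += 1
    return counts


# Toy input (hypothetical coordinates and barcodes)
peaks = [("chr1", 100, 200), ("chr1", 500, 600)]
fragments = [
    ("chr1", 150, 250, "AAAC"),
    ("chr1", 190, 300, "AAAC"),
    ("chr1", 550, 580, "TTTG"),
    ("chr2", 150, 250, "TTTG"),  # different chromosome, no overlap
    ("chr1", 200, 260, "AAAC"),  # starts exactly at a peak end, no overlap
]
peak_counts = count_fragments_in_peaks(fragments, peaks)
```

The nested loop is quadratic and is used here only for clarity; a real workflow would rely on an interval index or sorted coordinates.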



BreaKmer (Abo et al., 2015) is designed to detect larger genomic structural variants from aligned, target-captured, short-read sequencing data for a single sample. It detects variants whose reads produce split-read alignment signatures, such as inter- and intra-chromosomal rearrangements and insertion/deletion events large enough to split reads. Briefly, the method extracts "misaligned" sequences from a targeted region, such as split reads and unmapped mates, assembles a contig from these reads, and realigns the contig to make a variant call. It classifies detected variants as "insertions/deletions," "tandem duplications," "inversions," and "translocations."

Abo, RP, Ducar, M, Garcia, EP, Thorner, AR, Rojas-Rudilla, V, Lin, L, Sholl, LM, Hahn, WC, Meyerson, M, Lindeman, NI, Van Hummelen, P, MacConaill, LE (2015). BreaKmer: detection of structural variation in targeted massively parallel sequencing data using kmers. Nucleic Acids Research, 43(3): e19.
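The first step described for BreaKmer, extracting "misaligned" reads from a targeted region, can be illustrated with a small Python sketch that flags reads by their CIGAR string. This is not BreaKmer's code: the helper names, the `read_unmapped` flag, and the 10-base `min_clip` threshold are assumptions made for illustration, and the contig assembly and realignment steps are not shown.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clip_lengths(cigar):
    """Return (leading, trailing) soft-clipped base counts for a CIGAR string."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard clips (H) can flank soft clips; drop them before inspecting the ends.
    ops = [(n, op) for n, op in ops if op != "H"]
    if not ops:
        return (0, 0)  # e.g. "*" for an unmapped read
    leading = ops[0][0] if ops[0][1] == "S" else 0
    trailing = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return (leading, trailing)


def is_candidate_read(cigar, read_unmapped=False, min_clip=10):
    """Flag reads that may support a structural variant.

    A read qualifies if it is unmapped (its mate anchors it to the region)
    or if either end is soft-clipped by at least min_clip bases.
    min_clip is a hypothetical threshold chosen for this example.
    """
    leading, trailing = soft_clip_lengths(cigar)
    return read_unmapped or max(leading, trailing) >= min_clip
```

For example, `is_candidate_read("30S70M")` flags a read whose first 30 bases did not align, while a fully aligned `"100M"` read is skipped.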



Copy number variants are now being identified using RobustCNV, a new algorithm developed at the Center for Cancer Genomics (CCG).

RobustCNV relies on localized changes in the mapping depth of sequenced reads to identify changes in copy number at the loci sampled during targeted capture. This strategy includes a normalization step in which systematic bias in mapping depth is reduced or removed by fitting a model against a panel of normals and by removing residual GC bias using a LOESS fit. Normalized coverage data is then segmented using Circular Binary Segmentation (Olshen et al., 2004). Finally, copy number calls are assigned using an adaptive calling strategy that adjusts calling thresholds based on the post-normalization variability in each sample.

This strategy is most effective when the panel of normals contains samples whose pattern of systematic bias closely matches that of the tumor samples. When this is not the case, the normalized data can remain noisy and CNV calls may be difficult to make correctly. For this reason, normal samples of similar tissue quality, age, fixation, and processing to the tumor samples should be included in all studies where identifying copy number variants is an objective.
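The general idea of normalizing against a panel of normals and then applying sample-specific thresholds can be sketched in Python as follows. This is a simplified illustration, not the RobustCNV algorithm: it uses a per-target median of the normals instead of a fitted model, omits GC correction and Circular Binary Segmentation, and calls individual targets instead of segments. The function names, the `n_mads` multiplier, and the `floor` value are hypothetical, and all depths are assumed to be positive.

```python
import math
import statistics


def log2_ratios(tumor_depth, normal_panel):
    """Per-target log2 ratio of tumor depth to the panel-of-normals median.

    Each sample is scaled by its own median depth first, so differences
    in total sequencing yield cancel out. Depths must be positive.
    """
    def scale(depths):
        med = statistics.median(depths)
        return [d / med for d in depths]

    tumor = scale(tumor_depth)
    normals = [scale(sample) for sample in normal_panel]
    # Reference profile: median scaled depth of the normals at each target.
    reference = [statistics.median(col) for col in zip(*normals)]
    return [math.log2(t / r) for t, r in zip(tumor, reference)]


def adaptive_calls(ratios, n_mads=3.0, floor=0.2):
    """Call gain/loss with a threshold that widens for noisier samples.

    The threshold is n_mads robust standard deviations (MAD-based),
    but never smaller than floor. Both defaults are illustrative.
    """
    center = statistics.median(ratios)
    mad = statistics.median([abs(r - center) for r in ratios])
    threshold = max(floor, n_mads * 1.4826 * mad)
    calls = []
    for r in ratios:
        if r - center > threshold:
            calls.append("gain")
        elif r - center < -threshold:
            calls.append("loss")
        else:
            calls.append("neutral")
    return calls


# Toy depths for six targets (hypothetical values)
normal_panel = [
    [100, 100, 100, 100, 100, 100],
    [200, 200, 200, 200, 200, 200],
    [50, 50, 50, 50, 50, 50],
]
tumor = [100, 100, 100, 200, 100, 50]
ratios = log2_ratios(tumor, normal_panel)
calls = adaptive_calls(ratios)
```

Because the threshold scales with the spread of the normalized ratios, a noisy sample needs a larger deviation before a gain or loss is called, which is the intent of an adaptive calling strategy.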