Working with DNA sequences and features in R with Bioconductor

Bioconductor

Bioconductor is a repository for R packages relating to high-throughput biology (shotgun sequencing, microarrays, etc).

Started in 2001, has developed alongside technology and growth of data
Core set of types promotes interoperation
(Not just tidy data frames! Types have evolved over time to meet new needs.)
Twice yearly release schedule
Many contributors, but fairly good coding and documentation standards

Flavours of R

Base R

Functions and types that are always available in R.

CRAN (17,264 packages as at March 2021)

“Comprehensive R Archive Network” - main source of R packages

install.packages("glmnet")
install.packages("tidyverse")

Bioconductor (1,974 packages as at March 2021)

# Setup
install.packages("BiocManager")

# Install a Bioconductor package
BiocManager::install("limma")

# Check for out-of-date packages and offer to update them
BiocManager::install()

What Bioconductor is for

Genome and genome features: genes, regions, motifs, peaks, primers, SNPs ← focus for today

e.g. for ChIP-seq, ATAC-seq, variant calling

Biostrings, GenomicRanges, GenomicFeatures, …

Differential gene expression from microarray or RNA-seq

SummarizedExperiment, limma, edgeR, DESeq2, …
Further packages to normalize, impute, batch correct, check quality, …

Single-cell gene expression, etc

SingleCellExperiment, scater, scran, …
“Orchestrating Single-Cell Analysis with Bioconductor” book

Visualization

ComplexHeatmap, Gviz, ggbio, …

Statistical methods for $p \gg n$ data

…

More types

Can do a lot in R with vectors and data frames.

Using Bioconductor will mean building familiarity with further types:

matrix, list
DNAString, DNAStringSet, GRanges ← focus for today

… Seqinfo, TxDb, EnsDb, OrgDb, VCF, SummarizedExperiment, DelayedArray, …
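
A minimal sketch of constructing the three focus types, assuming Biostrings and GenomicRanges are installed. The sequences, names, and coordinates are made up for illustration.

```r
library(Biostrings)
library(GenomicRanges)

# A single DNA sequence
dna <- DNAString("AACCGT")

# A named collection of DNA sequences (names are hypothetical)
seqs <- DNAStringSet(c(first = "ACGT", second = "GGATCC"))

# Genomic ranges: sequence name, 1-based inclusive start and end, strand
gr <- GRanges(
    seqnames = c("chr1", "chr1", "chr2"),
    ranges = IRanges(start = c(5, 20, 7), end = c(10, 30, 12)),
    strand = c("+", "-", "+"))

# Each type comes with its own operations
reverseComplement(dna)
width(seqs)
gr
```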

✋ S3 types

Most base R packages use “S3” types. Data frames, tibbles, and linear model fits are examples of these.

essentially lists
if necessary, peek inside with $
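
As a small illustration using only base R and the built-in cars dataset, a linear model fit is an S3 object: a list with a class attribute, whose components can be reached with $.

```r
# Fit a linear model to a dataset that ships with R
fit <- lm(dist ~ speed, data = cars)

class(fit)      # the S3 class attribute
is.list(fit)    # underneath, it is a list
isS4(fit)       # and not an S4 object
names(fit)      # the list components

# Peek inside with $
fit$coefficients

# Where an accessor function exists, it is usually the better choice
coef(fit)
```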

👉👉 S4 types

Bioconductor uses “S4” types, including its own data frame (DataFrame) and list types (SimpleList, GRangesList, etc.). If stuck, you can almost always convert to base R types with as.data.frame, as.list, as.character, as.numeric.

use accessor functions such as seqnames, start, end, width, nchar
if absolutely necessary, peek inside with @
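
A short sketch of these points on a small GRanges object, assuming GenomicRanges is installed. The coordinates are made up.

```r
library(GenomicRanges)

gr <- GRanges(
    seqnames = "chr1",
    ranges = IRanges(start = c(5, 20), end = c(10, 30)),
    strand = c("+", "-"))

isS4(gr)

# Preferred: accessor functions
seqnames(gr)
start(gr)
end(gr)
width(gr)

# If stuck: convert to a base R type
df <- as.data.frame(gr)
df

# Last resort: look at a slot directly with @
gr@ranges
```

Accessors are the safer route because slot names are an implementation detail and may change between package versions.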

Stuart Lee’s guide to S4 for the perplexed

(do workshop)

Reference genomes and annotations

“High-throughput” biological data analysis usually occurs in the context of:

a genome assembly
gene and transcript annotations

For model organisms such as human and mouse there is a genome assembly most people use, updated infrequently, and several slightly different sets of gene and transcript annotations from different sources, updated much more frequently.

Available from:

🇺🇸 The NCBI’s RefSeq
🇪🇺 The EBI’s Ensembl genome browser
The UCSC genome browser

Genome browsers

On the web:

UCSC
Ensembl

On your desktop:

Integrative Genomics Viewer
- Can view many file types.
- View very large files such as BAM or bigWig.

Files in bioinformatics

What	File types	R types
DNA sequence	FASTA(I), FASTQ, 2bit(R)	Biostrings::DNAStringSet, BSgenome(R), rtracklayer::TwoBitFile(R)
Amino acid sequence	FASTA	Biostrings::AAStringSet
Genomic features	GTF(I), GFF(I), BED(I)	GenomicRanges::GRanges, GenomicFeatures::TxDb(R), ensembldb::EnsDb(R)
Read alignments	SAM, BAM(I)	GenomicAlignments::GAlignments, Rsamtools::BamFile(I)
Numeric data along a genome	wiggle, bigWig(R)	list of numeric vectors, IRanges::RleList, rtracklayer::BigWigFile(R)
Variant calls	VCF(I)	VariantAnnotation::VCF
Numeric matrix (gene expression, etc)	CSV, TSV, HDF5(R)	matrix, HDF5Array(R), DelayedArray(R), SummarizedExperiment(R), SingleCellExperiment(R), …

(plus many more)

(R) random access to large files
(I) random access with an accompanying index file
Prefer these file types!

rtracklayer::import() can read many file types.
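
A minimal sketch of import(), assuming rtracklayer is installed. It writes a tiny made-up BED file to a temporary location so the example is self-contained, then reads it back as a GRanges.

```r
library(GenomicRanges)
library(rtracklayer)

# Write a two-line BED file (hypothetical features) to a temporary path
bed_file <- tempfile(fileext = ".bed")
writeLines(c(
    "chr1\t10\t20\tfeature_a\t0\t+",
    "chr2\t100\t150\tfeature_b\t0\t-"),
    bed_file)

# The file format is guessed from the file extension
features <- import(bed_file)
features

# BED starts are 0-based, GRanges starts are 1-based,
# so import() shifts each start by one
start(features)
end(features)
features$name
```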

Command-line bioinformatics software

Not all bioinformatics software is an R package!

R’s role will often be to massage your data into the form needed for command line tools, or to examine a tool’s output.
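
For instance, many command-line tools take a FASTA file as input. A minimal sketch, assuming Biostrings is installed, of writing sequences out for such a tool and reading a FASTA file back in. The sequence names and contents are made up.

```r
library(Biostrings)

# Sequences prepared in R (hypothetical peak sequences)
seqs <- DNAStringSet(c(peak_1 = "ACGTACGTGG", peak_2 = "TTGACCAGTA"))

# Write them to a FASTA file for a command-line tool to read
fasta_file <- tempfile(fileext = ".fa")
writeXStringSet(seqs, fasta_file)

# ... the command-line tool would be run on fasta_file here, outside R ...

# Reading a FASTA file into R works the same way in reverse
round_trip <- readDNAStringSet(fasta_file)
round_trip
```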

One way to install command-line software is using the Conda package manager:

Install miniconda
Use the bioconda channel

# Example
conda install -c bioconda meme

Can be used on your own computer, or on a server or cluster.
Doesn’t need admin rights.

(continue workshop)

Using Bioconductor

Find some useful packages
Read the vignettes
Read the reference documentation for specific functions you need
If you run into a funny class, check its documentation, work out the accessor functions, and in a pinch poke around its internals with @ or try as.data.frame.

?"GRanges-class"

methods(class="GRanges")
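
A few more ways to get oriented with an unfamiliar class, sketched on a made-up GRanges object and assuming GenomicRanges is installed. The commented-out lines open documentation and are meant for interactive use.

```r
library(GenomicRanges)

gr <- GRanges("chr1", IRanges(start = 1, end = 10))

# What is it, and which classes does it inherit from?
class(gr)
is(gr)

# Which slots does it have? (Prefer accessors to using these with @)
slotNames(gr)

# Interactive use: list or browse a package's vignettes and help index
# vignette(package = "GenomicRanges")
# browseVignettes("GenomicRanges")
# help(package = "GenomicRanges")
```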

The Bioconductor website includes material from many and various tutorials/vignettes/workflows
Most downloaded Bioconductor packages
Mike Love’s Bioconductor cheat sheet
plyranges provides a “tidy” way of working with GRanges with many powerful features, developed by Dr. Stuart Lee at Monash. A plyranges workshop
Bioconductor’s Stack-Overflow-style support site