Working with DNA sequences and features in R with Bioconductor

Bioconductor is a repository for R packages relating to high-throughput biology (shotgun sequencing, microarrays, etc.).

Started in 2001, has developed alongside technology and growth of data
Core set of types promotes interoperation
(Not just tidy data frames! Types have evolved over time to meet new needs.)
Twice yearly release schedule
Many contributors, but fairly good coding and documentation standards

Flavours of R

Base R

Functions and types that are always available in R.

CRAN (17,264 packages as at March 2021)

“Comprehensive R Archive Network” - main source of R packages

install.packages("glmnet")
install.packages("tidyverse")

Bioconductor (1,974 packages as at March 2021)

# Setup
install.packages("BiocManager")

# Install a Bioconductor package
BiocManager::install("limma")

# Check for out-of-date packages
BiocManager::install()

What Bioconductor is for

Genome and genome features: genes, regions, motifs, peaks, primers, SNPs ← focus for today

e.g. for ChIP-seq, ATAC-seq, variant calling

Biostrings, GenomicRanges, GenomicFeatures, VariantAnnotation, …

Differential testing of microarray, RNA-seq, and other high-throughput data

SummarizedExperiment, limma, edgeR, DESeq2, …
Further packages to normalize, impute, batch correct, …

Single-cell gene expression, etc

SingleCellExperiment, scater, scran, …
“Orchestrating Single-Cell Analysis with Bioconductor” book

Visualization

ComplexHeatmap, Gviz, ggbio, Glimma, …

Statistical methods for $p \gg n$ data

…

More types

Can do a lot in R with vectors and data frames.

Using Bioconductor will mean building familiarity with further types:

matrix, list
DNAString, DNAStringSet, GRanges ← focus for today

… Seqinfo, TxDb, EnsDb, OrgDb, VCF, SummarizedExperiment, DelayedArray, …

✋ S3 classes

Most base R and CRAN packages use “S3” classes. Examples include data.frame, tibble, and linear model fits.

usually a list vector underneath
if necessary, peek inside with $
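
A minimal sketch of what "a list underneath" means, using a linear model fit on R's built-in cars data set (the data set and variable names are chosen only for illustration):

```r
# Fit a linear model; the result is an S3 object of class "lm".
fit <- lm(dist ~ speed, data = cars)

class(fit)      # the S3 class attribute
is.list(fit)    # an "lm" object is a list underneath
names(fit)      # the components that can be reached with $

# Peek inside with $ ...
fit$coefficients

# ... although an accessor function, where one exists, is preferable.
coef(fit)
```

Here coef() and $coefficients give the same values; the accessor is just less likely to break if the internals change.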

👉👉 S4 classes

Bioconductor uses “S4” classes, including novel classes such as GRanges and S4 equivalents of familiar classes such as DataFrame. Can usually convert to base R types with as.data.frame(), as.list(), as.character(), as.numeric(), as().

use accessor functions such as seqnames, start, end, width, nchar
if absolutely necessary, peek inside with @
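
As a rough sketch of the accessor-first habit, here is a small GRanges object built by hand (the coordinates are invented for illustration, and the GenomicRanges package is assumed to be installed):

```r
library(GenomicRanges)

# Two features on chr1 (made-up coordinates).
gr <- GRanges(
    seqnames = "chr1",
    ranges = IRanges(start = c(101, 201), end = c(150, 260)),
    strand = c("+", "-"))

isS4(gr)        # GRanges is an S4 class

# Accessor functions are the supported way to get at the contents.
seqnames(gr)
start(gr)
end(gr)
width(gr)       # end - start + 1

# Convert to a base R type when needed.
df <- as.data.frame(gr)

# If absolutely necessary, list the slots and peek inside with @.
slotNames(gr)
gr@ranges
```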

Stuart Lee’s guide to S4 for the perplexed

(do workshop)

Genome browsers

On the web:

On your desktop:

Integrative Genomics Viewer
- Can view many file types, and very large files such as BAM files.

Files in bioinformatics

What	File types	R types
DNA sequence	FASTA^I, FASTQ, 2bit^R	Biostrings::DNAStringSet, BSgenome^R, rtracklayer::TwoBitFile^R
Amino acid sequence	FASTA	Biostrings::AAStringSet
Genomic features	GTF^I, GFF^I, BED^I	GenomicRanges::GRanges, GenomicFeatures::TxDb^R, ensembldb::EnsDb^R
Read alignments	SAM, BAM^I	GenomicAlignments::GAlignments, Rsamtools::BamFile^I
Numeric data along a genome	wiggle, bigWig^R	list of numeric vectors, IRanges::RleList, rtracklayer::BigWigFile^R
Variant calls	VCF^I	VariantAnnotation::VCF
Numeric matrix (gene expression, etc)	CSV, TSV, HDF5^R	matrix, HDF5Array^R, DelayedArray^R, SummarizedExperiment^R, SingleCellExperiment^R, …

(plus many more)

R = random access to large files
I = random access with an accompanying index file
Prefer these file types!
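
To illustrate random access with an index file, the sketch below writes a tiny FASTA file, indexes it, and then fetches only a small region. The sequence and file are made up for the example, and the Biostrings, Rsamtools and GenomicRanges packages are assumed to be installed.

```r
library(Biostrings)
library(Rsamtools)
library(GenomicRanges)

# Write a tiny FASTA file (a made-up sequence) and build its .fai index.
fa_path <- tempfile(fileext = ".fa")
writeXStringSet(DNAStringSet(c(chr1 = "ACGTACGTAC")), fa_path)
indexFa(fa_path)

# Fetch only bases 3 to 5 of chr1, without reading the whole file.
region <- GRanges("chr1", IRanges(start = 3, end = 5))
piece <- scanFa(fa_path, param = region)
piece
```

With a real genome the file is gigabytes in size, but the request for one region stays fast because the index says where in the file to look.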

rtracklayer::import() can read many file types.

R indexing is 1-based

1-based

R, GFF and GTF feature files,
SAM alignment files, …

   +---+---+---+---+---+---+
   | A | C | G | T | A | C |
   +---+---+---+---+---+---+
     1   2   3   4   5   6
             |       |
             |<----->|        
    
    seq[3] == "G"   seq[3:5] == "GTA"
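
The diagram can be tried out with a Biostrings DNAString, which supports [ ] indexing by position (a small sketch, assuming the Biostrings package is installed; the variable name seq simply follows the diagram):

```r
library(Biostrings)

seq <- DNAString("ACGTAC")

seq[3]                            # one-letter DNAString: G
seq[3:5]                          # three-letter DNAString: GTA
subseq(seq, start = 3, end = 5)   # same region, via subseq()

# A plain character string needs substr() rather than [ ].
substr("ACGTAC", 3, 5)
```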

0-based

Python, C,
BED feature files, …

   +---+---+---+---+---+---+
   | A | C | G | T | A | C |
   +---+---+---+---+---+---+
   0   1   2   3   4   5   6
           |           |
           |<--------->|      

   seq[2] == "G"    seq[2:5] == "GTA"

Bioconductor file import functions will convert all data to 1-based.
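
A quick way to see the conversion, assuming the rtracklayer package is installed: write a one-line BED file holding the 0-based interval from the diagram above and import it.

```r
library(rtracklayer)

# A one-line BED file: 0-based start 2, end 5 (made up to match the diagram).
bed_path <- tempfile(fileext = ".bed")
writeLines("chr1\t2\t5", bed_path)

gr <- import(bed_path)
start(gr)   # 3: the start has been shifted to 1-based
end(gr)     # 5
width(gr)   # 3
```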

Need to do own conversion if using read.table(), etc.
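
A sketch of doing the conversion by hand in base R, for BED-style (0-based, end-exclusive) coordinates. The input text and column names are invented for the example.

```r
# BED-like text with 0-based, end-exclusive coordinates (made-up example).
bed_text <- "chr1\t2\t5\nchr1\t10\t20"
bed <- read.table(text = bed_text, sep = "\t",
                  col.names = c("chrom", "start0", "end0"))

# 0-based end-exclusive -> 1-based end-inclusive:
# add 1 to the start, keep the end as it is.
bed$start <- bed$start0 + 1L
bed$end   <- bed$end0
bed$width <- bed$end - bed$start + 1L

bed[, c("chrom", "start", "end", "width")]
```

Only the start changes: the 0-based exclusive end and the 1-based inclusive end are the same number.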

Reference genomes and annotations

“High-throughput” biological data analysis usually occurs in the context of:

a genome assembly ← Updated infrequently. RefSeq, Ensembl, UCSC all use same sequences.
gene and transcript annotations ← Updated frequently. Different between RefSeq, Ensembl, UCSC.

Main sources for model organisms:

🇺🇸 The NCBI’s RefSeq
🇪🇺 The EBI’s Ensembl genome browser
The UCSC genome browser

… UCSC was the original web-based genome browser. UCSC’s “KnownGene” gene annotations used to be the cutting edge gene annotation source, but UCSC now relies on other sources for gene annotations. Many file types that remain important today were developed for the UCSC genome browser, such as “bed”, “bigWig”, and “2bit”.

Genome assemblies are released infrequently. GRCh38 (hg38) was released in 2013. The previous assembly, GRCh37 (hg19), was released in 2009. Some people haven’t updated yet, so you will find plenty of data using “hg19” positions! Gene and transcript annotations are updated far more frequently.
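
One way to guard against mixing hg19 and hg38 positions is to tag each set of ranges with its assembly. The sketch below uses made-up regions and assumes the GenomicRanges package is installed; with the genome tags set, an attempt to overlap ranges from different assemblies should be refused with an error rather than silently giving a wrong answer.

```r
library(GenomicRanges)

# Made-up regions, each tagged with the assembly its coordinates refer to.
peaks_hg38 <- GRanges("chr1", IRanges(start = 1000, end = 2000))
genome(peaks_hg38) <- "hg38"

peaks_hg19 <- GRanges("chr1", IRanges(start = 1500, end = 2500))
genome(peaks_hg19) <- "hg19"

genome(peaks_hg38)
genome(peaks_hg19)

# Try to compare ranges from different assemblies and record what happens.
outcome <- tryCatch(
    { findOverlaps(peaks_hg38, peaks_hg19); "compared" },
    error = function(e) "refused")
outcome
```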

As well as the chromosomes in the “primary assembly”, a genome assembly may have further sequences, which may have been added after the initial release:

• patch sequences: fixes that would change the sizes of chromosomes
• alt loci: a way to represent alleles, genetic diversity in the species

(return to workshop)