Bioconductor

Bioconductor is a repository for R packages relating to high-throughput biology (shotgun sequencing, microarrays, etc).

  • Started in 2001, has developed alongside technology and growth of data

  • Core set of types promotes interoperation
    (Not just tidy data frames! Types have evolved over time to meet new needs.)

  • Twice yearly release schedule

  • Many contributors, but fairly good coding and documentation standards

Flavours of R

Base R

Functions and types that are always available in R.

CRAN (17,264 packages as at March 2021)

“Comprehensive R Archive Network” - main source of R packages

install.packages("glmnet")
install.packages("tidyverse")

Bioconductor (1,974 packages as at March 2021)

# Setup
install.packages("BiocManager")

# Install a Bioconductor package
BiocManager::install("limma")

# Check for out-of-date packages and offer to update them
BiocManager::install()
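
A script that depends on several packages can check what is missing before installing anything. A minimal sketch using base R plus the BiocManager calls shown above; `missing_packages` is a hypothetical helper name, not part of BiocManager, and the interactive-only guard is an assumption about how you want batch jobs to behave:

```r
# Hypothetical helper: return the packages in `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("limma", "GenomicRanges")
to_install <- missing_packages(needed)

# Install only when something is missing, and only in an interactive
# session, so that batch jobs never change the package library silently.
if (interactive() && length(to_install) > 0) {
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  BiocManager::install(to_install)
}
```

BiocManager::install() can install CRAN packages as well as Bioconductor packages, so one call covers both.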

What Bioconductor is for

Genome and genome features: genes, regions, motifs, peaks, primers, SNPs ← focus for today

e.g. for ChIP-seq, ATAC-seq, variant calling

  • Biostrings, GenomicRanges, GenomicFeatures, …

Differential gene expression from microarray or RNA-seq

  • SummarizedExperiment, limma, edgeR, DESeq2, …
  • Further packages to normalize, impute, batch correct, check quality, …

Single-cell gene expression, etc

Visualization

  • ComplexHeatmap, Gviz, ggbio, …

Statistical methods for \(p \gg n\) data

More types

Can do a lot in R with vectors and data frames.

Using Bioconductor will mean building familiarity with further types:

  • matrix, list

  • DNAString, DNAStringSet, GRanges ← focus for today

Seqinfo, TxDb, EnsDb, OrgDb, VCF, SummarizedExperiment, DelayedArray, …
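
As a reminder of the two base R types named above, here is a small sketch. The gene and sample names are made up for illustration:

```r
# A matrix holds one type of value in rows and columns,
# e.g. expression values for genes (rows) in samples (columns).
# matrix() fills column by column.
counts <- matrix(
  c(10, 0, 5,    # sample1
    12, 1, 7),   # sample2
  nrow = 3,
  dimnames = list(c("geneA", "geneB", "geneC"), c("sample1", "sample2"))
)
counts["geneA", "sample2"]   # a single value, by row and column name
colSums(counts)              # total per sample

# A list can hold items of different types and lengths.
result <- list(name = "geneA", exons = c(100, 250, 900), coding = TRUE)
result$exons                 # access an element by name with $
result[["exons"]][2]         # or with [[ ]], then index into the vector
```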

✋      S3 types

Most base R and CRAN packages use “S3” types. Data frames, tibbles, and linear model fits are examples of these.

  • essentially lists
  • if necessary, peek inside with $
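
The two points above can be sketched with a linear model fit on the built-in `cars` data set (base R only):

```r
# Fit a linear model; the result is an S3 object.
fit <- lm(dist ~ speed, data = cars)

class(fit)      # the S3 class is just an attribute on the object
is.list(fit)    # underneath, the fit is an ordinary list
names(fit)      # the list elements that $ can reach

# Preferred: use the documented accessor function.
coef(fit)

# If necessary: peek inside with $ to get the same values.
fit$coefficients
```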


👉👉  S4 types

Bioconductor uses “S4” types, including its own data frame (DataFrame) and list types (SimpleList, GRangesList, etc). If stuck, you can almost always convert to base R types with as.data.frame, as.list, as.character, as.numeric.

  • use accessor functions such as seqnames, start, end, width, nchar
  • if absolutely necessary, peek inside with @
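
A minimal sketch of working with an S4 object through accessors, assuming the Biostrings package is installed. The sequences and their names are made up for illustration:

```r
library(Biostrings)

# Two short made-up sequences.
seqs <- DNAStringSet(c(first = "ACGTACGT", second = "GGCC"))

isS4(seqs)           # TRUE: this is an S4 object, not a list

# Accessor functions are the supported way in.
names(seqs)
width(seqs)          # length of each sequence
nchar(seqs)          # same information for a DNAStringSet

# If stuck, convert to a base R type.
as.character(seqs)   # named character vector

# slotNames() lists what @ could reach; treat these as internals.
slotNames(seqs)
```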


Stuart Lee’s guide to S4 for the perplexed

(do workshop)

Reference genomes and annotations

“High-throughput” biological data analysis usually occurs in the context of:

  • a genome assembly
  • gene and transcript annotations

For model organisms such as human and mouse there is a genome assembly most people use, updated infrequently, and several slightly different sets of gene and transcript annotations from different sources, updated much more frequently.


Available from:

Genome browsers

Files in bioinformatics

What | File types | R types
DNA sequence | FASTA(I), FASTQ, 2bit(R) | Biostrings::DNAStringSet, BSgenome(R), rtracklayer::TwoBitFile(R)
Amino acid sequence | FASTA | Biostrings::AAStringSet
Genomic features | GTF(I), GFF(I), BED(I) | GenomicRanges::GRanges, GenomicFeatures::TxDb(R), ensembldb::EnsDb(R)
Read alignments | SAM, BAM(I) | GenomicAlignments::GAlignments, Rsamtools::BamFile(I)
Numeric data along a genome | wiggle, bigWig(R) | list of numeric vectors, IRanges::RleList, rtracklayer::BigWigFile(R)
Variant calls | VCF(I) | VariantAnnotation::VCF
Numeric matrix (gene expression, etc) | CSV, TSV, HDF5(R) | matrix, HDF5Array(R), DelayedArray(R), SummarizedExperiment(R), SingleCellExperiment(R), …

(plus many more)

(R) random access to large files
(I) random access with an accompanying index file
Prefer file types marked (R) or (I)!

rtracklayer::import() can read many file types.
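
A small sketch of import() on a BED file, assuming the rtracklayer package is installed. The two features are made up and written to a temporary file so the example is self-contained:

```r
library(rtracklayer)

# Write a tiny tab-separated BED file (made-up features).
bed_path <- tempfile(fileext = ".bed")
writeLines(c(
  "chr1\t10\t20\tpeak1\t0\t+",
  "chr1\t50\t80\tpeak2\t0\t-"
), bed_path)

# import() works out the format from the file extension.
# For BED it returns a GRanges.
peaks <- import(bed_path)
peaks

# BED starts are 0-based and GRanges starts are 1-based,
# so a BED start of 10 becomes 11. Ends are unchanged.
start(peaks)
end(peaks)
peaks$name
```

Letting import() do the coordinate conversion avoids off-by-one errors that are easy to make when reading BED files as plain tables.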

Command-line bioinformatics software

Not all bioinformatics software is an R package!

R’s role will often be to massage your data into the form needed for command line tools, or to examine a tool’s output.


One way to install command-line software is using the Conda package manager:

# Example
conda install -c bioconda meme

  • Can be used on your own computer, or on a server or cluster.
  • Doesn’t need admin rights.
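
R's side of this hand-off can be sketched as follows, assuming the Biostrings package is installed. The sequences are made up, and the `meme` call (with its `-dna` and `-oc` options) is only attempted if the tool is found on the PATH; check `meme -h` for the options your version supports:

```r
library(Biostrings)

# Made-up sequences standing in for, say, sequences under peaks.
seqs <- DNAStringSet(c(peak1 = "ACGTTGCAACGT", peak2 = "TTGACGTCAAGG"))

# Massage the data into the form the tool expects: a FASTA file.
fasta_path <- tempfile(fileext = ".fasta")
writeXStringSet(seqs, fasta_path)

# Run the command-line tool only if it is installed.
if (nzchar(Sys.which("meme"))) {
  out_dir <- tempfile("meme_out_")
  status <- system2("meme", c(shQuote(fasta_path), "-dna", "-oc", shQuote(out_dir)))
  if (status != 0) warning("meme exited with status ", status)
}

# Reading the file back shows exactly what the tool was given.
readDNAStringSet(fasta_path)
```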

(continue workshop)

Using Bioconductor

  1. Find some useful packages
  2. Read the vignettes
  3. Read the reference documentation for specific functions you need
  4. If you run into a funny class, check its documentation, work out the accessor functions, and in a pinch poke around its internals with @ or try as.data.frame.

?"GRanges-class"

methods(class="GRanges")
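
Step 4 can be sketched on a small GRanges object, assuming the GenomicRanges package is installed. The ranges and the `gene` column are made up for illustration:

```r
library(GenomicRanges)

# Three made-up ranges with one extra metadata column.
gr <- GRanges(
  seqnames = c("chr1", "chr1", "chr2"),
  ranges = IRanges(start = c(101, 201, 301), end = c(150, 260, 400)),
  strand = c("+", "-", "+"),
  gene = c("a", "b", "c")
)

# What is this object?
class(gr)
is(gr)            # the class plus everything it inherits from

# Work out the accessor functions.
seqnames(gr)
start(gr)
width(gr)
mcols(gr)         # metadata columns, here just `gene`

# In a pinch: list the internals that @ could reach ...
slotNames(gr)

# ... or convert to a plain data frame.
as.data.frame(gr)
```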