Bioconductor is a repository for R packages relating to high-throughput biology (shotgun sequencing, microarrays, etc).

  • Started in 2001, has developed alongside technology and growth of data

  • Core set of types promotes interoperation
    (Not just tidy data frames! Types have evolved over time to meet new needs.)

  • Twice yearly release schedule

  • Many contributors, but fairly good coding and documentation standards

Flavours of R

Base R

Functions and types that are always available in R.

CRAN (17,264 packages as at March 2021)

“Comprehensive R Archive Network” - main source of R packages

install.packages("glmnet")
install.packages("tidyverse")

Bioconductor (1,974 packages as at March 2021)

# Setup
install.packages("BiocManager")

# Install a Bioconductor package
BiocManager::install("limma")

# Check for out-of-date packages, and offer to update them
BiocManager::install()

What Bioconductor is for

Genome and genome features: genes, regions, motifs, peaks, primers, SNPs ← focus for today

eg for ChIP-seq, ATAC-seq, variant calling

  • Biostrings, GenomicRanges, GenomicFeatures, VariantAnnotation, …

Differential testing of microarray, RNA-seq, and other high-throughput data

  • SummarizedExperiment, limma, edgeR, DESeq2, …
  • Further packages to normalize, impute, batch correct, …

Single-cell gene expression, etc

Visualization

  • ComplexHeatmap, Gviz, ggbio, Glimma, …

Statistical methods for \(p \gg n\) data

More types

Can do a lot in R with vectors and data frames.

Using Bioconductor will mean building familiarity with further types:

  • matrix, list

  • DNAString, DNAStringSet, GRanges ← focus for today

  • Seqinfo, TxDb, EnsDb, OrgDb, VCF, SummarizedExperiment, DelayedArray, …
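
A minimal sketch of the three focus types, assuming the Biostrings and GenomicRanges packages are installed. The sequences and coordinates are made up for illustration.

```r
library(Biostrings)
library(GenomicRanges)

# A single DNA sequence
dna <- DNAString("ACGTAC")
reverseComplement(dna)

# A named collection of DNA sequences
seqs <- DNAStringSet(c(seq1 = "ACGTAC", seq2 = "GGATCC"))
width(seqs)

# Genomic ranges: sequence name, 1-based inclusive start and end, strand
gr <- GRanges("chr1",
              IRanges(start = c(3, 10), end = c(5, 20)),
              strand = c("+", "-"))
width(gr)
```

Printing any of these objects at the console gives a compact summary, which is usually the quickest way to see what a new type contains.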

✋      S3 classes

Most base R and CRAN packages use “S3” classes. Examples include data.frame, tibble, and linear model fits.

  • usually a list vector underneath
  • if necessary, peek inside with $
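
A small base R sketch of the points above, using a linear model fit on the built-in cars dataset. The choice of dataset and model is only for illustration.

```r
# Fit a linear model on a built-in dataset
fit <- lm(dist ~ speed, data = cars)

class(fit)       # the S3 class attribute: "lm"
is.list(fit)     # TRUE: the fit is a list underneath
names(fit)[1:3]  # some of the components stored in the list

# Preferred: use an accessor function
coef(fit)

# If necessary: peek inside with $
fit$coefficients
```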


👉👉  S4 classes

Bioconductor uses “S4” classes, including novel classes such as GRanges and S4 equivalents of familiar classes such as DataFrame. Can usually convert to base R types with as.data.frame(), as.list(), as.character(), as.numeric(), as().

  • use accessor functions such as seqnames, start, end, width, nchar
  • if absolutely necessary, peek inside with @
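
To show the difference between an accessor and @ without needing any Bioconductor package, here is a sketch using only the methods package that ships with R. ToyRange and toyWidth are hypothetical names invented for this example, not Bioconductor types or functions.

```r
library(methods)

# ToyRange is a hypothetical class for illustration only
setClass("ToyRange", representation(start = "integer", end = "integer"))

# An accessor is a generic function plus a method for the class
setGeneric("toyWidth", function(x) standardGeneric("toyWidth"))
setMethod("toyWidth", "ToyRange", function(x) x@end - x@start + 1L)

r <- new("ToyRange", start = 3L, end = 5L)

isS4(r)       # TRUE
toyWidth(r)   # preferred: ask through the accessor, gives 3
slotNames(r)  # which slots exist

# If absolutely necessary: direct slot access
r@start
```

Accessors keep working if a package author changes how the data is stored internally. Code that reaches in with @ may break.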


Stuart Lee’s guide to S4 for the perplexed

(do workshop)

Genome browsers

Files in bioinformatics

What | File types | R types
-----|------------|--------
DNA sequence | FASTA [I], FASTQ, 2bit [R] | Biostrings::DNAStringSet, BSgenome [R], rtracklayer::TwoBitFile [R]
Amino acid sequence | FASTA | Biostrings::AAStringSet
Genomic features | GTF [I], GFF [I], BED [I] | GenomicRanges::GRanges, GenomicFeatures::TxDb [R], ensembldb::EnsDb [R]
Read alignments | SAM, BAM [I] | GenomicAlignments::GAlignments, Rsamtools::BamFile [I]
Numeric data along a genome | wiggle, bigWig [R] | list of numeric vectors, IRanges::RleList, rtracklayer::BigWigFile [R]
Variant calls | VCF [I] | VariantAnnotation::VCF
Numeric matrix (gene expression, etc) | CSV, TSV, HDF5 [R] | matrix, HDF5Array [R], DelayedArray [R], SummarizedExperiment [R], SingleCellExperiment [R], …

(plus many more)

 [R] = random access to large files
 [I] = random access with an accompanying index file
    Prefer these file types!

rtracklayer::import() can read many file types.
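
A minimal sketch of import(), assuming the rtracklayer package is installed. The BED file is written to a temporary path and contains a single made-up feature.

```r
library(rtracklayer)

# Write a tiny BED file to a temporary location (made-up feature)
bed_path <- tempfile(fileext = ".bed")
writeLines("chr1\t2\t5\tfeature1", bed_path)

# import() chooses the file format from the file extension
gr <- import(bed_path)
gr

class(gr)
```

The result is a GRanges object, so the accessor functions seqnames, start, end and width all apply.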

R indexing is 1-based

1-based

R, GFF and GTF feature files,
SAM alignment files, …

   +---+---+---+---+---+---+
   | A | C | G | T | A | C |
   +---+---+---+---+---+---+
     1   2   3   4   5   6
             |       |
             |<----->|        
    
    seq[3] == "G"   seq[3:5] == "GTA"

0-based

Python, C,
BED feature files, …

   +---+---+---+---+---+---+
   | A | C | G | T | A | C |
   +---+---+---+---+---+---+
   0   1   2   3   4   5   6
           |           |
           |<--------->|      

   seq[2] == "G"    seq[2:5] == "GTA"


Bioconductor file import functions will convert all data to 1-based.

Need to do own conversion if using read.table(), etc.
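
A base R sketch of that conversion for a BED file read with read.table(). The file contents and column names are made up for illustration. BED starts are 0-based and BED ends are not included in the feature, so only the start needs shifting.

```r
# A made-up BED file: 0-based start, end position not included
bed_path <- tempfile(fileext = ".bed")
writeLines(c("chr1\t2\t5", "chr1\t9\t20"), bed_path)

bed <- read.table(bed_path, sep = "\t",
                  col.names = c("chrom", "start", "end"))

# Convert to 1-based, end-inclusive positions:
# add one to the start, leave the end unchanged
bed$start <- bed$start + 1L
bed

# The first feature now matches the diagram above
dna <- "ACGTAC"
substring(dna, bed$start[1], bed$end[1])   # "GTA"
```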

Reference genomes and annotations

“High-throughput” biological data analysis usually occurs in the context of:


  • a genome assembly ← Updated infrequently. RefSeq, Ensembl, UCSC all use same sequences.

  • gene and transcript annotations ← Updated frequently. Different between RefSeq, Ensembl, UCSC.


Main sources for model organisms:

UCSC was the original web-based genome browser. UCSC’s “KnownGene” gene annotations used to be the cutting edge gene annotation source, but UCSC now relies on other sources for gene annotations. Many file types that remain important today were developed for the UCSC genome browser, such as “bed”, “bigWig”, and “2bit”.

Genome assemblies are released infrequently. GRCh38 (hg38) was released in 2013. The previous assembly, GRCh37 (hg19), was released in 2009. Some people haven’t updated yet, so you will find plenty of data using “hg19” positions! Gene and transcript annotations are updated far more frequently.

As well as the chromosomes in the “primary assembly”, a genome assembly may have further sequences, which may have been added after the initial release:

 • patch sequences: fixes that would change the sizes of chromosomes
 • alt loci: a way to represent alleles, genetic diversity in the species
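
In practice these extra sequences often need to be separated from the main chromosomes. A base R sketch, assuming UCSC-style sequence names in which patches and alt loci carry a "_fix" or "_alt" suffix. The names below are illustrative examples of that style, and other sources name these sequences differently.

```r
# Illustrative sequence names in the UCSC naming style (an assumption)
seq_names <- c("chr1", "chr2", "chrX", "chrM",
               "chr1_KI270762v1_alt",   # an alt locus
               "chr1_KN196472v1_fix")   # a fix patch

# In this naming style the main chromosomes have no underscore
is_main <- !grepl("_", seq_names)
seq_names[is_main]

# Pick out alt loci and fix patches by suffix
seq_names[grepl("_alt$", seq_names)]
seq_names[grepl("_fix$", seq_names)]
```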

(return to workshop)