1 Installing Bioconductor packages

Bioconductor packages are installed using the CRAN package “BiocManager”.

## 1. Install BiocManager to your hard disk:
#
# install.packages("BiocManager")
#
## 2. Use BiocManager to install a BioConductor package to your hard disk:
#
# BiocManager::install("Biostrings")
#
## 3. Load the package into memory in your current R session:
#
# library(Biostrings)
#

To update any out-of-date packages use install() with no arguments. Restart your R session after doing this!

# BiocManager::install()

Bioconductor packages usually have useful documentation in the form of “vignettes”. These are readable on the Bioconductor website https://bioconductor.org/packages/release/BiocViews.html#___Software , from the RStudio “help” pane, or within R:

help(package="Biostrings")
# -> select "User guides, package vignettes, and other documentation"

vignette("BiostringsQuickOverview", package="Biostrings")

2 Main packages used in this workshop

library(Biostrings)      # Provides DNAString, DNAStringSet, etc
library(BSgenome)        # Provides getSeq()
library(GenomicRanges)   # Provides GRanges containing genomic ranges
library(rtracklayer)     # Provides import() and export()
library(Gviz)            # Provides plotting of genomic features

3 Sequences

3.1 DNAString

Package Biostrings offers classes for storing DNA strings, DNAString, amino acid sequences, AAString, or anything else in a BString. These are like character strings, but a variety of biologically meaningful functions can be applied to them.

myseq <- DNAString("CCGCGCACCAAC")
myseq

## 12-letter DNAString object
## seq: CCGCGCACCAAC

class(myseq)

## [1] "DNAString"
## attr(,"package")
## [1] "Biostrings"

reverseComplement(myseq)

## 12-letter DNAString object
## seq: GTTGGTGCGCGG

translate(myseq)

## 4-letter AAString object
## seq: PRTN

subseq(myseq, 3,5)

## 3-letter DNAString object
## seq: GCG

myseq[3:5]

## 3-letter DNAString object
## seq: GCG

as.character(myseq)

## [1] "CCGCGCACCAAC"

You can see a complete set of functions that work with DNAString with:

methods(class="DNAString")

You can see internal details of a class with:

getClass("DNAString")

The “slots”, used to store data internally in these objects, are accessible with, eg, myseq@length. However, wherever possible, we should leave the slots alone and use “methods” to access information in the object, eg length(myseq)! The classes that this class “extends” tell us some further ways we can interact with it.

You can get help on the DNAString class as below. Try this with classes it extends too!

?"DNAString-class"

3.2 DNAStringSet

Often we want to work with a list of sequences, such as chromosomes.

myset <- DNAStringSet( list(chrI=myseq, chrII=DNAString("ACGTACGT")) )
myset

## DNAStringSet object of length 2:
##     width seq                                               names               
## [1]    12 CCGCGCACCAAC                                      chrI
## [2]     8 ACGTACGT                                          chrII

# A DNAStringSet is list-like
myset$chrII

## 8-letter DNAString object
## seq: ACGTACGT

# or myset[["chrII"]]
# or myset[[2]]

3.3 Challenge: sequences

Reverse complement the following DNA sequence. Translate the result to an amino acid sequence.

"GTAGAGTAATATGGA"

Also translate bases 3 to 11 of the reverse complement.

4 Genomic ranges

We may then wish to refer to regions of these sequences, often with an associated strand. This is done with the GRanges type. GRanges builds on IRanges, “integer ranges”. An IRanges has starts and ends. A GRanges additionally has sequence names and strand information.

range1 <- GRanges("chrI", IRanges(start=3,end=5), "+")
range1

## GRanges object with 1 range and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chrI       3-5      +
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

getSeq(myset, range1)

## DNAStringSet object of length 1:
##     width seq
## [1]     3 GCG

range2 <- GRanges("chrI", IRanges(start=3,end=5), "-")
getSeq(myset, range2)

## DNAStringSet object of length 1:
##     width seq
## [1]     3 CGC

Accessing GRanges data:

seqnames(range1)

## factor-Rle of length 1 with 1 run
##   Lengths:    1
##   Values : chrI
## Levels(1): chrI

start(range1)

## [1] 3

end(range1)

## [1] 5

width(range1)

## [1] 3

strand(range1)

## factor-Rle of length 1 with 1 run
##   Lengths: 1
##   Values : +
## Levels(3): + - *

as.data.frame(range1)

##   seqnames start end width strand
## 1     chrI     3   5     3      +

Internals of an “S4” object such as a GRanges can be accessed using @, but this is discouraged. It is better to use the accessor functions above. Observe the completions RStudio offers when you type range1@.

# Look at completions for
# range1@

Further manipulations:

# GRanges are like vectors:
c(range1, range2)

## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chrI       3-5      +
##   [2]     chrI       3-5      -
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

# GRanges can have metadata columns, and are often used like data frames:
mcols(range1)$wobble <- 10
range1

## GRanges object with 1 range and 1 metadata column:
##       seqnames    ranges strand |    wobble
##          <Rle> <IRanges>  <Rle> | <numeric>
##   [1]     chrI       3-5      + |        10
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

mcols(range1)

## DataFrame with 1 row and 1 column
##      wobble
##   <numeric>
## 1        10

range1$wobble

## [1] 10

# A handy way to create a GRanges
as("chrI:3-5:+", "GRanges")

## GRanges object with 1 range and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chrI       3-5      +
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

4.1 Question

Based on what we saw for DNAString, how can we learn more about using GRanges and IRanges objects?

4.2 Visualization

A couple of options for showing ranges are the GViz and ggbio packages. ggbio is a ggplot2 extension. ggbio is a great idea, but (as at late 2021) the package we could use some love to polish and update it. GViz takes a grid graphics approach.

Here we’ll use the GViz package, which is somewhat confusing to use but produces nice graphics out of the box.

options(ucscChromosomeNames=FALSE)
look <- function(features, labels=seq_along(features)) {
    axis_track <- GenomeAxisTrack()
    feature_track <- AnnotationTrack(
        features, id=labels, name="", 
        showFeatureId=TRUE, fontcolor.feature="black")
    plotTracks(
        list(axis_track, feature_track), 
        extend.left=1, extend.right=2)
}

look( c(range1, range2) )

5 Loading files

5.1 Loading sequences

DNA sequences are generally stored in FASTA format, a simple text format. These can be loaded with readDNAStringSet from Biostrings. Let’s load the genome of the Caeonrhabditis elegans worm, obtained from the Ensembl FTP site. This is the “soft masked” (dna_sm) version, with repetitive regions given in lower case, but we won’t be making use of this today.

### After "gunzip"ing, the start of the .fa file looks like this:
# >I dna_sm:chromosome chromosome:WBcel235:I:1:15072434:1 REF
# gcctaagcctaagcctaagcctaagcctaagcctaagcctaagcctaagcctaagcctaa
# gcctaagcctaagcctaagcctaagcctaagcctaagcctaagcctaagcctaagcctaa
# gcctaagcctaagcctaagcctaagcctaagcctaagcctaagcctaagcctaagcctaa
# ...
# ttttttagaaaaattatttttaagaatttttcattttAGGAATATTGTTATTTCAGAAAA
# TAGCTAAATGTGATTTCTGTAATTTTGCCTGCCAAATTCGTGAAATGCAATAAAAATCTA
# ATATCCCTCATCAGTGCGATTTCCGAATCAGTATATTTTTACGTAATAGCTTCTTTGACA
# TCAATAAGTATTTGCCTATATGACTTTAGACTTGAAATTGGCTATTAATGCCAATTTCAT
# ...

seqs <- readDNAStringSet("Caenorhabditis_elegans.WBcel235.dna_sm.toplevel.fa.gz")
seqs

## DNAStringSet object of length 7:
##        width seq                                            names               
## [1] 15072434 GCCTAAGCCTAAGCCTAAGCCT...GGCTTAGGTTTAGGCTTAGGC I dna_sm:chromoso...
## [2] 15279421 CCTAAGCCTAAGCCTAAGCCTA...AGGCTTAGGCTTAGGCTTAGT II dna_sm:chromos...
## [3] 13783801 CCTAAGCCTAAGCCTAAGCCTA...AGGCTTAGGCTTAGGCTTAGG III dna_sm:chromo...
## [4] 17493829 CCTAAGCCTAAGCCTAAGCCTA...AGGCTTAGGCTTAGGCTTAGG IV dna_sm:chromos...
## [5] 20924180 GAATTCCTAAGCCTAAGCCTAA...TTAGGCTTAGGCTTAGGCTTA V dna_sm:chromoso...
## [6] 17718942 CTAAGCCTAAGCCTAAGCCTAA...AGGCTTAGGCTTAGGCTTAGG X dna_sm:chromoso...
## [7]    13794 CAGTAAATAGTTTAATAAAAAT...ACTTTGTATATATCTATATTA MtDNA dna_sm:chro...

# Our chromosome name is too verbose.
# Remove everything from the name after the first space.
names(seqs)

## [1] "I dna_sm:chromosome chromosome:WBcel235:I:1:15072434:1 REF"     
## [2] "II dna_sm:chromosome chromosome:WBcel235:II:1:15279421:1 REF"   
## [3] "III dna_sm:chromosome chromosome:WBcel235:III:1:13783801:1 REF" 
## [4] "IV dna_sm:chromosome chromosome:WBcel235:IV:1:17493829:1 REF"   
## [5] "V dna_sm:chromosome chromosome:WBcel235:V:1:20924180:1 REF"     
## [6] "X dna_sm:chromosome chromosome:WBcel235:X:1:17718942:1 REF"     
## [7] "MtDNA dna_sm:chromosome chromosome:WBcel235:MtDNA:1:13794:1 REF"

names(seqs) <- sub(" .*","",names(seqs))
names(seqs)

## [1] "I"     "II"    "III"   "IV"    "V"     "X"     "MtDNA"

Conversely, a DNAStringSet can be written to a file with writeXStringSet.

5.2 Loading features

Genome annotations are available in a variety of text formats such as GFF3 and GTF. They can be loaded with the import function from rtracklayer. This GTF file is also from Ensembl, and gives the locations of the genes in the genome, and features within them.

### TSV with columns: seqname, source, start, end, score, strand, frame, attributes 
### The start of the .gtf file looks like this:
# #!genome-build WBcel235
# #!genome-version WBcel235
# #!genome-date 2012-12
# #!genome-build-accession GCA_000002985.3
# #!genebuild-last-updated 2019-01
# V       WormBase        gene    9244402 9246360 .       -       .       gene_id "WBGene00000003"; gene_name "aat-2"; gene_source "WormBase"; gene_biotype "protein_coding";
# V       WormBase        transcript      9244402 9246360 .       -       .       gene_id "WBGene00000003"; transcript_id "F07C3.7.1"; gene_name "aat-2"; gene_source "WormBase"; gene_biotype "protein_coding"; transcript_source "WormBase"; transcript_biotype "protein_coding";
# V       WormBase        exon    9246080 9246360 .       -       .       gene_id "WBGene00000003"; transcript_id "F07C3.7.1"; exon_number "1"; gene_name "aat-2"; gene_source "WormBase"; gene_biotype "protein_coding"; transcript_source "WormBase"; transcript_biotype "protein_coding"; exon_id "F07C3.7.1.e1";
# V       WormBase        CDS     9246080 9246352 .       -       0       gene_id "WBGene00000003"; transcript_id "F07C3.7.1"; exon_number "1"; gene_name "aat-2"; gene_source "WormBase"; gene_biotype "protein_coding"; transcript_source "WormBase"; transcript_biotype "protein_coding"; protein_id "F07C3.7.1";
# V       WormBase        start_codon     9246350 9246352 .       -       0       gene_id "WBGene00000003"; transcript_id "F07C3.7.1"; exon_number "1"; gene_name "aat-2"; gene_source "WormBase"; gene_biotype "protein_coding"; transcript_source "WormBase"; transcript_biotype "protein_coding";
# ...

features <- import("Caenorhabditis_elegans.WBcel235.104.gtf.gz")

# Optional: just retain the columns of metadata we need
mcols(features) <- mcols(features)[,c("type","gene_name","gene_id","transcript_biotype","transcript_id")]

features

## GRanges object with 734022 ranges and 5 metadata columns:
##            seqnames          ranges strand |        type   gene_name
##               <Rle>       <IRanges>  <Rle> |    <factor> <character>
##        [1]        V 9244402-9246360      - | gene              aat-2
##        [2]        V 9244402-9246360      - | transcript        aat-2
##        [3]        V 9246080-9246360      - | exon              aat-2
##        [4]        V 9246080-9246352      - | CDS               aat-2
##        [5]        V 9246350-9246352      - | start_codon       aat-2
##        ...      ...             ...    ... .         ...         ...
##   [734018]    MtDNA     10403-11354      + |  transcript     MTCE.33
##   [734019]    MtDNA     10403-11354      + |  exon           MTCE.33
##   [734020]    MtDNA     13275-13327      + |  gene           MTCE.36
##   [734021]    MtDNA     13275-13327      + |  transcript     MTCE.36
##   [734022]    MtDNA     13275-13327      + |  exon           MTCE.36
##                   gene_id transcript_biotype transcript_id
##               <character>        <character>   <character>
##        [1] WBGene00000003               <NA>          <NA>
##        [2] WBGene00000003     protein_coding     F07C3.7.1
##        [3] WBGene00000003     protein_coding     F07C3.7.1
##        [4] WBGene00000003     protein_coding     F07C3.7.1
##        [5] WBGene00000003     protein_coding     F07C3.7.1
##        ...            ...                ...           ...
##   [734018] WBGene00014472               rRNA       MTCE.33
##   [734019] WBGene00014472               rRNA       MTCE.33
##   [734020] WBGene00014473               <NA>          <NA>
##   [734021] WBGene00014473               tRNA       MTCE.36
##   [734022] WBGene00014473               tRNA       MTCE.36
##   -------
##   seqinfo: 7 sequences from an unspecified genome; no seqlengths

6 Seqinfo and genome versions

Various objects have associated Seqinfo information, listing the chromosomes they refer to and their lengths. This allows some extra sanity checking, and is also necessary for some tasks.

seqinfo(features)

## Seqinfo object with 7 sequences from an unspecified genome; no seqlengths:
##   seqnames seqlengths isCircular genome
##   I                NA         NA   <NA>
##   II               NA         NA   <NA>
##   III              NA         NA   <NA>
##   IV               NA         NA   <NA>
##   V                NA         NA   <NA>
##   X                NA         NA   <NA>
##   MtDNA            NA         NA   <NA>

seqinfo(seqs)

## Seqinfo object with 7 sequences from an unspecified genome:
##   seqnames seqlengths isCircular genome
##   I          15072434         NA   <NA>
##   II         15279421         NA   <NA>
##   III        13783801         NA   <NA>
##   IV         17493829         NA   <NA>
##   V          20924180         NA   <NA>
##   X          17718942         NA   <NA>
##   MtDNA         13794         NA   <NA>

myseqinfo <- seqinfo(seqs)
isCircular(myseqinfo) <- c(rep(FALSE,6),TRUE)

seqinfo(features) <- myseqinfo

One thing you need to care about is the genome version. We often refer to genome versions by their name in the UCSC genome browser. For example the two most recent versions of the worm genome are “ce10” from 2010 and “ce11” from 2013. Similarly with human data you may find data for either the “hg19” or “hg38” versions of the human genome.

Seqinfo(genome="ce10")

## Seqinfo object with 7 sequences (1 circular) from ce10 genome:
##   seqnames seqlengths isCircular genome
##   chrI       15072423      FALSE   ce10
##   chrII      15279345      FALSE   ce10
##   chrIII     13783700      FALSE   ce10
##   chrIV      17493793      FALSE   ce10
##   chrV       20924149      FALSE   ce10
##   chrX       17718866      FALSE   ce10
##   chrM          13794       TRUE   ce10

Seqinfo(genome="ce11")

## Seqinfo object with 7 sequences (1 circular) from ce11 genome:
##   seqnames seqlengths isCircular genome
##   chrI       15072434      FALSE   ce11
##   chrII      15279421      FALSE   ce11
##   chrIII     13783801      FALSE   ce11
##   chrIV      17493829      FALSE   ce11
##   chrV       20924180      FALSE   ce11
##   chrX       17718942      FALSE   ce11
##   chrM          13794       TRUE   ce11

Notice the slightly different chromosome lengths! Features from one genome version will not be correctly located on a different genome version.

Our ENSEMBL sequences match “ce11”, also called “WBcel235”. However the chromosome names are slightly different, a common source of pain. See ?seqlevelsStyle.

7 Querying GRanges objects

The metadata columns let us query the GRanges, for example for a feature type:

subset(features, type == "CDS")

# Equivalently:
#   features[features$type == "CDS"]

## GRanges object with 225595 ranges and 5 metadata columns:
##            seqnames            ranges strand |     type   gene_name
##               <Rle>         <IRanges>  <Rle> | <factor> <character>
##        [1]        V   9246080-9246352      - |      CDS       aat-2
##        [2]        V   9245588-9246033      - |      CDS       aat-2
##        [3]        V   9245443-9245539      - |      CDS       aat-2
##        [4]        V   9244683-9245315      - |      CDS       aat-2
##        [5]        V 11470583-11470598      - |      CDS       aat-6
##        ...      ...               ...    ... .      ...         ...
##   [225591]    MtDNA         6506-7732      + |      CDS      nduo-4
##   [225592]    MtDNA         7845-9419      + |      CDS       ctc-1
##   [225593]    MtDNA        9649-10341      + |      CDS       ctc-2
##   [225594]    MtDNA       11356-11688      + |      CDS      nduo-3
##   [225595]    MtDNA       11691-13271      + |      CDS      nduo-5
##                   gene_id transcript_biotype transcript_id
##               <character>        <character>   <character>
##        [1] WBGene00000003     protein_coding     F07C3.7.1
##        [2] WBGene00000003     protein_coding     F07C3.7.1
##        [3] WBGene00000003     protein_coding     F07C3.7.1
##        [4] WBGene00000003     protein_coding     F07C3.7.1
##        [5] WBGene00000007     protein_coding    T11F9.4a.1
##        ...            ...                ...           ...
##   [225591] WBGene00010963     protein_coding     MTCE.25.1
##   [225592] WBGene00010964     protein_coding     MTCE.26.1
##   [225593] WBGene00010965     protein_coding     MTCE.31.1
##   [225594] WBGene00010966     protein_coding     MTCE.34.1
##   [225595] WBGene00010967     protein_coding     MTCE.35.1
##   -------
##   seqinfo: 7 sequences (1 circular) from an unspecified genome

Note: subset is a generic R function. It is similar to dplyr’s filter. The second argument is special, in it you can refer to columns of the GRanges directly.

8 Representation of genes in a GTF file

Let’s have a look at the different feature types in a particular gene.

trx1_features <- subset(features, gene_name == "trx-1")
# Equivalently:
#   features[features$gene_name == "trx-1" & !is.na(features$gene_name)]

trx1_features

## GRanges object with 25 ranges and 5 metadata columns:
##        seqnames          ranges strand |            type   gene_name
##           <Rle>       <IRanges>  <Rle> |        <factor> <character>
##    [1]       II 7752637-7753607      - |     gene              trx-1
##    [2]       II 7752637-7753513      - |     transcript        trx-1
##    [3]       II 7753283-7753513      - |     exon              trx-1
##    [4]       II 7753283-7753489      - |     CDS               trx-1
##    [5]       II 7753487-7753489      - |     start_codon       trx-1
##    ...      ...             ...    ... .             ...         ...
##   [21]       II 7752637-7752885      - | exon                  trx-1
##   [22]       II 7752820-7752885      - | CDS                   trx-1
##   [23]       II 7752817-7752819      - | stop_codon            trx-1
##   [24]       II 7753540-7753607      - | five_prime_utr        trx-1
##   [25]       II 7752637-7752816      - | three_prime_utr       trx-1
##               gene_id transcript_biotype transcript_id
##           <character>        <character>   <character>
##    [1] WBGene00015062               <NA>          <NA>
##    [2] WBGene00015062     protein_coding    B0228.5a.1
##    [3] WBGene00015062     protein_coding    B0228.5a.1
##    [4] WBGene00015062     protein_coding    B0228.5a.1
##    [5] WBGene00015062     protein_coding    B0228.5a.1
##    ...            ...                ...           ...
##   [21] WBGene00015062     protein_coding    B0228.5b.1
##   [22] WBGene00015062     protein_coding    B0228.5b.1
##   [23] WBGene00015062     protein_coding    B0228.5b.1
##   [24] WBGene00015062     protein_coding    B0228.5b.1
##   [25] WBGene00015062     protein_coding    B0228.5b.1
##   -------
##   seqinfo: 7 sequences (1 circular) from an unspecified genome

look(trx1_features, trx1_features$type)

Look at the different types in the “type” column. Each “gene” may have multiple “transcript” features (isoforms). Each transcript in turn has a set of “exon” features, and if it is a protein coding gene, a set of “CDS” (coding sequence) features. The CDS features cover a subset of the bases covered by the exon features. The “gene_id” and “transcript_id” columns let us know which transcript and gene each exon belongs to.

# --------------------------------------------------> gene
# 
# -------------------------------------------->       transcript
# ---------->         --->    ---------------->       exon
#       ---->         --->    ---------->             CDS
# 
# 
#                -----------------------------------> transcript
#                -------->       ---->    ----------> exon
#                     --->       ---->    -->         CDS

Let’s look at this in the Integrative Genome Browser (IGV):

Open IGV.
Select the genome “C. elegans (ce11)” (top left drop-down box).
Load the GTF file (“File” menu, “Load from file…”).
Search for “trx-1” in the location box (next to the “Go” button).

IGV turns all these features into a nice diagram of the gene’s transcripts. The ENSEMBL gene annotation contains 5’UTR and 3’UTR regions, which are missing from the default gene annotations shown by IGV (probably from NCBI’s RefSeq). Gviz or ggbio can be used to produce diagrams like this from within R (with a little work).

9 Further operations on GRanges

💡 Using accessor methods like “start()” and “strand()” frees the authors of GRanges to store data internally using whatever efficiency tricks they want, and even to change this in a new version of Bioconductor. This is good, but it can be better:

💡 Double-stranded DNA has a rotational symmetry. None of the physics of DNA is changed if we look at it rotated end-to-end 180 degrees. Labelling of the strands as “+” and “-” is arbitrary. Also, it is usually not the exact position of features in a genome that is important so much as their relative position. If we restrict outselves to using functions that work just as well with either labelling of strands, and that are not affected by absolute position, our code will more directly express our intention and be less prone to bugs.

9.1 Intra-range

Various useful manipulations of individual ranges are defined.

?"intra-range-methods"

Note: How these make use of the strand is a little haphazard. For example flank(), resize(), and promoters() respect strand but shift() does not.

For example, suppose we are interested in the first 100 bases of a transcript:

feat <- features[2]
feat_initial <- resize(feat, 100, fix="start")

look( c(feat, feat_initial) )

getSeq(seqs, feat_initial)

## DNAStringSet object of length 1:
##     width seq
## [1]   100 ACGCAAAAATGAACGAAAAAGAAGAAGAAGTATC...ACTTTTCAATGGCTGCACAATCATTATCGGAGT

resize can fix either the fix="start" or fix="end" of the sequence.

flank can be either flank the start (start=TRUE) or end (start=FALSE).

#                                  5'     3'
# input                            ------->
#                                  .      .
# resize(input, n, fix="start")    -->    .
# resize(input, n, fix="end")      .    -->
#                                  .      .
# flank(input, n, start=TRUE)   -->.      .
# flank(input, n, start=FALSE)     .      .-->
#                                  .      .
# promoters(input, n1,n2)     -------->   .
#

9.2 Inter-range

?"inter-range-methods"
?"nearest-methods"
?"setops-methods"

One compelling feature of GenomicRanges is that it is able to find overlapping ranges very quickly.

query <- as("II:7752600-7754000:-", "GRanges")
hits <- findOverlaps(query, features, ignore.strand=TRUE)
hits

## Hits object with 35 hits and 0 metadata columns:
##        queryHits subjectHits
##        <integer>   <integer>
##    [1]         1      488078
##    [2]         1      488079
##    [3]         1      488121
##    [4]         1      488124
##    [5]         1      488125
##    ...       ...         ...
##   [31]         1      488255
##   [32]         1      488256
##   [33]         1      488257
##   [34]         1      488258
##   [35]         1      488259
##   -------
##   queryLength: 1 / subjectLength: 734022

subjectHits(hits)

##  [1] 488078 488079 488121 488124 488125 488173 488176 488207 488231 488234
## [11] 488235 488236 488237 488238 488239 488240 488241 488242 488243 488244
## [21] 488245 488246 488247 488248 488249 488250 488251 488252 488253 488254
## [31] 488255 488256 488257 488258 488259

features[subjectHits(hits)]

## GRanges object with 35 ranges and 5 metadata columns:
##        seqnames          ranges strand |            type   gene_name
##           <Rle>       <IRanges>  <Rle> |        <factor> <character>
##    [1]       II 7724903-7752715      + | gene                 cpna-2
##    [2]       II 7724903-7752715      + | transcript           cpna-2
##    [3]       II 7752309-7752715      + | exon                 cpna-2
##    [4]       II 7752429-7752715      + | three_prime_utr      cpna-2
##    [5]       II 7724903-7752715      + | transcript           cpna-2
##    ...      ...             ...    ... .             ...         ...
##   [31]       II 7752637-7752885      - | exon                  trx-1
##   [32]       II 7752820-7752885      - | CDS                   trx-1
##   [33]       II 7752817-7752819      - | stop_codon            trx-1
##   [34]       II 7753540-7753607      - | five_prime_utr        trx-1
##   [35]       II 7752637-7752816      - | three_prime_utr       trx-1
##               gene_id transcript_biotype transcript_id
##           <character>        <character>   <character>
##    [1] WBGene00015061               <NA>          <NA>
##    [2] WBGene00015061     protein_coding    B0228.4a.1
##    [3] WBGene00015061     protein_coding    B0228.4a.1
##    [4] WBGene00015061     protein_coding    B0228.4a.1
##    [5] WBGene00015061     protein_coding    B0228.4c.1
##    ...            ...                ...           ...
##   [31] WBGene00015062     protein_coding    B0228.5b.1
##   [32] WBGene00015062     protein_coding    B0228.5b.1
##   [33] WBGene00015062     protein_coding    B0228.5b.1
##   [34] WBGene00015062     protein_coding    B0228.5b.1
##   [35] WBGene00015062     protein_coding    B0228.5b.1
##   -------
##   seqinfo: 7 sequences (1 circular) from an unspecified genome

findOverlaps(query, features, ignore.strand=FALSE)

## Hits object with 25 hits and 0 metadata columns:
##        queryHits subjectHits
##        <integer>   <integer>
##    [1]         1      488235
##    [2]         1      488236
##    [3]         1      488237
##    [4]         1      488238
##    [5]         1      488239
##    ...       ...         ...
##   [21]         1      488255
##   [22]         1      488256
##   [23]         1      488257
##   [24]         1      488258
##   [25]         1      488259
##   -------
##   queryLength: 1 / subjectLength: 734022

With findOverlaps, we can use genomic location as the key when joining disparate types of data, so this is an important tool for integrative analysis. See also the related functions nearest, precede, follow, and distance.

GenomicRanges also provides:

range - get a feature that spans from the start to the end of all features in a GRanges.
reduce - merge overlapping features, so that the same bases are covered by a reduced collection of features.
disjoin - as with reduce, but broken at each start and end of the input features.
setdiff - subtracts one set of features from another, could be used with range on a set of exons to get introns. Might need to use GenomicRanges::setdiff if also using dplyr.

# input
#         --------->
#              ----------------->
#                      ------>
#                                     ---->
#                                     
# range   -------------------------------->
# 
# reduce  ----------------------->    ---->
# 
# disjoin ---->---->-->------>--->    ---->
# 
# setdiff(range(input),input)     --->

trx1_exons <- subset(trx1_features, type == "exon")
look( trx1_exons )

look( range(trx1_exons) )

look( reduce(trx1_exons) )

look( disjoin(trx1_exons) )

9.3 Challenge: transcript initiation

transcripts <- subset(features, type == "transcript")

Make a GRanges of all transcripts with a biotype of “protein_coding”.
Make a GRanges of the two bases immediately preceding each of these transcripts.
Get the DNA sequences of the ranges from part 2. Use table() to see if any sequences are particularly common.

Advanced: Get the four bases spanning the start of the transcript (two upstrand of the start and two downstrand of the start).

Note: In C. elegans transcript initiation is complicated by trans-splicing. I wonder what exactly the 5’ ends of these transcript annotations represent?

10 Example: Examining the PolyAdenylation Signal (PAS)

We’ll now look at an example where the location of the thing we’re looking for is more uncertain.

We suspect there is some motif that causes cleavage and polyadenylation of RNA transcripts, close to the 3’ end of each transcript. Bioconductor will let us explore sequence around the ends of transcripts. It can also help prepare data for command-line software such as the MEME Suite.

We’ll limit our attention to protein coding transcripts.

transcripts <- subset(features, 
    type == "transcript" & transcript_biotype == "protein_coding")

ends <- resize(transcripts, width=30, fix="end")
end_seqs <- getSeq(seqs, ends)
names(end_seqs) <- ends$transcript_id

We can check for general biasses in base composition:

library(seqLogo)  

letter_counts <- consensusMatrix(end_seqs)

props <- prop.table(letter_counts[1:4,], 2)
seqLogo(props, ic.scale=FALSE)

seqLogo(props)

Saving the sequences to a file lets us apply command-line tools:

writeXStringSet(end_seqs, "end_seqs.fasta")

This is suitable for use with, for example, the streme tool from MEME.

# streme --rna --minw 4 --maxw 15 --p end_seqs.fasta

Ok ok, so actually there is known PolyAdenylation Signal (PAS) motif, canonically “AAUAAA”. Let’s now look for this known motif.

counts  <- vcountPattern("AATAAA", end_seqs)
table(counts)

## counts
##     0     1     2     3     4     5 
## 22521 10500   501    27     2     1

matches <- vmatchPattern("AATAAA", end_seqs)
plot( tabulate(start(unlist(matches)),30), xlab="start position",ylab="matches")

We could also look for this pattern in the genome in general. Here is some code to find matches on either strand, using vmatchPattern().

query <- DNAString("AATAAA")
max.mismatch <- 0

fwd <- vmatchPattern(query, seqs, max.mismatch=max.mismatch)
fwd <- as(fwd, "GRanges")
strand(fwd) <- "+"

rev <- vmatchPattern(reverseComplement(query), seqs, max.mismatch=max.mismatch)
rev <- as(rev, "GRanges")
strand(rev) <- "-"

matches <- c(fwd, rev)
matches

## GRanges object with 276631 ranges and 0 metadata columns:
##            seqnames      ranges strand
##               <Rle>   <IRanges>  <Rle>
##        [1]        I   1309-1314      +
##        [2]        I   2406-2411      +
##        [3]        I   2577-2582      +
##        [4]        I   2649-2654      +
##        [5]        I   2653-2658      +
##        ...      ...         ...    ...
##   [276627]    MtDNA 13477-13482      -
##   [276628]    MtDNA 13520-13525      -
##   [276629]    MtDNA 13563-13568      -
##   [276630]    MtDNA 13606-13611      -
##   [276631]    MtDNA 13651-13656      -
##   -------
##   seqinfo: 7 sequences from an unspecified genome; no seqlengths

Once we have a set of ranges, they can be related to other sets of ranges. For example, to reproduce the earlier result:

overlaps <- findOverlaps(matches, ends, type="within")
table(table( subjectHits(overlaps) ))

## 
##     1     2     3     4     5 
## 10500   501    27     2     1

The matches can be examined in IGV if saved in a GFF file. With a “tabix” index, IGV avoids the need to load the whole file.

export(matches, "motif-matches.gff", index=TRUE)

The pairwiseAlignment() function provides more flexibility than vmatchPattern(), if needed. Command-line tools such as BLAST will be faster for longer query sequences.

Similar exploration can be performed around other regions of interest, such as peaks identified from ChIP. R may either provide a complete solution or serve as the glue plugging command-line tools together.

11 BAMs and bigWigs

import() can read various other file types. This section will demonstrate reading read alignments from a BAM file and producing a bigWig depth of coverage file. The reads in the BAM file used here is a small sample from GSE57993 sample “N2 Rep1”. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE57993

Both BAM and bigWig files are designed so that specific regions can be accessed efficiently. Genome browsers can use this to only load data as needed. BigWig files also contain versions of their data at different resolutions, so a zoomed-out view can be quickly loaded. Bioconductor packages will also usually offer some way to load specific regions. This is useful because these files can be very large.

alignments <- import("example.bam")
alignments

## GAlignments object with 168060 alignments and 0 metadata columns:
##            seqnames strand       cigar    qwidth     start       end     width
##               <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
##        [1]        I      -        118M       118      2574      2691       118
##        [2]        I      -        131M       131      4120      4250       131
##        [3]        I      -         64M        64      4120      4183        64
##        [4]        I      +        151M       151     16596     16746       151
##        [5]        I      +         30M        30     16614     16643        30
##        ...      ...    ...         ...       ...       ...       ...       ...
##   [168056]    MtDNA      +         61M        61     13734     13794        61
##   [168057]    MtDNA      +         61M        61     13734     13794        61
##   [168058]    MtDNA      +         61M        61     13734     13794        61
##   [168059]    MtDNA      +         51M        51     13744     13794        51
##   [168060]    MtDNA      +         21M        21     13774     13794        21
##                njunc
##            <integer>
##        [1]         0
##        [2]         0
##        [3]         0
##        [4]         0
##        [5]         0
##        ...       ...
##   [168056]         0
##   [168057]         0
##   [168058]         0
##   [168059]         0
##   [168060]         0
##   -------
##   seqinfo: 7 sequences from an unspecified genome

depth <- coverage(alignments)
depth

## RleList of length 7
## $I
## integer-Rle of length 15072434 with 15925 runs
##   Lengths:  2573   118  1428    64    67 ...     3     4   426   151  1476
##   Values :     0     1     0     2     1 ... 17649 17648     0     1     0
## 
## $II
## integer-Rle of length 15279421 with 11569 runs
##   Lengths:  5498    33 10382    91  9048 ...    45     6 26915   114  7453
##   Values :     0     1     0     1     0 ...     3     2     0     1     0
## 
## $III
## integer-Rle of length 13783801 with 14706 runs
##   Lengths: 17253    86  3187     6    30 ...   116    47  2034   151  2709
##   Values :     0     1     0     3     4 ...     0     1     0     1     0
## 
## $IV
## integer-Rle of length 17493829 with 13544 runs
##   Lengths: 38992   128 69850    30  1689 ...     1     5     3    21 23096
##   Values :     0     1     0     1     0 ...     9     8     2     1     0
## 
## $V
## integer-Rle of length 20924180 with 12136 runs
##   Lengths:  6583     3     2     2     5 ...    72    25  8185    67 16741
##   Values :     0     2     9    10    16 ...     0     1     0     1     0
## 
## ...
## <2 more elements>

# Depth of coverage for gene rps-12
plot(depth$III[5679200:5680200], type="l")

export(depth, "depth.bw")

Explore the documentation for “GAlignments”, “Rle”, and “RleList”. “Rle” and “RleList” objects are memory-efficient vectors and lists of vectors. They support arithmetic operation such as scaling with * and adding with +. Rather neat! Sometimes a function won’t support “Rle”s. When this happens, an Rle can be converted to a conventional vector with as.numeric().

Filtered import of a BAM file can be performed with a suitable use of ScanBamParam from the Rsamtools package, import("example.bam", param=ScanBamParam( ... )).

Specific strand.
Specific region of reference genome.
Specific set of single cells, based on CB attribute.

Try loading the “motif-matches.gff”, “example.bam”, and “depth.bw” files into IGV. Look for example at gene “rps-12”.

(return to slideshow)

12 Genome and annotation resources

Besides software, Bioconductor includes various types of reference data.

12.1 Example packages

BSgenome.Hsapiens.UCSC.hg38 Biostrings genome, Homo sapiens, from the UCSC browser, version hg38. DNA for chromosomes, usable in the same way as the DNAStringSet used above.
TxDb.Hsapiens.UCSC.hg38.knownGene Transcript database, Homo sapiens, from UCSC browser, genome verison hg38, “knownGene” gene annotations. GRanges information for genes and transcripts, much as we loaded from a GTF file above.
org.Hs.eg.db Organism Homo sapiens, primary key is Entrez Gene, database. Translation of gene ids from various databases, assignment to GO terms, KEGG pathways, etc. Entrez Gene ids are used as the primary key.
GO.db GO Gene Ontology term descriptions and relationships. However to find which terms apply to specific genes, use the GOALL column in the relevant species’ organism database.

Loading any of these packages gives you an AnnotationDb object (defined in the package AnnotationDbi). The main way to query these objects is with the select() function. mapIds() can be used if you need to enforce a one-to-one mapping.

library(org.Ce.eg.db)

## Loading required package: AnnotationDbi

## Loading required package: Biobase

## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.

##

keytypes(org.Ce.eg.db)

##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
## [11] "GO"           "GOALL"        "ONTOLOGY"     "ONTOLOGYALL"  "PATH"        
## [16] "PMID"         "REFSEQ"       "SYMBOL"       "UNIPROT"      "WORMBASE"

columns(org.Ce.eg.db)

##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
## [11] "GO"           "GOALL"        "ONTOLOGY"     "ONTOLOGYALL"  "PATH"        
## [16] "PMID"         "REFSEQ"       "SYMBOL"       "UNIPROT"      "WORMBASE"

AnnotationDbi::select(org.Ce.eg.db,
    keys="trx-1", keytype="SYMBOL", columns=c("ENSEMBL","ENTREZID"))

## 'select()' returned 1:1 mapping between keys and columns

##   SYMBOL        ENSEMBL ENTREZID
## 1  trx-1 WBGene00015062   181863

head( AnnotationDbi::select(org.Ce.eg.db, 
    keys="WBGene00015062", keytype="ENSEMBL", columns="GOALL") )

## 'select()' returned 1:many mapping between keys and columns

##          ENSEMBL      GOALL EVIDENCEALL ONTOLOGYALL
## 1 WBGene00015062 GO:0003674         IDA          MF
## 2 WBGene00015062 GO:0003674         IEA          MF
## 3 WBGene00015062 GO:0003824         IDA          MF
## 4 WBGene00015062 GO:0003824         IEA          MF
## 5 WBGene00015062 GO:0005575         IDA          CC
## 6 WBGene00015062 GO:0005622         IDA          CC

library(GO.db)

##

AnnotationDbi::select(GO.db, 
    keys="GO:0005622", keytype="GOID", columns="TERM")

## 'select()' returned 1:1 mapping between keys and columns

##         GOID                               TERM
## 1 GO:0005622 intracellular anatomical structure

12.2 Transcript database objects

We’ve been using our genomic features as one big unstructured GRanges. This is messy. TxDb objects offer a more structured representation of genes, transcripts, exons, and coding sequences.

library(GenomicFeatures)

## Warning: package 'GenomicFeatures' was built under R version 4.1.1

txdb <- makeTxDbFromGRanges(features)
txdb

## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Genome: NA
## # Nb of transcripts: 61451
## # Db created by: GenomicFeatures package from Bioconductor
## # Creation time: 2021-09-24 04:45:52 +1000 (Fri, 24 Sep 2021)
## # GenomicFeatures version at creation time: 1.45.2
## # RSQLite version at creation time: 2.2.8
## # DBSCHEMAVERSION: 1.2

txdb is a “TxDb” object. Some ways to extract data from a TxDb are:

genes(txdb)
transcriptsBy(txdb, by="gene")
exonsBy(txdb, use.names=TRUE)
cdsBy(txdb, use.names=TRUE)

cds_ranges <- cdsBy(txdb, use.names=TRUE)
cds_ranges$B0228.5a.1
cds_ranges[["B0228.5a.1"]]
unlist(cds_ranges)

cds_ranges here is a “GRangesList”. That is, a list containing GRanges objects. To get the transcript sequence or the coding sequence, these each need to be retrieved and then concatenated together. extractTranscriptSeqs() can do this for us.

extractTranscriptSeqs(seqs, cds_ranges)

## DNAStringSet object of length 33552:
##         width seq                                           names               
##     [1]   372 ATGAAACTCGTAATTCTGCTA...GCTCGTAATCAATTCCAATAA Y74C9A.2a.3
##     [2]   372 ATGAAACTCGTAATTCTGCTA...GCTCGTAATCAATTCCAATAA Y74C9A.2a.1
##     [3]   372 ATGAAACTCGTAATTCTGCTA...GCTCGTAATCAATTCCAATAA Y74C9A.2a.4
##     [4]   372 ATGAAACTCGTAATTCTGCTA...GCTCGTAATCAATTCCAATAA Y74C9A.2a.5
##     [5]   372 ATGAAACTCGTAATTCTGCTA...GCTCGTAATCAATTCCAATAA Y74C9A.2a.2
##     ...   ... ...
## [33548]  1227 TTGTTAGAATTTCTTTTTATC...TTTTGATTAAGTGTGTTTTAC MTCE.25.1
## [33549]  1575 ATTAATCTTTATAAAAAATAT...AGAACTACTAGATTAAAAAAT MTCE.26.1
## [33550]   693 ATTAATAATTTTTTTCAAGGA...TGATGTTTTGGTACTATAGAA MTCE.31.1
## [33551]   333 ATTTTAGTGCTATTAATAGTT...GGTAAATTAGTTTGAGTAATT MTCE.34.1
## [33552]  1581 ATTAATATTTCTATTTTTTTG...TTTATTTTTTTTATAATTTGT MTCE.35.1

A TxDb can also be accessed like an AnnotationDb.

keytypes(txdb)

## [1] "CDSID"    "CDSNAME"  "EXONID"   "EXONNAME" "GENEID"   "TXID"     "TXNAME"

columns(txdb)

##  [1] "CDSCHROM"   "CDSEND"     "CDSID"      "CDSNAME"    "CDSPHASE"  
##  [6] "CDSSTART"   "CDSSTRAND"  "EXONCHROM"  "EXONEND"    "EXONID"    
## [11] "EXONNAME"   "EXONRANK"   "EXONSTART"  "EXONSTRAND" "GENEID"    
## [16] "TXCHROM"    "TXEND"      "TXID"       "TXNAME"     "TXSTART"   
## [21] "TXSTRAND"   "TXTYPE"

Under the hood these are all SQLite databases. select() joins tables together as needed to answer a query.

DBI::dbListTables( dbconn(txdb) )

## [1] "cds"        "chrominfo"  "exon"       "gene"       "metadata"  
## [6] "splicing"   "transcript"

DBI::dbListFields( dbconn(txdb), "transcript" )

## [1] "_tx_id"    "tx_name"   "tx_type"   "tx_chrom"  "tx_strand" "tx_start" 
## [7] "tx_end"

There’s much more to explore here. Have a look at the documentation for the GenomicFeatures and ensembldb packages. To annotate a set of genomic features such as peaks from a ChIP-seq experiment, see for example the ChIPseeker package.

12.3 biomaRt

BioMart servers accessed using the biomaRt package are another way to get information such as translation of gene ids, gene sets, and gene information. For reproducibility, if using the ENSEMBL servers, make sure to specify a specific version of Ensembl to use. Your scripts may fail in future if the server you are using goes down or changes how it is accessed.

library(biomaRt)

mart <- useEnsembl(biomart = "genes", dataset = "celegans_gene_ensembl", version=104)

getBM(
    attributes=c("external_gene_name", "ensembl_gene_id", "entrezgene_id"), 
    filters="external_gene_name", values="trx-1", mart=mart)

12.4 AnnotationHub

AnnotationHub is a way to retrieve data from a more comprehensive set of organisms and data providers than are provided as individual Bioconductor packages. The retrieved data is returned in an appropriate Bioconductor type. If data is being updated over time (eg improved annotation of a genome), each version receives a unique ID in AnnotationHub, making it possible to write reproducible analyses.

Files are cached, so they will only be downloaded once.

In the example below, the yeast genome and Ensembl annotations are retrieved:

library(AnnotationHub)
ah <- AnnotationHub()

# ah contains a large collection of records that can be retrieved
ah
length(ah)
colnames( mcols(ah) )
table( ah$rdataclass )

# There is an interactive Shiny search interface
display(ah)

# query() searches for terms in an unstructured way
query(ah, c("Ensembl", "96", "Saccharomyces cerevisiae"))

# Having located records of interest,
# your R script can refer to the specific AH... record,
# so it always uses the same version of the data.
sc_genome  <- ah[["AH70449"]]
sc_granges <- ah[["AH69700"]]
sc_txdb    <- ah[["AH69265"]]

# sc_genome is a TwoBitFile
# Can use getSeq on it without loading everything into memory
seqinfo(sc_genome)
getSeq(sc_genome, as("IV:1-10","GRanges"))
import(sc_genome)

13 Package versions used

Use sessionInfo() to check the versions of packages used.

sessionInfo()

## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Mojave 10.14.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
## 
## attached base packages:
## [1] grid      stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] GenomicFeatures_1.45.2 GO.db_3.13.0           org.Ce.eg.db_3.13.0   
##  [4] AnnotationDbi_1.55.1   Biobase_2.53.0         seqLogo_1.59.0        
##  [7] Gviz_1.37.2            BSgenome_1.61.0        rtracklayer_1.53.1    
## [10] GenomicRanges_1.45.0   Biostrings_2.61.2      GenomeInfoDb_1.29.8   
## [13] XVector_0.33.0         IRanges_2.27.2         S4Vectors_0.31.3      
## [16] BiocGenerics_0.39.2   
## 
## loaded via a namespace (and not attached):
##   [1] colorspace_2.0-2            rjson_0.2.20               
##   [3] ellipsis_0.3.2              biovizBase_1.41.0          
##   [5] htmlTable_2.2.1             base64enc_0.1-3            
##   [7] dichromat_2.0-0             rstudioapi_0.13            
##   [9] bit64_4.0.5                 fansi_0.5.0                
##  [11] xml2_1.3.2                  splines_4.1.0              
##  [13] cachem_1.0.6                knitr_1.34                 
##  [15] Formula_1.2-4               jsonlite_1.7.2             
##  [17] Rsamtools_2.9.1             cluster_2.1.2              
##  [19] dbplyr_2.1.1                png_0.1-7                  
##  [21] compiler_4.1.0              httr_1.4.2                 
##  [23] backports_1.2.1             assertthat_0.2.1           
##  [25] Matrix_1.3-4                fastmap_1.1.0              
##  [27] lazyeval_0.2.2              htmltools_0.5.2            
##  [29] prettyunits_1.1.1           tools_4.1.0                
##  [31] gtable_0.3.0                glue_1.4.2                 
##  [33] GenomeInfoDbData_1.2.6      dplyr_1.0.7                
##  [35] rappdirs_0.3.3              Rcpp_1.0.7                 
##  [37] jquerylib_0.1.4             vctrs_0.3.8                
##  [39] xfun_0.26                   stringr_1.4.0              
##  [41] lifecycle_1.0.0             restfulr_0.0.13            
##  [43] ensembldb_2.17.4            XML_3.99-0.8               
##  [45] zlibbioc_1.39.0             scales_1.1.1               
##  [47] VariantAnnotation_1.39.0    hms_1.1.0                  
##  [49] MatrixGenerics_1.5.4        ProtGenerics_1.25.1        
##  [51] parallel_4.1.0              SummarizedExperiment_1.23.4
##  [53] AnnotationFilter_1.17.1     RColorBrewer_1.1-2         
##  [55] yaml_2.2.1                  curl_4.3.2                 
##  [57] memoise_2.0.0               gridExtra_2.3              
##  [59] ggplot2_3.3.5               sass_0.4.0                 
##  [61] biomaRt_2.49.4              rpart_4.1-15               
##  [63] latticeExtra_0.6-29         stringi_1.7.4              
##  [65] RSQLite_2.2.8               highr_0.9                  
##  [67] BiocIO_1.3.0                checkmate_2.0.0            
##  [69] filelock_1.0.2              BiocParallel_1.27.9        
##  [71] rlang_0.4.11                pkgconfig_2.0.3            
##  [73] matrixStats_0.61.0          bitops_1.0-7               
##  [75] evaluate_0.14               lattice_0.20-44            
##  [77] purrr_0.3.4                 GenomicAlignments_1.29.0   
##  [79] htmlwidgets_1.5.4           bit_4.0.4                  
##  [81] tidyselect_1.1.1            magrittr_2.0.1             
##  [83] R6_2.5.1                    generics_0.1.0             
##  [85] Hmisc_4.5-0                 DelayedArray_0.19.3        
##  [87] DBI_1.1.1                   pillar_1.6.2               
##  [89] foreign_0.8-81              survival_3.2-13            
##  [91] KEGGREST_1.33.0             RCurl_1.98-1.5             
##  [93] nnet_7.3-16                 tibble_3.1.4               
##  [95] crayon_1.4.1                utf8_1.2.2                 
##  [97] BiocFileCache_2.1.1         rmarkdown_2.11             
##  [99] jpeg_0.1-9                  progress_1.2.2             
## [101] data.table_1.14.0           blob_1.2.2                 
## [103] digest_0.6.27               munsell_0.5.0              
## [105] bslib_0.3.0

Home

Workshop - Working with DNA sequences and features in R with Bioconductor - version 2