Installing Bioconductor packages
Bioconductor packages are installed using the CRAN package “BiocManager”.
## 1. Install BiocManager to your hard disk:
#
# install.packages("BiocManager")
#
## 2. Use BiocManager to install a BioConductor package to your hard disk:
#
# BiocManager::install("Biostrings")
#
## 3. Load the package into memory in your current R session:
#
# library(Biostrings)
#
To update any out-of-date packages use install()
with no arguments. Restart your R session after doing this!
# BiocManager::install()
Bioconductor packages usually have useful documentation in the form of “vignettes”. These are readable on the Bioconductor website https://bioconductor.org/packages/release/BiocViews.html#___Software , from the RStudio “help” pane, or within R:
help(package="Biostrings")
# -> select "User guides, package vignettes, and other documentation"
vignette("BiostringsQuickOverview", package="Biostrings")
Main packages used in this workshop
library(Biostrings) # Provides DNAString, DNAStringSet, etc
library(BSgenome) # Provides getSeq()
library(GenomicRanges) # Provides GRanges containing genomic ranges
library(rtracklayer) # Provides import() and export()
library(Gviz) # Provides plotting of genomic features
Sequences
DNAString
Package Biostrings
offers classes for storing DNA strings, DNAString
, amino acid sequences, AAString
, or anything else in a BString
. These are like character strings, but a variety of biologically meaningful functions can be applied to them.
myseq <- DNAString("CCGCGCACCAAC")
myseq
## 12-letter DNAString object
## seq: CCGCGCACCAAC
class(myseq)
## [1] "DNAString"
## attr(,"package")
## [1] "Biostrings"
reverseComplement(myseq)
## 12-letter DNAString object
## seq: GTTGGTGCGCGG
translate(myseq)
## 4-letter AAString object
## seq: PRTN
subseq(myseq, 3,5)
## 3-letter DNAString object
## seq: GCG
myseq[3:5]
## 3-letter DNAString object
## seq: GCG
as.character(myseq)
## [1] "CCGCGCACCAAC"
You can see a complete set of functions that work with DNAString with:
methods(class="DNAString")
You can see internal details of a class with:
getClass("DNAString")
The “slots”, used to store data internally in these objects, are accessible with, eg, myseq@length
. However, wherever possible, we should leave the slots alone and use “methods” to access information in the object, eg length(myseq)
! The classes that this class “extends” tell us some further ways we can interact with it.
You can get help on the DNAString
class as below. Try this with classes it extends too!
?"DNAString-class"
DNAStringSet
Often we want to work with a list of sequences, such as chromosomes.
myset <- DNAStringSet( list(chrI=myseq, chrII=DNAString("ACGTACGT")) )
myset
## DNAStringSet object of length 2:
## width seq names
## [1] 12 CCGCGCACCAAC chrI
## [2] 8 ACGTACGT chrII
# A DNAStringSet is list-like
myset$chrII
## 8-letter DNAString object
## seq: ACGTACGT
# or myset[["chrII"]]
# or myset[[2]]
Challenge: sequences
- Reverse complement the following DNA sequence. Translate the result to an amino acid sequence.
"GTAGAGTAATATGGA"
- Also translate bases 3 to 11 of the reverse complement.
Genomic ranges
We may then wish to refer to regions of these sequences, often with an associated strand. This is done with the GRanges
type. GRanges
builds on IRanges
, “integer ranges”. An IRanges has starts and ends. A GRanges additionally has sequence names and strand information.
range1 <- GRanges("chrI", IRanges(start=3,end=5), "+")
range1
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chrI 3-5 +
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
getSeq(myset, range1)
## DNAStringSet object of length 1:
## width seq
## [1] 3 GCG
range2 <- GRanges("chrI", IRanges(start=3,end=5), "-")
getSeq(myset, range2)
## DNAStringSet object of length 1:
## width seq
## [1] 3 CGC
Accessing GRanges data:
seqnames(range1)
## factor-Rle of length 1 with 1 run
## Lengths: 1
## Values : chrI
## Levels(1): chrI
start(range1)
## [1] 3
end(range1)
## [1] 5
width(range1)
## [1] 3
strand(range1)
## factor-Rle of length 1 with 1 run
## Lengths: 1
## Values : +
## Levels(3): + - *
as.data.frame(range1)
## seqnames start end width strand
## 1 chrI 3 5 3 +
Internals of an “S4” object such as a GRanges can be accessed using @
, but this is discouraged. It is better to use the accessor functions above. Observe the completions RStudio offers when you type range1@
.
# Look at completions for
# range1@
Further manipulations:
# GRanges are like vectors:
c(range1, range2)
## GRanges object with 2 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chrI 3-5 +
## [2] chrI 3-5 -
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
# GRanges can have metadata columns, and are often used like data frames:
mcols(range1)$wobble <- 10
range1
## GRanges object with 1 range and 1 metadata column:
## seqnames ranges strand | wobble
## <Rle> <IRanges> <Rle> | <numeric>
## [1] chrI 3-5 + | 10
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
mcols(range1)
## DataFrame with 1 row and 1 column
## wobble
## <numeric>
## 1 10
range1$wobble
## [1] 10
# A handy way to create a GRanges
as("chrI:3-5:+", "GRanges")
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chrI 3-5 +
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Question
Based on what we saw for DNAString
, how can we learn more about using GRanges
and IRanges
objects?
Visualization
A couple of options for showing ranges are the GViz
and ggbio
packages. ggbio
is a ggplot2
extension. ggbio
is a great idea, but (as at late 2021) the package we could use some love to polish and update it. GViz
takes a grid
graphics approach.
Here we’ll use the GViz
package, which is somewhat confusing to use but produces nice graphics out of the box.
options(ucscChromosomeNames=FALSE)
look <- function(features, labels=seq_along(features)) {
axis_track <- GenomeAxisTrack()
feature_track <- AnnotationTrack(
features, id=labels, name="",
showFeatureId=TRUE, fontcolor.feature="black")
plotTracks(
list(axis_track, feature_track),
extend.left=1, extend.right=2)
}
look( c(range1, range2) )
Loading files
Loading sequences
DNA sequences are generally stored in FASTA format, a simple text format. These can be loaded with readDNAStringSet
from Biostrings
. Let’s load the genome of the Caeonrhabditis elegans worm, obtained from the Ensembl FTP site. This is the “soft masked” (dna_sm) version, with repetitive regions given in lower case, but we won’t be making use of this today.
### After "gunzip"ing, the start of the .fa file looks like this:
# >I dna_sm:chromosome chromosome:WBcel235:I:1:15072434:1 REF
# gcctaagcctaagcctaagcctaagcctaagcctaagcctaagcctaagcctaagcctaa
# gcctaagcctaagcctaagcctaagcctaagcctaagcctaagcctaagcctaagcctaa
# gcctaagcctaagcctaagcctaagcctaagcctaagcctaagcctaagcctaagcctaa
# ...
# ttttttagaaaaattatttttaagaatttttcattttAGGAATATTGTTATTTCAGAAAA
# TAGCTAAATGTGATTTCTGTAATTTTGCCTGCCAAATTCGTGAAATGCAATAAAAATCTA
# ATATCCCTCATCAGTGCGATTTCCGAATCAGTATATTTTTACGTAATAGCTTCTTTGACA
# TCAATAAGTATTTGCCTATATGACTTTAGACTTGAAATTGGCTATTAATGCCAATTTCAT
# ...
seqs <- readDNAStringSet("Caenorhabditis_elegans.WBcel235.dna_sm.toplevel.fa.gz")
seqs
## DNAStringSet object of length 7:
## width seq names
## [1] 15072434 GCCTAAGCCTAAGCCTAAGCCT...GGCTTAGGTTTAGGCTTAGGC I dna_sm:chromoso...
## [2] 15279421 CCTAAGCCTAAGCCTAAGCCTA...AGGCTTAGGCTTAGGCTTAGT II dna_sm:chromos...
## [3] 13783801 CCTAAGCCTAAGCCTAAGCCTA...AGGCTTAGGCTTAGGCTTAGG III dna_sm:chromo...
## [4] 17493829 CCTAAGCCTAAGCCTAAGCCTA...AGGCTTAGGCTTAGGCTTAGG IV dna_sm:chromos...
## [5] 20924180 GAATTCCTAAGCCTAAGCCTAA...TTAGGCTTAGGCTTAGGCTTA V dna_sm:chromoso...
## [6] 17718942 CTAAGCCTAAGCCTAAGCCTAA...AGGCTTAGGCTTAGGCTTAGG X dna_sm:chromoso...
## [7] 13794 CAGTAAATAGTTTAATAAAAAT...ACTTTGTATATATCTATATTA MtDNA dna_sm:chro...
# Our chromosome name is too verbose.
# Remove everything from the name after the first space.
names(seqs)
## [1] "I dna_sm:chromosome chromosome:WBcel235:I:1:15072434:1 REF"
## [2] "II dna_sm:chromosome chromosome:WBcel235:II:1:15279421:1 REF"
## [3] "III dna_sm:chromosome chromosome:WBcel235:III:1:13783801:1 REF"
## [4] "IV dna_sm:chromosome chromosome:WBcel235:IV:1:17493829:1 REF"
## [5] "V dna_sm:chromosome chromosome:WBcel235:V:1:20924180:1 REF"
## [6] "X dna_sm:chromosome chromosome:WBcel235:X:1:17718942:1 REF"
## [7] "MtDNA dna_sm:chromosome chromosome:WBcel235:MtDNA:1:13794:1 REF"
names(seqs) <- sub(" .*","",names(seqs))
names(seqs)
## [1] "I" "II" "III" "IV" "V" "X" "MtDNA"
Conversely, a DNAStringSet can be written to a file with writeXStringSet
.
Loading features
Genome annotations are available in a variety of text formats such as GFF3 and GTF. They can be loaded with the import
function from rtracklayer
. This GTF file is also from Ensembl, and gives the locations of the genes in the genome, and features within them.
### TSV with columns: seqname, source, start, end, score, strand, frame, attributes
### The start of the .gtf file looks like this:
# #!genome-build WBcel235
# #!genome-version WBcel235
# #!genome-date 2012-12
# #!genome-build-accession GCA_000002985.3
# #!genebuild-last-updated 2019-01
# V WormBase gene 9244402 9246360 . - . gene_id "WBGene00000003"; gene_name "aat-2"; gene_source "WormBase"; gene_biotype "protein_coding";
# V WormBase transcript 9244402 9246360 . - . gene_id "WBGene00000003"; transcript_id "F07C3.7.1"; gene_name "aat-2"; gene_source "WormBase"; gene_biotype "protein_coding"; transcript_source "WormBase"; transcript_biotype "protein_coding";
# V WormBase exon 9246080 9246360 . - . gene_id "WBGene00000003"; transcript_id "F07C3.7.1"; exon_number "1"; gene_name "aat-2"; gene_source "WormBase"; gene_biotype "protein_coding"; transcript_source "WormBase"; transcript_biotype "protein_coding"; exon_id "F07C3.7.1.e1";
# V WormBase CDS 9246080 9246352 . - 0 gene_id "WBGene00000003"; transcript_id "F07C3.7.1"; exon_number "1"; gene_name "aat-2"; gene_source "WormBase"; gene_biotype "protein_coding"; transcript_source "WormBase"; transcript_biotype "protein_coding"; protein_id "F07C3.7.1";
# V WormBase start_codon 9246350 9246352 . - 0 gene_id "WBGene00000003"; transcript_id "F07C3.7.1"; exon_number "1"; gene_name "aat-2"; gene_source "WormBase"; gene_biotype "protein_coding"; transcript_source "WormBase"; transcript_biotype "protein_coding";
# ...
features <- import("Caenorhabditis_elegans.WBcel235.104.gtf.gz")
# Optional: just retain the columns of metadata we need
mcols(features) <- mcols(features)[,c("type","gene_name","gene_id","transcript_biotype","transcript_id")]
features
## GRanges object with 734022 ranges and 5 metadata columns:
## seqnames ranges strand | type gene_name
## <Rle> <IRanges> <Rle> | <factor> <character>
## [1] V 9244402-9246360 - | gene aat-2
## [2] V 9244402-9246360 - | transcript aat-2
## [3] V 9246080-9246360 - | exon aat-2
## [4] V 9246080-9246352 - | CDS aat-2
## [5] V 9246350-9246352 - | start_codon aat-2
## ... ... ... ... . ... ...
## [734018] MtDNA 10403-11354 + | transcript MTCE.33
## [734019] MtDNA 10403-11354 + | exon MTCE.33
## [734020] MtDNA 13275-13327 + | gene MTCE.36
## [734021] MtDNA 13275-13327 + | transcript MTCE.36
## [734022] MtDNA 13275-13327 + | exon MTCE.36
## gene_id transcript_biotype transcript_id
## <character> <character> <character>
## [1] WBGene00000003 <NA> <NA>
## [2] WBGene00000003 protein_coding F07C3.7.1
## [3] WBGene00000003 protein_coding F07C3.7.1
## [4] WBGene00000003 protein_coding F07C3.7.1
## [5] WBGene00000003 protein_coding F07C3.7.1
## ... ... ... ...
## [734018] WBGene00014472 rRNA MTCE.33
## [734019] WBGene00014472 rRNA MTCE.33
## [734020] WBGene00014473 <NA> <NA>
## [734021] WBGene00014473 tRNA MTCE.36
## [734022] WBGene00014473 tRNA MTCE.36
## -------
## seqinfo: 7 sequences from an unspecified genome; no seqlengths
Seqinfo and genome versions
Various objects have associated Seqinfo information, listing the chromosomes they refer to and their lengths. This allows some extra sanity checking, and is also necessary for some tasks.
seqinfo(features)
## Seqinfo object with 7 sequences from an unspecified genome; no seqlengths:
## seqnames seqlengths isCircular genome
## I NA NA <NA>
## II NA NA <NA>
## III NA NA <NA>
## IV NA NA <NA>
## V NA NA <NA>
## X NA NA <NA>
## MtDNA NA NA <NA>
seqinfo(seqs)
## Seqinfo object with 7 sequences from an unspecified genome:
## seqnames seqlengths isCircular genome
## I 15072434 NA <NA>
## II 15279421 NA <NA>
## III 13783801 NA <NA>
## IV 17493829 NA <NA>
## V 20924180 NA <NA>
## X 17718942 NA <NA>
## MtDNA 13794 NA <NA>
myseqinfo <- seqinfo(seqs)
isCircular(myseqinfo) <- c(rep(FALSE,6),TRUE)
seqinfo(features) <- myseqinfo
One thing you need to care about is the genome version. We often refer to genome versions by their name in the UCSC genome browser. For example the two most recent versions of the worm genome are “ce10” from 2010 and “ce11” from 2013. Similarly with human data you may find data for either the “hg19” or “hg38” versions of the human genome.
Seqinfo(genome="ce10")
## Seqinfo object with 7 sequences (1 circular) from ce10 genome:
## seqnames seqlengths isCircular genome
## chrI 15072423 FALSE ce10
## chrII 15279345 FALSE ce10
## chrIII 13783700 FALSE ce10
## chrIV 17493793 FALSE ce10
## chrV 20924149 FALSE ce10
## chrX 17718866 FALSE ce10
## chrM 13794 TRUE ce10
Seqinfo(genome="ce11")
## Seqinfo object with 7 sequences (1 circular) from ce11 genome:
## seqnames seqlengths isCircular genome
## chrI 15072434 FALSE ce11
## chrII 15279421 FALSE ce11
## chrIII 13783801 FALSE ce11
## chrIV 17493829 FALSE ce11
## chrV 20924180 FALSE ce11
## chrX 17718942 FALSE ce11
## chrM 13794 TRUE ce11
Notice the slightly different chromosome lengths! Features from one genome version will not be correctly located on a different genome version.
Our ENSEMBL sequences match “ce11”, also called “WBcel235”. However the chromosome names are slightly different, a common source of pain. See ?seqlevelsStyle
.
Querying GRanges objects
The metadata columns let us query the GRanges, for example for a feature type:
subset(features, type == "CDS")
# Equivalently:
# features[features$type == "CDS"]
## GRanges object with 225595 ranges and 5 metadata columns:
## seqnames ranges strand | type gene_name
## <Rle> <IRanges> <Rle> | <factor> <character>
## [1] V 9246080-9246352 - | CDS aat-2
## [2] V 9245588-9246033 - | CDS aat-2
## [3] V 9245443-9245539 - | CDS aat-2
## [4] V 9244683-9245315 - | CDS aat-2
## [5] V 11470583-11470598 - | CDS aat-6
## ... ... ... ... . ... ...
## [225591] MtDNA 6506-7732 + | CDS nduo-4
## [225592] MtDNA 7845-9419 + | CDS ctc-1
## [225593] MtDNA 9649-10341 + | CDS ctc-2
## [225594] MtDNA 11356-11688 + | CDS nduo-3
## [225595] MtDNA 11691-13271 + | CDS nduo-5
## gene_id transcript_biotype transcript_id
## <character> <character> <character>
## [1] WBGene00000003 protein_coding F07C3.7.1
## [2] WBGene00000003 protein_coding F07C3.7.1
## [3] WBGene00000003 protein_coding F07C3.7.1
## [4] WBGene00000003 protein_coding F07C3.7.1
## [5] WBGene00000007 protein_coding T11F9.4a.1
## ... ... ... ...
## [225591] WBGene00010963 protein_coding MTCE.25.1
## [225592] WBGene00010964 protein_coding MTCE.26.1
## [225593] WBGene00010965 protein_coding MTCE.31.1
## [225594] WBGene00010966 protein_coding MTCE.34.1
## [225595] WBGene00010967 protein_coding MTCE.35.1
## -------
## seqinfo: 7 sequences (1 circular) from an unspecified genome
Note: subset
is a generic R function. It is similar to dplyr’s filter
. The second argument is special, in it you can refer to columns of the GRanges directly.
Representation of genes in a GTF file
Let’s have a look at the different feature types in a particular gene.
trx1_features <- subset(features, gene_name == "trx-1")
# Equivalently:
# features[features$gene_name == "trx-1" & !is.na(features$gene_name)]
trx1_features
## GRanges object with 25 ranges and 5 metadata columns:
## seqnames ranges strand | type gene_name
## <Rle> <IRanges> <Rle> | <factor> <character>
## [1] II 7752637-7753607 - | gene trx-1
## [2] II 7752637-7753513 - | transcript trx-1
## [3] II 7753283-7753513 - | exon trx-1
## [4] II 7753283-7753489 - | CDS trx-1
## [5] II 7753487-7753489 - | start_codon trx-1
## ... ... ... ... . ... ...
## [21] II 7752637-7752885 - | exon trx-1
## [22] II 7752820-7752885 - | CDS trx-1
## [23] II 7752817-7752819 - | stop_codon trx-1
## [24] II 7753540-7753607 - | five_prime_utr trx-1
## [25] II 7752637-7752816 - | three_prime_utr trx-1
## gene_id transcript_biotype transcript_id
## <character> <character> <character>
## [1] WBGene00015062 <NA> <NA>
## [2] WBGene00015062 protein_coding B0228.5a.1
## [3] WBGene00015062 protein_coding B0228.5a.1
## [4] WBGene00015062 protein_coding B0228.5a.1
## [5] WBGene00015062 protein_coding B0228.5a.1
## ... ... ... ...
## [21] WBGene00015062 protein_coding B0228.5b.1
## [22] WBGene00015062 protein_coding B0228.5b.1
## [23] WBGene00015062 protein_coding B0228.5b.1
## [24] WBGene00015062 protein_coding B0228.5b.1
## [25] WBGene00015062 protein_coding B0228.5b.1
## -------
## seqinfo: 7 sequences (1 circular) from an unspecified genome
look(trx1_features, trx1_features$type)
Look at the different types in the “type” column. Each “gene” may have multiple “transcript” features (isoforms). Each transcript in turn has a set of “exon” features, and if it is a protein coding gene, a set of “CDS” (coding sequence) features. The CDS features cover a subset of the bases covered by the exon features. The “gene_id” and “transcript_id” columns let us know which transcript and gene each exon belongs to.
# --------------------------------------------------> gene
#
# --------------------------------------------> transcript
# ----------> ---> ----------------> exon
# ----> ---> ----------> CDS
#
#
# -----------------------------------> transcript
# --------> ----> ----------> exon
# ---> ----> --> CDS
Let’s look at this in the Integrative Genome Browser (IGV):
- Open IGV.
- Select the genome “C. elegans (ce11)” (top left drop-down box).
- Load the GTF file (“File” menu, “Load from file…”).
- Search for “trx-1” in the location box (next to the “Go” button).
IGV turns all these features into a nice diagram of the gene’s transcripts. The ENSEMBL gene annotation contains 5’UTR and 3’UTR regions, which are missing from the default gene annotations shown by IGV (probably from NCBI’s RefSeq). Gviz
or ggbio
can be used to produce diagrams like this from within R (with a little work).
Further operations on GRanges
💡 Using accessor methods like “start()” and “strand()” frees the authors of GRanges to store data internally using whatever efficiency tricks they want, and even to change this in a new version of Bioconductor. This is good, but it can be better:
💡 Double-stranded DNA has a rotational symmetry. None of the physics of DNA is changed if we look at it rotated end-to-end 180 degrees. Labelling of the strands as “+” and “-” is arbitrary. Also, it is usually not the exact position of features in a genome that is important so much as their relative position. If we restrict outselves to using functions that work just as well with either labelling of strands, and that are not affected by absolute position, our code will more directly express our intention and be less prone to bugs.
Intra-range
Various useful manipulations of individual ranges are defined.
?"intra-range-methods"
Note: How these make use of the strand is a little haphazard. For example flank()
, resize()
, and promoters()
respect strand but shift()
does not.
For example, suppose we are interested in the first 100 bases of a transcript:
feat <- features[2]
feat_initial <- resize(feat, 100, fix="start")
look( c(feat, feat_initial) )
getSeq(seqs, feat_initial)
## DNAStringSet object of length 1:
## width seq
## [1] 100 ACGCAAAAATGAACGAAAAAGAAGAAGAAGTATC...ACTTTTCAATGGCTGCACAATCATTATCGGAGT
resize
can fix either the fix="start"
or fix="end"
of the sequence.
flank
can be either flank the start (start=TRUE
) or end (start=FALSE
).
# 5' 3'
# input ------->
# . .
# resize(input, n, fix="start") --> .
# resize(input, n, fix="end") . -->
# . .
# flank(input, n, start=TRUE) -->. .
# flank(input, n, start=FALSE) . .-->
# . .
# promoters(input, n1,n2) --------> .
#
Inter-range
?"inter-range-methods"
?"nearest-methods"
?"setops-methods"
One compelling feature of GenomicRanges is that it is able to find overlapping ranges very quickly.
query <- as("II:7752600-7754000:-", "GRanges")
hits <- findOverlaps(query, features, ignore.strand=TRUE)
hits
## Hits object with 35 hits and 0 metadata columns:
## queryHits subjectHits
## <integer> <integer>
## [1] 1 488078
## [2] 1 488079
## [3] 1 488121
## [4] 1 488124
## [5] 1 488125
## ... ... ...
## [31] 1 488255
## [32] 1 488256
## [33] 1 488257
## [34] 1 488258
## [35] 1 488259
## -------
## queryLength: 1 / subjectLength: 734022
subjectHits(hits)
## [1] 488078 488079 488121 488124 488125 488173 488176 488207 488231 488234
## [11] 488235 488236 488237 488238 488239 488240 488241 488242 488243 488244
## [21] 488245 488246 488247 488248 488249 488250 488251 488252 488253 488254
## [31] 488255 488256 488257 488258 488259
features[subjectHits(hits)]
## GRanges object with 35 ranges and 5 metadata columns:
## seqnames ranges strand | type gene_name
## <Rle> <IRanges> <Rle> | <factor> <character>
## [1] II 7724903-7752715 + | gene cpna-2
## [2] II 7724903-7752715 + | transcript cpna-2
## [3] II 7752309-7752715 + | exon cpna-2
## [4] II 7752429-7752715 + | three_prime_utr cpna-2
## [5] II 7724903-7752715 + | transcript cpna-2
## ... ... ... ... . ... ...
## [31] II 7752637-7752885 - | exon trx-1
## [32] II 7752820-7752885 - | CDS trx-1
## [33] II 7752817-7752819 - | stop_codon trx-1
## [34] II 7753540-7753607 - | five_prime_utr trx-1
## [35] II 7752637-7752816 - | three_prime_utr trx-1
## gene_id transcript_biotype transcript_id
## <character> <character> <character>
## [1] WBGene00015061 <NA> <NA>
## [2] WBGene00015061 protein_coding B0228.4a.1
## [3] WBGene00015061 protein_coding B0228.4a.1
## [4] WBGene00015061 protein_coding B0228.4a.1
## [5] WBGene00015061 protein_coding B0228.4c.1
## ... ... ... ...
## [31] WBGene00015062 protein_coding B0228.5b.1
## [32] WBGene00015062 protein_coding B0228.5b.1
## [33] WBGene00015062 protein_coding B0228.5b.1
## [34] WBGene00015062 protein_coding B0228.5b.1
## [35] WBGene00015062 protein_coding B0228.5b.1
## -------
## seqinfo: 7 sequences (1 circular) from an unspecified genome
findOverlaps(query, features, ignore.strand=FALSE)
## Hits object with 25 hits and 0 metadata columns:
## queryHits subjectHits
## <integer> <integer>
## [1] 1 488235
## [2] 1 488236
## [3] 1 488237
## [4] 1 488238
## [5] 1 488239
## ... ... ...
## [21] 1 488255
## [22] 1 488256
## [23] 1 488257
## [24] 1 488258
## [25] 1 488259
## -------
## queryLength: 1 / subjectLength: 734022
With findOverlaps
, we can use genomic location as the key when joining disparate types of data, so this is an important tool for integrative analysis. See also the related functions nearest
, precede
, follow
, and distance
.
GenomicRanges
also provides:
range
- get a feature that spans from the start to the end of all features in a GRanges
.
reduce
- merge overlapping features, so that the same bases are covered by a reduced collection of features.
disjoin
- as with reduce, but broken at each start and end of the input features.
setdiff
- subtracts one set of features from another, could be used with range on a set of exons to get introns. Might need to use GenomicRanges::setdiff
if also using dplyr
.
# input
# --------->
# ----------------->
# ------>
# ---->
#
# range -------------------------------->
#
# reduce -----------------------> ---->
#
# disjoin ---->---->-->------>---> ---->
#
# setdiff(range(input),input) --->
trx1_exons <- subset(trx1_features, type == "exon")
look( trx1_exons )
look( range(trx1_exons) )
look( reduce(trx1_exons) )
look( disjoin(trx1_exons) )
Challenge: transcript initiation
transcripts <- subset(features, type == "transcript")
Make a GRanges of all transcripts with a biotype of “protein_coding”.
Make a GRanges of the two bases immediately preceding each of these transcripts.
Get the DNA sequences of the ranges from part 2. Use table()
to see if any sequences are particularly common.
Advanced: Get the four bases spanning the start of the transcript (two upstrand of the start and two downstrand of the start).
Note: In C. elegans transcript initiation is complicated by trans-splicing. I wonder what exactly the 5’ ends of these transcript annotations represent?
Example: Examining the PolyAdenylation Signal (PAS)
We’ll now look at an example where the location of the thing we’re looking for is more uncertain.
We suspect there is some motif that causes cleavage and polyadenylation of RNA transcripts, close to the 3’ end of each transcript. Bioconductor will let us explore sequence around the ends of transcripts. It can also help prepare data for command-line software such as the MEME Suite.
We’ll limit our attention to protein coding transcripts.
transcripts <- subset(features,
type == "transcript" & transcript_biotype == "protein_coding")
ends <- resize(transcripts, width=30, fix="end")
end_seqs <- getSeq(seqs, ends)
names(end_seqs) <- ends$transcript_id
We can check for general biasses in base composition:
library(seqLogo)
letter_counts <- consensusMatrix(end_seqs)
props <- prop.table(letter_counts[1:4,], 2)
seqLogo(props, ic.scale=FALSE)
seqLogo(props)
Saving the sequences to a file lets us apply command-line tools:
writeXStringSet(end_seqs, "end_seqs.fasta")
This is suitable for use with, for example, the streme
tool from MEME.
# streme --rna --minw 4 --maxw 15 --p end_seqs.fasta
Ok ok, so actually there is known PolyAdenylation Signal (PAS) motif, canonically “AAUAAA”. Let’s now look for this known motif.
counts <- vcountPattern("AATAAA", end_seqs)
table(counts)
## counts
## 0 1 2 3 4 5
## 22521 10500 501 27 2 1
matches <- vmatchPattern("AATAAA", end_seqs)
plot( tabulate(start(unlist(matches)),30), xlab="start position",ylab="matches")
We could also look for this pattern in the genome in general. Here is some code to find matches on either strand, using vmatchPattern()
.
query <- DNAString("AATAAA")
max.mismatch <- 0
fwd <- vmatchPattern(query, seqs, max.mismatch=max.mismatch)
fwd <- as(fwd, "GRanges")
strand(fwd) <- "+"
rev <- vmatchPattern(reverseComplement(query), seqs, max.mismatch=max.mismatch)
rev <- as(rev, "GRanges")
strand(rev) <- "-"
matches <- c(fwd, rev)
matches
## GRanges object with 276631 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] I 1309-1314 +
## [2] I 2406-2411 +
## [3] I 2577-2582 +
## [4] I 2649-2654 +
## [5] I 2653-2658 +
## ... ... ... ...
## [276627] MtDNA 13477-13482 -
## [276628] MtDNA 13520-13525 -
## [276629] MtDNA 13563-13568 -
## [276630] MtDNA 13606-13611 -
## [276631] MtDNA 13651-13656 -
## -------
## seqinfo: 7 sequences from an unspecified genome; no seqlengths
Once we have a set of ranges, they can be related to other sets of ranges. For example, to reproduce the earlier result:
overlaps <- findOverlaps(matches, ends, type="within")
table(table( subjectHits(overlaps) ))
##
## 1 2 3 4 5
## 10500 501 27 2 1
The matches can be examined in IGV if saved in a GFF file. With a “tabix” index, IGV avoids the need to load the whole file.
export(matches, "motif-matches.gff", index=TRUE)
The pairwiseAlignment()
function provides more flexibility than vmatchPattern()
, if needed. Command-line tools such as BLAST will be faster for longer query sequences.
Similar exploration can be performed around other regions of interest, such as peaks identified from ChIP. R may either provide a complete solution or serve as the glue plugging command-line tools together.
BAMs and bigWigs
import()
can read various other file types. This section will demonstrate reading read alignments from a BAM file and producing a bigWig depth of coverage file. The reads in the BAM file used here is a small sample from GSE57993 sample “N2 Rep1”. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE57993
Both BAM and bigWig files are designed so that specific regions can be accessed efficiently. Genome browsers can use this to only load data as needed. BigWig files also contain versions of their data at different resolutions, so a zoomed-out view can be quickly loaded. Bioconductor packages will also usually offer some way to load specific regions. This is useful because these files can be very large.
alignments <- import("example.bam")
alignments
## GAlignments object with 168060 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end width
## <Rle> <Rle> <character> <integer> <integer> <integer> <integer>
## [1] I - 118M 118 2574 2691 118
## [2] I - 131M 131 4120 4250 131
## [3] I - 64M 64 4120 4183 64
## [4] I + 151M 151 16596 16746 151
## [5] I + 30M 30 16614 16643 30
## ... ... ... ... ... ... ... ...
## [168056] MtDNA + 61M 61 13734 13794 61
## [168057] MtDNA + 61M 61 13734 13794 61
## [168058] MtDNA + 61M 61 13734 13794 61
## [168059] MtDNA + 51M 51 13744 13794 51
## [168060] MtDNA + 21M 21 13774 13794 21
## njunc
## <integer>
## [1] 0
## [2] 0
## [3] 0
## [4] 0
## [5] 0
## ... ...
## [168056] 0
## [168057] 0
## [168058] 0
## [168059] 0
## [168060] 0
## -------
## seqinfo: 7 sequences from an unspecified genome
depth <- coverage(alignments)
depth
## RleList of length 7
## $I
## integer-Rle of length 15072434 with 15925 runs
## Lengths: 2573 118 1428 64 67 ... 3 4 426 151 1476
## Values : 0 1 0 2 1 ... 17649 17648 0 1 0
##
## $II
## integer-Rle of length 15279421 with 11569 runs
## Lengths: 5498 33 10382 91 9048 ... 45 6 26915 114 7453
## Values : 0 1 0 1 0 ... 3 2 0 1 0
##
## $III
## integer-Rle of length 13783801 with 14706 runs
## Lengths: 17253 86 3187 6 30 ... 116 47 2034 151 2709
## Values : 0 1 0 3 4 ... 0 1 0 1 0
##
## $IV
## integer-Rle of length 17493829 with 13544 runs
## Lengths: 38992 128 69850 30 1689 ... 1 5 3 21 23096
## Values : 0 1 0 1 0 ... 9 8 2 1 0
##
## $V
## integer-Rle of length 20924180 with 12136 runs
## Lengths: 6583 3 2 2 5 ... 72 25 8185 67 16741
## Values : 0 2 9 10 16 ... 0 1 0 1 0
##
## ...
## <2 more elements>
# Depth of coverage for gene rps-12
plot(depth$III[5679200:5680200], type="l")
export(depth, "depth.bw")
Explore the documentation for “GAlignments”, “Rle”, and “RleList”. “Rle” and “RleList” objects are memory-efficient vectors and lists of vectors. They support arithmetic operation such as scaling with *
and adding with +
. Rather neat! Sometimes a function won’t support “Rle”s. When this happens, an Rle can be converted to a conventional vector with as.numeric()
.
Filtered import of a BAM file can be performed with a suitable use of ScanBamParam
from the Rsamtools package, import("example.bam", param=ScanBamParam( ... ))
.
- Specific strand.
- Specific region of reference genome.
- Specific set of single cells, based on CB attribute.
Try loading the “motif-matches.gff”, “example.bam”, and “depth.bw” files into IGV. Look for example at gene “rps-12”.
(return to slideshow)
Genome and annotation resources
Besides software, Bioconductor includes various types of reference data.
Example packages
BSgenome.Hsapiens.UCSC.hg38 Biostrings genome, Homo sapiens, from the UCSC browser, version hg38. DNA for chromosomes, usable in the same way as the DNAStringSet used above.
TxDb.Hsapiens.UCSC.hg38.knownGene Transcript database, Homo sapiens, from UCSC browser, genome verison hg38, “knownGene” gene annotations. GRanges information for genes and transcripts, much as we loaded from a GTF file above.
org.Hs.eg.db Organism Homo sapiens, primary key is Entrez Gene, database. Translation of gene ids from various databases, assignment to GO terms, KEGG pathways, etc. Entrez Gene ids are used as the primary key.
GO.db GO Gene Ontology term descriptions and relationships. However to find which terms apply to specific genes, use the GOALL column in the relevant species’ organism database.
Loading any of these packages gives you an AnnotationDb
object (defined in the package AnnotationDbi
). The main way to query these objects is with the select()
function. mapIds()
can be used if you need to enforce a one-to-one mapping.
library(org.Ce.eg.db)
## Loading required package: AnnotationDbi
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
##
keytypes(org.Ce.eg.db)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
## [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
## [11] "GO" "GOALL" "ONTOLOGY" "ONTOLOGYALL" "PATH"
## [16] "PMID" "REFSEQ" "SYMBOL" "UNIPROT" "WORMBASE"
columns(org.Ce.eg.db)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
## [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
## [11] "GO" "GOALL" "ONTOLOGY" "ONTOLOGYALL" "PATH"
## [16] "PMID" "REFSEQ" "SYMBOL" "UNIPROT" "WORMBASE"
AnnotationDbi::select(org.Ce.eg.db,
keys="trx-1", keytype="SYMBOL", columns=c("ENSEMBL","ENTREZID"))
## 'select()' returned 1:1 mapping between keys and columns
## SYMBOL ENSEMBL ENTREZID
## 1 trx-1 WBGene00015062 181863
head( AnnotationDbi::select(org.Ce.eg.db,
keys="WBGene00015062", keytype="ENSEMBL", columns="GOALL") )
## 'select()' returned 1:many mapping between keys and columns
## ENSEMBL GOALL EVIDENCEALL ONTOLOGYALL
## 1 WBGene00015062 GO:0003674 IDA MF
## 2 WBGene00015062 GO:0003674 IEA MF
## 3 WBGene00015062 GO:0003824 IDA MF
## 4 WBGene00015062 GO:0003824 IEA MF
## 5 WBGene00015062 GO:0005575 IDA CC
## 6 WBGene00015062 GO:0005622 IDA CC
library(GO.db)
##
AnnotationDbi::select(GO.db,
keys="GO:0005622", keytype="GOID", columns="TERM")
## 'select()' returned 1:1 mapping between keys and columns
## GOID TERM
## 1 GO:0005622 intracellular anatomical structure
Transcript database objects
We’ve been using our genomic features as one big unstructured GRanges. This is messy. TxDb objects offer a more structured representation of genes, transcripts, exons, and coding sequences.
library(GenomicFeatures)
## Warning: package 'GenomicFeatures' was built under R version 4.1.1
txdb <- makeTxDbFromGRanges(features)
txdb
## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Genome: NA
## # Nb of transcripts: 61451
## # Db created by: GenomicFeatures package from Bioconductor
## # Creation time: 2021-09-24 04:45:52 +1000 (Fri, 24 Sep 2021)
## # GenomicFeatures version at creation time: 1.45.2
## # RSQLite version at creation time: 2.2.8
## # DBSCHEMAVERSION: 1.2
txdb
is a “TxDb” object. Some ways to extract data from a TxDb are:
genes(txdb)
transcriptsBy(txdb, by="gene")
exonsBy(txdb, use.names=TRUE)
cdsBy(txdb, use.names=TRUE)
cds_ranges <- cdsBy(txdb, use.names=TRUE)
cds_ranges$B0228.5a.1
cds_ranges[["B0228.5a.1"]]
unlist(cds_ranges)
cds_ranges
here is a “GRangesList”. That is, a list containing GRanges objects. To get the transcript sequence or the coding sequence, these each need to be retrieved and then concatenated together. extractTranscriptSeqs()
can do this for us.
extractTranscriptSeqs(seqs, cds_ranges)
## DNAStringSet object of length 33552:
## width seq names
## [1] 372 ATGAAACTCGTAATTCTGCTA...GCTCGTAATCAATTCCAATAA Y74C9A.2a.3
## [2] 372 ATGAAACTCGTAATTCTGCTA...GCTCGTAATCAATTCCAATAA Y74C9A.2a.1
## [3] 372 ATGAAACTCGTAATTCTGCTA...GCTCGTAATCAATTCCAATAA Y74C9A.2a.4
## [4] 372 ATGAAACTCGTAATTCTGCTA...GCTCGTAATCAATTCCAATAA Y74C9A.2a.5
## [5] 372 ATGAAACTCGTAATTCTGCTA...GCTCGTAATCAATTCCAATAA Y74C9A.2a.2
## ... ... ...
## [33548] 1227 TTGTTAGAATTTCTTTTTATC...TTTTGATTAAGTGTGTTTTAC MTCE.25.1
## [33549] 1575 ATTAATCTTTATAAAAAATAT...AGAACTACTAGATTAAAAAAT MTCE.26.1
## [33550] 693 ATTAATAATTTTTTTCAAGGA...TGATGTTTTGGTACTATAGAA MTCE.31.1
## [33551] 333 ATTTTAGTGCTATTAATAGTT...GGTAAATTAGTTTGAGTAATT MTCE.34.1
## [33552] 1581 ATTAATATTTCTATTTTTTTG...TTTATTTTTTTTATAATTTGT MTCE.35.1
A TxDb can also be accessed like an AnnotationDb.
keytypes(txdb)
## [1] "CDSID" "CDSNAME" "EXONID" "EXONNAME" "GENEID" "TXID" "TXNAME"
columns(txdb)
## [1] "CDSCHROM" "CDSEND" "CDSID" "CDSNAME" "CDSPHASE"
## [6] "CDSSTART" "CDSSTRAND" "EXONCHROM" "EXONEND" "EXONID"
## [11] "EXONNAME" "EXONRANK" "EXONSTART" "EXONSTRAND" "GENEID"
## [16] "TXCHROM" "TXEND" "TXID" "TXNAME" "TXSTART"
## [21] "TXSTRAND" "TXTYPE"
Under the hood these are all SQLite databases. select()
joins tables together as needed to answer a query.
DBI::dbListTables( dbconn(txdb) )
## [1] "cds" "chrominfo" "exon" "gene" "metadata"
## [6] "splicing" "transcript"
DBI::dbListFields( dbconn(txdb), "transcript" )
## [1] "_tx_id" "tx_name" "tx_type" "tx_chrom" "tx_strand" "tx_start"
## [7] "tx_end"
There’s much more to explore here. Have a look at the documentation for the GenomicFeatures
and ensembldb
packages. To annotate a set of genomic features such as peaks from a ChIP-seq experiment, see for example the ChIPseeker
package.
biomaRt
BioMart servers accessed using the biomaRt
package are another way to get information such as translation of gene ids, gene sets, and gene information. For reproducibility, if using the ENSEMBL servers, make sure to specify a specific version of Ensembl to use. Your scripts may fail in future if the server you are using goes down or changes how it is accessed.
library(biomaRt)
mart <- useEnsembl(biomart = "genes", dataset = "celegans_gene_ensembl", version=104)
getBM(
attributes=c("external_gene_name", "ensembl_gene_id", "entrezgene_id"),
filters="external_gene_name", values="trx-1", mart=mart)
AnnotationHub
AnnotationHub is a way to retrieve data from a more comprehensive set of organisms and data providers than are provided as individual Bioconductor packages. The retrieved data is returned in an appropriate Bioconductor type. If data is being updated over time (eg improved annotation of a genome), each version receives a unique ID in AnnotationHub, making it possible to write reproducible analyses.
Files are cached, so they will only be downloaded once.
In the example below, the yeast genome and Ensembl annotations are retrieved:
library(AnnotationHub)
ah <- AnnotationHub()
# ah contains a large collection of records that can be retrieved
ah
length(ah)
colnames( mcols(ah) )
table( ah$rdataclass )
# There is an interactive Shiny search interface
display(ah)
# query() searches for terms in an unstructured way
query(ah, c("Ensembl", "96", "Saccharomyces cerevisiae"))
# Having located records of interest,
# your R script can refer to the specific AH... record,
# so it always uses the same version of the data.
sc_genome <- ah[["AH70449"]]
sc_granges <- ah[["AH69700"]]
sc_txdb <- ah[["AH69265"]]
# sc_genome is a TwoBitFile
# Can use getSeq on it without loading everything into memory
seqinfo(sc_genome)
getSeq(sc_genome, as("IV:1-10","GRanges"))
import(sc_genome)
Package versions used
Use sessionInfo()
to check the versions of packages used.
sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Mojave 10.14.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
##
## attached base packages:
## [1] grid stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] GenomicFeatures_1.45.2 GO.db_3.13.0 org.Ce.eg.db_3.13.0
## [4] AnnotationDbi_1.55.1 Biobase_2.53.0 seqLogo_1.59.0
## [7] Gviz_1.37.2 BSgenome_1.61.0 rtracklayer_1.53.1
## [10] GenomicRanges_1.45.0 Biostrings_2.61.2 GenomeInfoDb_1.29.8
## [13] XVector_0.33.0 IRanges_2.27.2 S4Vectors_0.31.3
## [16] BiocGenerics_0.39.2
##
## loaded via a namespace (and not attached):
## [1] colorspace_2.0-2 rjson_0.2.20
## [3] ellipsis_0.3.2 biovizBase_1.41.0
## [5] htmlTable_2.2.1 base64enc_0.1-3
## [7] dichromat_2.0-0 rstudioapi_0.13
## [9] bit64_4.0.5 fansi_0.5.0
## [11] xml2_1.3.2 splines_4.1.0
## [13] cachem_1.0.6 knitr_1.34
## [15] Formula_1.2-4 jsonlite_1.7.2
## [17] Rsamtools_2.9.1 cluster_2.1.2
## [19] dbplyr_2.1.1 png_0.1-7
## [21] compiler_4.1.0 httr_1.4.2
## [23] backports_1.2.1 assertthat_0.2.1
## [25] Matrix_1.3-4 fastmap_1.1.0
## [27] lazyeval_0.2.2 htmltools_0.5.2
## [29] prettyunits_1.1.1 tools_4.1.0
## [31] gtable_0.3.0 glue_1.4.2
## [33] GenomeInfoDbData_1.2.6 dplyr_1.0.7
## [35] rappdirs_0.3.3 Rcpp_1.0.7
## [37] jquerylib_0.1.4 vctrs_0.3.8
## [39] xfun_0.26 stringr_1.4.0
## [41] lifecycle_1.0.0 restfulr_0.0.13
## [43] ensembldb_2.17.4 XML_3.99-0.8
## [45] zlibbioc_1.39.0 scales_1.1.1
## [47] VariantAnnotation_1.39.0 hms_1.1.0
## [49] MatrixGenerics_1.5.4 ProtGenerics_1.25.1
## [51] parallel_4.1.0 SummarizedExperiment_1.23.4
## [53] AnnotationFilter_1.17.1 RColorBrewer_1.1-2
## [55] yaml_2.2.1 curl_4.3.2
## [57] memoise_2.0.0 gridExtra_2.3
## [59] ggplot2_3.3.5 sass_0.4.0
## [61] biomaRt_2.49.4 rpart_4.1-15
## [63] latticeExtra_0.6-29 stringi_1.7.4
## [65] RSQLite_2.2.8 highr_0.9
## [67] BiocIO_1.3.0 checkmate_2.0.0
## [69] filelock_1.0.2 BiocParallel_1.27.9
## [71] rlang_0.4.11 pkgconfig_2.0.3
## [73] matrixStats_0.61.0 bitops_1.0-7
## [75] evaluate_0.14 lattice_0.20-44
## [77] purrr_0.3.4 GenomicAlignments_1.29.0
## [79] htmlwidgets_1.5.4 bit_4.0.4
## [81] tidyselect_1.1.1 magrittr_2.0.1
## [83] R6_2.5.1 generics_0.1.0
## [85] Hmisc_4.5-0 DelayedArray_0.19.3
## [87] DBI_1.1.1 pillar_1.6.2
## [89] foreign_0.8-81 survival_3.2-13
## [91] KEGGREST_1.33.0 RCurl_1.98-1.5
## [93] nnet_7.3-16 tibble_3.1.4
## [95] crayon_1.4.1 utf8_1.2.2
## [97] BiocFileCache_2.1.1 rmarkdown_2.11
## [99] jpeg_0.1-9 progress_1.2.2
## [101] data.table_1.14.0 blob_1.2.2
## [103] digest_0.6.27 munsell_0.5.0
## [105] bslib_0.3.0
Home