This workshop looks at working with sequences, primarily DNA sequences, and genomic features.
Monash Data Fluency’s introductory R workshop focusses on “tidy” data analysis, which emphasizes storing all data in data frames. Bioconductor represents a different strand of current development in R. Bioconductor uses a great variety of data types. It’s the very opposite of tidy!
Nevertheless, Bioconductor is overwhelmingly comprehensive, and represents the most complete environment available for working with bioinformatic data currently available.
Bioconductor packages are installed using the BiocManager
package. For example:
# Install BiocManager from CRAN package repository
install.packages("BiocManager")
# BiocManager can then install packages from Bioconductor package repository
BiocManager::install("Biostrings")
#or
library(BiocManager)
install("Biostrings")
Bioconductor packages usually have useful documentation in the form of “vignettes”. These are readable on the Bioconductor website, from the RStudio “help” pane, or within R:
vignette()
vignette(package="Biostrings")
vignette("BiostringsQuickOverview", package="Biostrings")
# or browse https://bioconductor.org/packages/release/BiocViews.html#___Software
Load packages we will be using:
library(Biostrings) # Provides DNAString, DNAStringSet, etc
library(BSgenome) # Provides getSeq()
library(GenomicRanges) # Provides GRanges containing genomic ranges, etc
library(GenomicFeatures) # Provides TxDb objects containing genes/transcripts/exons
library(rtracklayer) # Provides import() and export()
Package Biostrings
offers classes for storing DNA strings, DNAString
, amino acid sequences, AAString
, or anything else in a BString
. These are like character strings, but a variety of biologically meaningful functions can be applied to them.
myseq <- DNAString("CCGCGCACCAAC")
myseq
## 12-letter DNAString object
## seq: CCGCGCACCAAC
class(myseq)
## [1] "DNAString"
## attr(,"package")
## [1] "Biostrings"
reverseComplement(myseq)
## 12-letter DNAString object
## seq: GTTGGTGCGCGG
translate(myseq)
## 4-letter AAString object
## seq: PRTN
subseq(myseq, 3,5)
## 3-letter DNAString object
## seq: GCG
myseq[3:5]
## 3-letter DNAString object
## seq: GCG
as.character(myseq)
## [1] "CCGCGCACCAAC"
You can see a complete set of functions that work with DNAString with:
methods(class="DNAString")
You can get help on the DNAString
class with:
?"DNAString-class"
Often we want to work with a list of sequences, such as chromosomes.
myset <- DNAStringSet( list(chrI=myseq, chrII=DNAString("ACGTACGT")) )
myset
## DNAStringSet object of length 2:
## width seq names
## [1] 12 CCGCGCACCAAC chrI
## [2] 8 ACGTACGT chrII
# A DNAStringSet is list-like
myset$chrII
## 8-letter DNAString object
## seq: ACGTACGT
# or myset[["chrII"]]
# or myset[[2]]
Reverse complement the following DNA sequence and then translate to an amino acid sequence:
GCTTTCGTTTTCGCC
We may then wish to refer to regions of these sequences, often with an associated strand. This is done with the GRanges
type. GRanges
builds on IRanges
, “integer ranges”. An IRanges has starts and ends. A GRanges additionally has sequence names and strand information.
range1 <- GRanges("chrI", IRanges(start=3,end=5), "+")
range1
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chrI 3-5 +
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
getSeq(myset, range1)
## DNAStringSet object of length 1:
## width seq
## [1] 3 GCG
range2 <- GRanges("chrI", IRanges(start=3,end=5), "-")
getSeq(myset, range2)
## DNAStringSet object of length 1:
## width seq
## [1] 3 CGC
Accessing GRanges data:
seqnames(range1)
## factor-Rle of length 1 with 1 run
## Lengths: 1
## Values : chrI
## Levels(1): chrI
start(range1)
## [1] 3
end(range1)
## [1] 5
width(range1)
## [1] 3
strand(range1)
## factor-Rle of length 1 with 1 run
## Lengths: 1
## Values : +
## Levels(3): + - *
as.data.frame(range1)
## seqnames start end width strand
## 1 chrI 3 5 3 +
Internals of an “S4” object such as a GRanges can be accessed using @
, but this is discouraged. It is better to use the accessor functions above. Observe the completions RStudio offers when you type range1@
.
# Look at completions for
# range1@
Further manipulations:
# GRanges are like vectors:
c(range1, range2)
## GRanges object with 2 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chrI 3-5 +
## [2] chrI 3-5 -
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
# GRanges can have metadata columns, and are often used like data frames:
mcols(range1)$wobble <- 10
range1
## GRanges object with 1 range and 1 metadata column:
## seqnames ranges strand | wobble
## <Rle> <IRanges> <Rle> | <numeric>
## [1] chrI 3-5 + | 10
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
mcols(range1)
## DataFrame with 1 row and 1 column
## wobble
## <numeric>
## 1 10
range1$wobble
## [1] 10
# A handy way to create a GRanges
as("chrI:3-5:+", "GRanges")
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chrI 3-5 +
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Based on what we saw for DNAString
, how can we learn more about using GRanges
and IRanges
objects?
DNA sequences are generally stored in FASTA format, a simple text format. These can be loaded with readDNAStringSet
from Biostrings
. Let’s load the genome of E. coli strain K-12, obtained from the Ensembl FTP site.
### The start of the .fa file looks like this:
# >Chromosome dna:chromosome chromosome:GCA_000800765.1:Chromosome:1:4558660:1
# AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTC
# TGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGG
# TCACTAAATACTTTAACCAATATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTAC
# ACAACATCCATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGT
# AACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAGCCCGCACCTGACAGTGCGGG
# CTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAAGTTCGGCGGT
# ...
seqs <- readDNAStringSet("r-bioc-files/Escherichia_coli_k_12.GCA_000800765.1.29.dna.genome.fa")
seqs
## DNAStringSet object of length 1:
## width seq names
## [1] 4558660 AGCTTTTCATTCTGACTGCAAC...AACGCCTTAGTAAGTATTTTTC Chromosome dna:ch...
# Our chromosome name is too verbose.
# Remove everything from the name after the first space.
names(seqs)
## [1] "Chromosome dna:chromosome chromosome:GCA_000800765.1:Chromosome:1:4558660:1"
names(seqs) <- sub(" .*","",names(seqs))
names(seqs)
## [1] "Chromosome"
Conversely, a DNAStringSet can also be written to a file with writeXStringSet
.
Genome annotations are available in a variety of text formats such as GFF3 and GTF. They can be loaded with the import
function from rtracklayer
. This GTF file is also from Ensembl, and gives the locations of the genes in the genome, and features within them.
### The start of the .gtf file looks like this:
# #!genome-build ASM80076v1
# #!genome-version GCA_000800765.1
# #!genome-date 2014-12
# #!genome-build-accession GCA_000800765.1
# #!genebuild-last-updated 2014-12
# Chromosome ena gene 190 255 . + . gene_id "ER3413_4519"; gene_version "1"; gene_name "thrL"; gene_source "ena"; gene_biotype "protein_coding";
# Chromosome ena transcript 190 255 . + . gene_id "ER3413_4519"; gene_version "1"; transcript_id "AIZ54182"; transcript_version "1"; gene_name "thrL"; gene_source "ena"; gene_biotype "protein_coding"; transcript_name "thrL-1"; transcript_source "ena"; transcript_biotype "protein_coding";
# Chromosome ena exon 190 255 . + . gene_id "ER3413_4519"; gene_version "1"; transcript_id "AIZ54182"; transcript_version "1"; exon_number "1"; gene_name "thrL"; gene_source "ena"; gene_biotype "protein_coding"; transcript_name "thrL-1"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AIZ54182-1"; exon_version "1";
# Chromosome ena CDS 190 252 . + 0 gene_id "ER3413_4519"; gene_version "1"; transcript_id "AIZ54182"; transcript_version "1"; exon_number "1"; gene_name "thrL"; gene_source "ena"; gene_biotype "protein_coding"; transcript_name "thrL-1"; transcript_source "ena"; transcript_biotype "protein_coding"; protein_id "AIZ54182"; protein_version "1";
# ...
features <- import("r-bioc-files/Escherichia_coli_k_12.GCA_000800765.1.29.gtf")
# Optional: just retain the columns of metadata we need
mcols(features) <- mcols(features)[,c("type","gene_name","gene_id","transcript_id","exon_id")]
features
## GRanges object with 24926 ranges and 5 metadata columns:
## seqnames ranges strand | type gene_name
## <Rle> <IRanges> <Rle> | <factor> <character>
## [1] Chromosome 190-255 + | gene thrL
## [2] Chromosome 190-255 + | transcript thrL
## [3] Chromosome 190-255 + | exon thrL
## [4] Chromosome 190-252 + | CDS thrL
## [5] Chromosome 190-192 + | start_codon thrL
## ... ... ... ... . ... ...
## [24922] Chromosome 4557950-4558636 + | transcript yjtD
## [24923] Chromosome 4557950-4558636 + | exon yjtD
## [24924] Chromosome 4557950-4558633 + | CDS yjtD
## [24925] Chromosome 4557950-4557952 + | start_codon yjtD
## [24926] Chromosome 4558634-4558636 + | stop_codon yjtD
## gene_id transcript_id exon_id
## <character> <character> <character>
## [1] ER3413_4519 <NA> <NA>
## [2] ER3413_4519 AIZ54182 <NA>
## [3] ER3413_4519 AIZ54182 AIZ54182-1
## [4] ER3413_4519 AIZ54182 <NA>
## [5] ER3413_4519 AIZ54182 <NA>
## ... ... ... ...
## [24922] ER3413_4514 AIZ54177 <NA>
## [24923] ER3413_4514 AIZ54177 AIZ54177-1
## [24924] ER3413_4514 AIZ54177 <NA>
## [24925] ER3413_4514 AIZ54177 <NA>
## [24926] ER3413_4514 AIZ54177 <NA>
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Conversely, a GRanges can be written to a file with export
.
Various objects have associated Seqinfo information, listing the chromosomes they refer to and their lengths. This allows some extra sanity checking, and is also necessary for some tasks.
seqinfo(features)
## Seqinfo object with 1 sequence from an unspecified genome; no seqlengths:
## seqnames seqlengths isCircular genome
## Chromosome NA NA <NA>
seqinfo(seqs)
## Seqinfo object with 1 sequence from an unspecified genome:
## seqnames seqlengths isCircular genome
## Chromosome 4558660 NA <NA>
myseqinfo <- seqinfo(seqs)
isCircular(myseqinfo) <- c(TRUE)
seqinfo(features) <- myseqinfo
One thing you need to care about is the genome version. We often refer to genome versions by their name in the UCSC genome browser. For example the two most recent versions of the human genome are “hg19” from 2009 and “hg38” from 2013.
Seqinfo(genome="hg19")
## Seqinfo object with 298 sequences (2 circular) from hg19 genome:
## seqnames seqlengths isCircular genome
## chr1 249250621 FALSE hg19
## chr2 243199373 FALSE hg19
## chr3 198022430 FALSE hg19
## chr4 191154276 FALSE hg19
## chr5 180915260 FALSE hg19
## ... ... ... ...
## chr21_gl383580_alt 74652 FALSE hg19
## chr21_gl383581_alt 116690 FALSE hg19
## chr22_gl383582_alt 162811 FALSE hg19
## chr22_gl383583_alt 96924 FALSE hg19
## chr22_kb663609_alt 74013 FALSE hg19
Seqinfo(genome="hg38")
## Seqinfo object with 595 sequences (1 circular) from hg38 genome:
## seqnames seqlengths isCircular genome
## chr1 248956422 FALSE hg38
## chr2 242193529 FALSE hg38
## chr3 198295559 FALSE hg38
## chr4 190214555 FALSE hg38
## chr5 181538259 FALSE hg38
## ... ... ... ...
## chr22_KN196486v1_alt 153027 FALSE hg38
## chr22_KQ458387v1_alt 155930 FALSE hg38
## chr22_KQ458388v1_alt 174749 FALSE hg38
## chr22_KQ759761v1_alt 145162 FALSE hg38
## chrX_KV766199v1_alt 188004 FALSE hg38
Notice the slightly different chromosome lengths! Features from one genome version will not be correctly located on a different genome version.
We can use these annotations to grab sequences from the genome.
feat <- features[4]
feat
## GRanges object with 1 range and 5 metadata columns:
## seqnames ranges strand | type gene_name gene_id
## <Rle> <IRanges> <Rle> | <factor> <character> <character>
## [1] Chromosome 190-252 + | CDS thrL ER3413_4519
## transcript_id exon_id
## <character> <character>
## [1] AIZ54182 <NA>
## -------
## seqinfo: 1 sequence (1 circular) from an unspecified genome
feat_seq <- getSeq(seqs, feat)
feat_seq
## DNAStringSet object of length 1:
## width seq
## [1] 63 ATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGC
translate(feat_seq)
## AAStringSet object of length 1:
## width seq
## [1] 21 MKRISTTITTTITITTGNGAG
The metadata columns let us query the GRanges, for example for a particular gene.
subset(features, gene_name == "lacA")
# Equivalently:
# features[features$gene_name == "lacA" & !is.na(features$gene_name)]
## GRanges object with 6 ranges and 5 metadata columns:
## seqnames ranges strand | type gene_name gene_id
## <Rle> <IRanges> <Rle> | <factor> <character> <character>
## [1] Chromosome 363147-363758 - | gene lacA ER3413_350
## [2] Chromosome 363147-363758 - | transcript lacA ER3413_350
## [3] Chromosome 363147-363758 - | exon lacA ER3413_350
## [4] Chromosome 363150-363758 - | CDS lacA ER3413_350
## [5] Chromosome 363756-363758 - | start_codon lacA ER3413_350
## [6] Chromosome 363147-363149 - | stop_codon lacA ER3413_350
## transcript_id exon_id
## <character> <character>
## [1] <NA> <NA>
## [2] AIZ53204 <NA>
## [3] AIZ53204 AIZ53204-1
## [4] AIZ53204 <NA>
## [5] AIZ53204 <NA>
## [6] AIZ53204 <NA>
## -------
## seqinfo: 1 sequence (1 circular) from an unspecified genome
Note: subset
is a generic R function. It is also similar to dplyr’s filter
. The second argument is special, in it you can refer to columns of the GRanges directly.
We could also get all features of a particular type.
cds <- subset(features, type == "CDS")
cds
# Equivalently:
# features[features$type == "CDS"]
## GRanges object with 4052 ranges and 5 metadata columns:
## seqnames ranges strand | type gene_name gene_id
## <Rle> <IRanges> <Rle> | <factor> <character> <character>
## [1] Chromosome 190-252 + | CDS thrL ER3413_4519
## [2] Chromosome 337-2796 + | CDS thrA ER3413_1
## [3] Chromosome 2801-3730 + | CDS thrB ER3413_2
## [4] Chromosome 3734-5017 + | CDS thrC ER3413_3
## [5] Chromosome 5234-5527 + | CDS yaaX ER3413_4
## ... ... ... ... . ... ... ...
## [4048] Chromosome 4553704-4555125 + | CDS creC ER3413_4511
## [4049] Chromosome 4555186-4556535 + | CDS creD ER3413_4512
## [4050] Chromosome 4556601-4557314 - | CDS arcA ER3413_4513
## [4051] Chromosome 4557410-4557547 + | CDS yjjY ER3413_4541
## [4052] Chromosome 4557950-4558633 + | CDS yjtD ER3413_4514
## transcript_id exon_id
## <character> <character>
## [1] AIZ54182 <NA>
## [2] AIZ50783 <NA>
## [3] AIZ51761 <NA>
## [4] AIZ52710 <NA>
## [5] AIZ53685 <NA>
## ... ... ...
## [4048] AIZ54174 <NA>
## [4049] AIZ54175 <NA>
## [4050] AIZ54176 <NA>
## [4051] AIZ54207 <NA>
## [4052] AIZ54177 <NA>
## -------
## seqinfo: 1 sequence (1 circular) from an unspecified genome
Various useful manipulations of individual ranges are defined.
?"intra-range-methods"
Note: How these make use of the strand is a little haphazard. For example flank() and resize() respect strand but shift() does not.
Earlier we translated a coding sequence. Coding sequences are terminated by a stop codon. Let’s extend the CDS feature to include this.
feat <- features[4]
feat_stop <- resize(feat, width(feat)+3)
seq_stop <- getSeq(seqs, feat_stop)
translate(seq_stop)
## AAStringSet object of length 1:
## width seq
## [1] 22 MKRISTTITTTITITTGNGAG*
resize
can fix either the fix="start"
or fix="end"
of the sequence.
flank
can be either flank the start (start=TRUE
) or end (start=FALSE
).
# 5' 3'
# input ------->
# . .
# resize(input, n, fix="start") --> .
# resize(input, n, fix="end") . -->
# . .
# flank(input, n, start=TRUE) -->. .
# flank(input, n, start=FALSE) . .-->
# . .
# promoters(input, n1,n2) --------> .
#
?"inter-range-methods"
?"nearest-methods"
?"setops-methods"
One compelling feature of GenomicRanges is that it is able to find overlapping ranges very quickly.
query <- as("Chromosome:9500-10000:+", "GRanges")
hits <- findOverlaps(query, features, ignore.strand=TRUE)
hits
## Hits object with 10 hits and 0 metadata columns:
## queryHits subjectHits
## <integer> <integer>
## [1] 1 49
## [2] 1 50
## [3] 1 51
## [4] 1 52
## [5] 1 54
## [6] 1 55
## [7] 1 56
## [8] 1 57
## [9] 1 58
## [10] 1 60
## -------
## queryLength: 1 / subjectLength: 24926
subjectHits(hits)
## [1] 49 50 51 52 54 55 56 57 58 60
features[subjectHits(hits)]
## GRanges object with 10 ranges and 5 metadata columns:
## seqnames ranges strand | type gene_name gene_id
## <Rle> <IRanges> <Rle> | <factor> <character> <character>
## [1] Chromosome 9306-9893 + | gene mog ER3413_8
## [2] Chromosome 9306-9893 + | transcript mog ER3413_8
## [3] Chromosome 9306-9893 + | exon mog ER3413_8
## [4] Chromosome 9306-9890 + | CDS mog ER3413_8
## [5] Chromosome 9891-9893 + | stop_codon mog ER3413_8
## [6] Chromosome 9928-10494 - | gene yaaH ER3413_9
## [7] Chromosome 9928-10494 - | transcript yaaH ER3413_9
## [8] Chromosome 9928-10494 - | exon yaaH ER3413_9
## [9] Chromosome 9931-10494 - | CDS yaaH ER3413_9
## [10] Chromosome 9928-9930 - | stop_codon yaaH ER3413_9
## transcript_id exon_id
## <character> <character>
## [1] <NA> <NA>
## [2] AIZ54621 <NA>
## [3] AIZ54621 AIZ54621-1
## [4] AIZ54621 <NA>
## [5] AIZ54621 <NA>
## [6] <NA> <NA>
## [7] AIZ54730 <NA>
## [8] AIZ54730 AIZ54730-1
## [9] AIZ54730 <NA>
## [10] AIZ54730 <NA>
## -------
## seqinfo: 1 sequence (1 circular) from an unspecified genome
findOverlaps(query, features, ignore.strand=FALSE)
## Hits object with 5 hits and 0 metadata columns:
## queryHits subjectHits
## <integer> <integer>
## [1] 1 49
## [2] 1 50
## [3] 1 51
## [4] 1 52
## [5] 1 54
## -------
## queryLength: 1 / subjectLength: 24926
With findOverlaps
, we can use genomic location as the key when joining disparate types of data, so this is an important tool for integrative analysis. See also the related functions nearest
, precede
, follow
, and distance
.
GenomicRanges
also provides:
range
- get a feature that spans from the start to the end of all features in a GRanges
.reduce
- merge overlapping features, so that the same bases are covered by a reduced collection of features.disjoin
- as with reduce, but broken at each start and end of the input features.setdiff
- subtracts one set of features from another, could be used with range on a set of exons to get introns. Might need to use GenomicRanges::setdiff
if also using dplyr
.# input
# --------->
# ----------------->
# ------>
# ---->
#
# range -------------------------------->
#
# reduce -----------------------> ---->
#
# disjoin ---->---->-->------>---> ---->
#
# setdiff(range(input),input) --->
What are E. coli’s most common start and stop codons?
The start codon is the first three bases of the CDS, and the stop codon is the three bases following the end of the CDS.
Hint: Recall that we could get all CDS ranges with:
cds <- subset(features, type == "CDS")
Hint: Use resize(..., ..., fix="start")
and flank(..., ..., start=FALSE)
to manipulate these ranges.
Hint: Use table()
to count occurrences of different strings.
We’ve been using our genomic features as one big unstructured GRanges. This is messy. Furthermore, eukaryote genes contain exons and introns, introducing complications not seen in bacteria.
Let’s look at the features associated with the gene “lacY”.
subset(features, gene_name == "lacY")
## GRanges object with 6 ranges and 5 metadata columns:
## seqnames ranges strand | type gene_name gene_id
## <Rle> <IRanges> <Rle> | <factor> <character> <character>
## [1] Chromosome 363824-365077 - | gene lacY ER3413_351
## [2] Chromosome 363824-365077 - | transcript lacY ER3413_351
## [3] Chromosome 363824-365077 - | exon lacY ER3413_351
## [4] Chromosome 363827-365077 - | CDS lacY ER3413_351
## [5] Chromosome 365075-365077 - | start_codon lacY ER3413_351
## [6] Chromosome 363824-363826 - | stop_codon lacY ER3413_351
## transcript_id exon_id
## <character> <character>
## [1] <NA> <NA>
## [2] AIZ53215 <NA>
## [3] AIZ53215 AIZ53215-1
## [4] AIZ53215 <NA>
## [5] AIZ53215 <NA>
## [6] AIZ53215 <NA>
## -------
## seqinfo: 1 sequence (1 circular) from an unspecified genome
Look at the different types in the “type” column. Each “gene” may have multiple “transcript” features (isoforms). Each transcript in turn has a set of “exon” features, and if it is a protein coding gene, a set of “CDS” (coding sequence) features. The CDS features cover a subset of the bases covered by the exon features.
# --------------------------------------------------> gene
#
# --------------------------------------------> transcript
# ----------> ---> ----------------> exon
# ----> ---> ----------> CDS
#
#
# -----------------------------------> transcript
# --------> ----> ----------> exon
# ---> ----> --> CDS
txdb <- makeTxDbFromGRanges(features)
txdb
## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Genome: NA
## # Nb of transcripts: 4257
## # Db created by: GenomicFeatures package from Bioconductor
## # Creation time: 2021-03-24 14:05:15 +1100 (Wed, 24 Mar 2021)
## # GenomicFeatures version at creation time: 1.42.2
## # RSQLite version at creation time: 2.2.4
## # DBSCHEMAVERSION: 1.2
txdb
is a TxDb
object.
genes(txdb)
## GRanges object with 4257 ranges and 1 metadata column:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## ER3413_1 Chromosome 337-2799 + | ER3413_1
## ER3413_10 Chromosome 10643-11356 - | ER3413_10
## ER3413_100 Chromosome 113444-114487 + | ER3413_100
## ER3413_1000 Chromosome 1033581-1034438 + | ER3413_1000
## ER3413_1001 Chromosome 1034572-1036116 + | ER3413_1001
## ... ... ... ... . ...
## ER3413_995 Chromosome 1028971-1030089 + | ER3413_995
## ER3413_996 Chromosome 1030086-1031879 + | ER3413_996
## ER3413_997 Chromosome 1031898-1032605 + | ER3413_997
## ER3413_998 Chromosome 1032602-1033189 + | ER3413_998
## ER3413_999 Chromosome 1033186-1033584 + | ER3413_999
## -------
## seqinfo: 1 sequence (1 circular) from an unspecified genome
transcriptsBy(txdb, by="gene")
## GRangesList object of length 4257:
## $ER3413_1
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | tx_id tx_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] Chromosome 337-2799 + | 2 AIZ50783
## -------
## seqinfo: 1 sequence (1 circular) from an unspecified genome
##
## $ER3413_10
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | tx_id tx_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] Chromosome 10643-11356 - | 2087 AIZ50784
## -------
## seqinfo: 1 sequence (1 circular) from an unspecified genome
##
## $ER3413_100
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | tx_id tx_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] Chromosome 113444-114487 + | 62 AIZ50785
## -------
## seqinfo: 1 sequence (1 circular) from an unspecified genome
##
## ...
## <4254 more elements>
exonsBy(txdb, use.names=TRUE)
## GRangesList object of length 4257:
## $AIZ54182
## GRanges object with 1 range and 3 metadata columns:
## seqnames ranges strand | exon_id exon_name exon_rank
## <Rle> <IRanges> <Rle> | <integer> <character> <integer>
## [1] Chromosome 190-255 + | 1 AIZ54182-1 1
## -------
## seqinfo: 1 sequence (1 circular) from an unspecified genome
##
## $AIZ50783
## GRanges object with 1 range and 3 metadata columns:
## seqnames ranges strand | exon_id exon_name exon_rank
## <Rle> <IRanges> <Rle> | <integer> <character> <integer>
## [1] Chromosome 337-2799 + | 2 AIZ50783-1 1
## -------
## seqinfo: 1 sequence (1 circular) from an unspecified genome
##
## $AIZ51761
## GRanges object with 1 range and 3 metadata columns:
## seqnames ranges strand | exon_id exon_name exon_rank
## <Rle> <IRanges> <Rle> | <integer> <character> <integer>
## [1] Chromosome 2801-3733 + | 3 AIZ51761-1 1
## -------
## seqinfo: 1 sequence (1 circular) from an unspecified genome
##
## ...
## <4254 more elements>
cdsBy(txdb, use.names=TRUE)
## GRangesList object of length 4051:
## $AIZ54182
## GRanges object with 1 range and 3 metadata columns:
## seqnames ranges strand | cds_id cds_name exon_rank
## <Rle> <IRanges> <Rle> | <integer> <character> <integer>
## [1] Chromosome 190-255 + | 1 <NA> 1
## -------
## seqinfo: 1 sequence (1 circular) from an unspecified genome
##
## $AIZ50783
## GRanges object with 1 range and 3 metadata columns:
## seqnames ranges strand | cds_id cds_name exon_rank
## <Rle> <IRanges> <Rle> | <integer> <character> <integer>
## [1] Chromosome 337-2799 + | 2 <NA> 1
## -------
## seqinfo: 1 sequence (1 circular) from an unspecified genome
##
## $AIZ51761
## GRanges object with 1 range and 3 metadata columns:
## seqnames ranges strand | cds_id cds_name exon_rank
## <Rle> <IRanges> <Rle> | <integer> <character> <integer>
## [1] Chromosome 2801-3733 + | 3 <NA> 1
## -------
## seqinfo: 1 sequence (1 circular) from an unspecified genome
##
## ...
## <4048 more elements>
cds_ranges <- cdsBy(txdb, use.names=TRUE)
cds_ranges$AIZ54182
## GRanges object with 1 range and 3 metadata columns:
## seqnames ranges strand | cds_id cds_name exon_rank
## <Rle> <IRanges> <Rle> | <integer> <character> <integer>
## [1] Chromosome 190-255 + | 1 <NA> 1
## -------
## seqinfo: 1 sequence (1 circular) from an unspecified genome
cds_ranges[[1]]
## GRanges object with 1 range and 3 metadata columns:
## seqnames ranges strand | cds_id cds_name exon_rank
## <Rle> <IRanges> <Rle> | <integer> <character> <integer>
## [1] Chromosome 190-255 + | 1 <NA> 1
## -------
## seqinfo: 1 sequence (1 circular) from an unspecified genome
unlist(cds_ranges)
## GRanges object with 4052 ranges and 3 metadata columns:
## seqnames ranges strand | cds_id cds_name exon_rank
## <Rle> <IRanges> <Rle> | <integer> <character> <integer>
## AIZ54182 Chromosome 190-255 + | 1 <NA> 1
## AIZ50783 Chromosome 337-2799 + | 2 <NA> 1
## AIZ51761 Chromosome 2801-3733 + | 3 <NA> 1
## AIZ52710 Chromosome 3734-5020 + | 4 <NA> 1
## AIZ53685 Chromosome 5234-5530 + | 5 <NA> 1
## ... ... ... ... . ... ... ...
## AIZ54159 Chromosome 4541153-4541797 - | 4048 <NA> 1
## AIZ54165 Chromosome 4545863-4547530 - | 4049 <NA> 1
## AIZ54168 Chromosome 4550241-4550753 - | 4050 <NA> 1
## AIZ54169 Chromosome 4551449-4552318 - | 4051 <NA> 1
## AIZ54176 Chromosome 4556598-4557314 - | 4052 <NA> 1
## -------
## seqinfo: 1 sequence (1 circular) from an unspecified genome
cds_ranges
is a GRangesList
. That is, a list containing GRanges objects. As this is a bacteria, each transcript only has a single range, but in a eukaryote each transcript could consist of many ranges. To get the transcript sequence or the coding sequence, these each need to be retrieved and then concatenated together. extractTranscriptSeqs
can do this for us.
extractTranscriptSeqs(seqs, cds_ranges)
## DNAStringSet object of length 4051:
## width seq names
## [1] 66 ATGAAACGCATTAGCACCACCA...ACAGGTAACGGTGCGGGCTGA AIZ54182
## [2] 2463 ATGCGAGTGTTGAAGTTCGGCG...TCATGGAAGTTAGGAGTCTGA AIZ50783
## [3] 933 ATGGTTAAAGTTTATGCCCCGG...GCACGAGTACTGGAAAACTAA AIZ51761
## [4] 1287 ATGAAACTCTACAATCTGAAAG...TTGATGATGAATCATCAGTAA AIZ52710
## [5] 297 GTGAAAAAGATGCAATCTATCG...CCAGGCAAACATCACCGCTAA AIZ53685
## ... ... ...
## [4047] 645 ATGGCTCGCACAAAACTGAAAT...GAAAGCGAGAAAAAAGAGTGA AIZ54159
## [4048] 1668 GTGGCTCAATTCGTTTATACCA...TACAAGCGTATTGCGAAGTAA AIZ54165
## [4049] 513 ATGCACCAAGTTGTCTGTGCGA...TTTCATAATGCCGTGTATTAA AIZ54168
## [4050] 870 ATGGATCAGGCCGGCATTATTC...CTGATTCCGATCCGTCGTTAA AIZ54169
## [4051] 717 ATGCAGACCCCGCACATTCTTA...TGCGGTGATCTGGAAGATTAA AIZ54176
There’s much more to explore here. Have a look at the documentation for the GenomicFeatures
and ensembldb
packages. To annotate a set of genomic features such as peaks from a ChIP-seq experiment, see for example the ChIPseeker
package.
Besides software, Bioconductor includes various types of data. An important type of data is data describing model organisms. This is either supplied through data packages or through the newer AnnotationHub system. It is generally derived from these central repositories:
UCSC was the original web-based genome browser. UCSC’s “KnownGene” gene annotations used to be the cutting edge gene annotation source, but UCSC now relies on other sources for gene annotations. Many file types that remain important today were developed for the UCSC genome browser, such as “bed”, “bigWig”, and “2bit”.
These organizations will generally obtain genome assemblies from the same ultimate sources. For example, all of the above use the Genome Reference Consortium’s GRCh38 DNA sequence for homo sapiens. UCSC calls this “hg38” but it is the same DNA sequence. These DNA sequences serve as a common frame of reference. However the three organizations above will differ on their exact set of gene and transcript annotations, and use different gene and transcript ID systems.
Genome assemblies are released infrequently. GRCh38 (hg38) was released in 2013. The previous assembly, GRCh37 (hg19) was released in 2009. Some people haven’t updated yet, you will find plenty of data using “hg19” positions! Gene and transcript annotations are updated far more frequently.
As well as the chromosomes in the “primary assembly” a genome assembly may have further sequences, which may have been added after the initial release:
BSgenome.Hsapiens.UCSC.hg38 Biostrings genome, Homo sapiens, from the UCSC browser, version hg38. DNA for chromosomes, usable in the same way as the DNAStringSet used above.
TxDb.Hsapiens.UCSC.hg38.knownGene Transcript database, Homo sapiens, from UCSC browser, genome verison hg38, “knownGene” gene annotations. GRanges information for genes and transcripts, much as we loaded from a GTF file above.
org.Hs.eg.db Organism Homo sapiens, primary key is Entrez Gene, database. Translation of gene ids from various databases, assignment to GO terms, KEGG pathways, etc. Entrez Gene ids are used as the primary key.
GO.db GO Gene Ontology term descriptions and relationships. However to find which terms apply to specific genes, use the GOALL column in the relevant species’ organism database.
BioMart servers accessed using the biomaRt
package are another way to get information such as translation of gene ids, gene sets, and gene information.
Beware replicability. BioMart servers are not guaranteed to be available forever, and BioMart doesn’t have a clear story around versioning.
AnnotationHub is a way to retrieve data from a more comprehensive set of organisms and data providers than are provided as individual Bioconductor packages. The retrieved data is returned in an appropriate Bioconductor type. If data is being updated over time (eg improved annotation of a genome), each version receives a unique ID in AnnotationHub, making it possible to write reproducable analyses.
Files are cached, so they will only be downloaded once.
In the example below, the yeast genome and Ensembl annotations are retrieved:
library(AnnotationHub)
ah <- AnnotationHub()
# ah contains a large collection of records that can be retrieved
ah
length(ah)
colnames( mcols(ah) )
table( ah$rdataclass )
# There is an interactive Shiny search interface
display(ah)
# query() searches for terms in an unstructured way
records <- query(ah, c("Ensembl", "96", "Saccharomyces cerevisiae"))
records
mcols(records)
mcols(records)[,c("title","rdataclass")]
# Having located records of interest,
# your R script can refer to the specific AH... record,
# so it always uses the same version of the data.
sc_genome <- ah[["AH70449"]]
sc_granges <- ah[["AH69700"]]
sc_txdb <- ah[["AH69265"]]
# sc_genome is a TwoBitFile
# Can use getSeq on it without loading everything into memory
seqinfo(sc_genome)
getSeq(sc_genome, as("IV:1-10","GRanges"))
import(sc_genome)
# An OrgDb contains information about genes in an organism, and lets you map between different identifiers
query(ah, c("OrgDb", "Saccharomyces cerevisiae"))
sc_orgdb <- ah[["AH84129"]]
columns(sc_orgdb)
head( keys(sc_orgdb, "ENSEMBL") )
select(sc_orgdb, "YFL039C", keytype="ENSEMBL", columns=c("GENENAME", "DESCRIPTION"))
# As well as IDs, genes have short, easy to remember "symbols" (also often called "names")
# We can use the OrgDb to look up gene IDs from symbols
# Notice a problem here!
select(sc_orgdb, c("ACT1", "COS2"), keytype="GENENAME", columns=c("ENSEMBL"))
AGGAGGU is the Shine-Dalgarno sequence, which assists binding of the ribosome to a transcript.
vmatchPattern("AGGAGGT", seqs)
## MIndex object of length 1
## $Chromosome
## IRanges object with 63 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 56593 56599 7
## [2] 67347 67353 7
## [3] 226876 226882 7
## [4] 229408 229414 7
## [5] 241665 241671 7
## ... ... ... ...
## [59] 4312631 4312637 7
## [60] 4371930 4371936 7
## [61] 4410503 4410509 7
## [62] 4420666 4420672 7
## [63] 4484025 4484031 7
vmatchPattern
is strand specific. If we want matches on the reverse strand we need to also:
vmatchPattern(reverseComplement(DNAString("AGGAGGT")), seqs)
## MIndex object of length 1
## $Chromosome
## IRanges object with 76 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## [1] 59133 59139 7
## [2] 125294 125300 7
## [3] 136473 136479 7
## [4] 226640 226646 7
## [5] 266770 266776 7
## ... ... ... ...
## [72] 4139844 4139850 7
## [73] 4181244 4181250 7
## [74] 4241083 4241089 7
## [75] 4397026 4397032 7
## [76] 4473495 4473501 7
Demanding an exact match here is overly strict. vmatchPattern
has arguments allowing inexact matches. Alternatively, there is a similar function for searching for a Position Weight Matrix pattern, matchPWM
.
The following will search both strands, allowing one mismatch, and produce the result in convenient GRanges form:
query <- DNAString("AGGAGGT")
max.mismatch <- 1
fwd <- vmatchPattern(query, seqs, max.mismatch=max.mismatch)
fwd <- as(fwd, "GRanges")
strand(fwd) <- "+"
rev <- vmatchPattern(reverseComplement(query), seqs, max.mismatch=max.mismatch)
rev <- as(rev, "GRanges")
strand(rev) <- "-"
complete <- c(fwd, rev)
complete
## GRanges object with 7534 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] Chromosome 323-329 +
## [2] Chromosome 3540-3546 +
## [3] Chromosome 3765-3771 +
## [4] Chromosome 5374-5380 +
## [5] Chromosome 7641-7647 +
## ... ... ... ...
## [7530] Chromosome 4550281-4550287 -
## [7531] Chromosome 4551603-4551609 -
## [7532] Chromosome 4551732-4551738 -
## [7533] Chromosome 4552223-4552229 -
## [7534] Chromosome 4552751-4552757 -
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
# Write to GFF file
export(complete, "motif-matches.gff")
We might then view this in the IGV genome browser:
Let’s try to “discover” the Shine-Dalgarno sequence for ourselves.
# Note: bacteria do not have introns
# In a eukaryote we would need to work with transcript
# sequences (extractTranscriptSeqs) and work out where
# the CDSs start within them.
size <- 20
initiation_regions <- flank(cds, size, start=TRUE)
initiation_seqs <- getSeq(seqs, initiation_regions)
names(initiation_seqs) <- initiation_regions$gene_id
# Look for any composition bias
library(seqLogo)
letter_counts <- consensusMatrix(initiation_seqs)
probs <- prop.table(letter_counts[1:4,], 2)
seqLogo(probs, ic.scale=FALSE)
seqLogo(probs)
A popular suite of motif discovery programs is MEME. We can use MEME after writing the sequences to a file.
writeXStringSet(initiation_seqs, "init_seqs.fasta")
MEME is command-line software. Bioinformatics software, including MEME, often requires a UNIX environment such as Linux or Mac OS. Windows users may need to obtain access to a Linux server or run Linux in a “container”. To run MEME we need to type commands in a “terminal” window. This is a whole other workshop!
Bioinformatics software is often most easily installed using conda. To minimize the number of default packages installed, use “Miniconda”. Once you have installed Miniconda, you can activate conda and install MEME with:
### in a terminal ###
# conda activate
# conda install -c bioconda meme
Now we can run meme
with:
### in a terminal ###
# meme -dna init_seqs.fasta
Running meme
will produce this output.
Which genes have close matches to AGGAGGT (as we found earlier) immediately upstrand of their CDS?
Check some of the genes you find using IGV. (Once the E. coli FASTA file is loaded as the genome and the E. coli GTF file is loaded, you can search for a gene by typing its name in the location box.)
sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Mojave 10.14.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] GenomicFeatures_1.42.2 AnnotationDbi_1.52.0 Biobase_2.50.0
## [4] BSgenome_1.58.0 rtracklayer_1.50.0 GenomicRanges_1.42.0
## [7] GenomeInfoDb_1.26.4 Biostrings_2.58.0 XVector_0.30.0
## [10] IRanges_2.24.1 S4Vectors_0.28.1 BiocGenerics_0.36.0
##
## loaded via a namespace (and not attached):
## [1] MatrixGenerics_1.2.1 httr_1.4.2
## [3] sass_0.3.1 bit64_4.0.5
## [5] jsonlite_1.7.2 bslib_0.2.4
## [7] assertthat_0.2.1 askpass_1.1
## [9] BiocFileCache_1.14.0 blob_1.2.1
## [11] GenomeInfoDbData_1.2.4 Rsamtools_2.6.0
## [13] yaml_2.2.1 progress_1.2.2
## [15] pillar_1.5.1 RSQLite_2.2.4
## [17] lattice_0.20-41 glue_1.4.2
## [19] digest_0.6.27 htmltools_0.5.1.1
## [21] Matrix_1.3-2 XML_3.99-0.5
## [23] pkgconfig_2.0.3 biomaRt_2.46.3
## [25] zlibbioc_1.36.0 purrr_0.3.4
## [27] BiocParallel_1.24.1 tibble_3.1.0
## [29] openssl_1.4.3 generics_0.1.0
## [31] ellipsis_0.3.1 cachem_1.0.4
## [33] SummarizedExperiment_1.20.0 magrittr_2.0.1
## [35] crayon_1.4.1 memoise_2.0.0
## [37] evaluate_0.14 fansi_0.4.2
## [39] xml2_1.3.2 tools_4.0.3
## [41] prettyunits_1.1.1 hms_1.0.0
## [43] lifecycle_1.0.0 matrixStats_0.58.0
## [45] stringr_1.4.0 DelayedArray_0.16.2
## [47] compiler_4.0.3 jquerylib_0.1.3
## [49] rlang_0.4.10 grid_4.0.3
## [51] RCurl_1.98-1.2 rappdirs_0.3.3
## [53] bitops_1.0-6 rmarkdown_2.7
## [55] DBI_1.1.1 curl_4.3
## [57] R6_2.5.0 GenomicAlignments_1.26.0
## [59] knitr_1.31 dplyr_1.0.5
## [61] fastmap_1.1.0 bit_4.0.4
## [63] utf8_1.2.1 stringi_1.5.3
## [65] Rcpp_1.0.6 vctrs_0.3.6
## [67] dbplyr_2.1.0 tidyselect_1.1.0
## [69] xfun_0.22