This workshop looks at working with sequences, primarily DNA sequences, and genomic features.

Monash Data Fluency’s introductory R workshop focusses on “tidy” data analysis, which emphasizes storing all data in data frames. Bioconductor represents a different strand of current development in R. Bioconductor uses a great variety of data types. It’s the very opposite of tidy!

Nevertheless, Bioconductor is overwhelmingly comprehensive, and represents the most complete environment available for working with bioinformatic data currently available.

Bioconductor packages are installed using the BiocManager package. For example:

# Install BiocManager from CRAN package repository
install.packages("BiocManager")

# BiocManager can then install packages from Bioconductor package repository
BiocManager::install("Biostrings")
#or
library(BiocManager)
install("Biostrings")

Bioconductor packages usually have useful documentation in the form of “vignettes”. These are readable on the Bioconductor website, from the RStudio “help” pane, or within R:

vignette()
vignette(package="Biostrings")
vignette("BiostringsQuickOverview", package="Biostrings")

# or browse https://bioconductor.org/packages/release/BiocViews.html#___Software

1 DNA sequences and genomic ranges

Load packages we will be using:

library(Biostrings)      # Provides DNAString, DNAStringSet, etc
library(BSgenome)        # Provides getSeq()
library(GenomicRanges)   # Provides GRanges containing genomic ranges, etc
library(GenomicFeatures) # Provides TxDb objects containing genes/transcripts/exons
library(rtracklayer)     # Provides import() and export()

1.1 DNAString

Package Biostrings offers classes for storing DNA strings, DNAString, amino acid sequences, AAString, or anything else in a BString. These are like character strings, but a variety of biologically meaningful functions can be applied to them.

myseq <- DNAString("CCGCGCACCAAC")
myseq

## 12-letter DNAString object
## seq: CCGCGCACCAAC

class(myseq)

## [1] "DNAString"
## attr(,"package")
## [1] "Biostrings"

reverseComplement(myseq)

## 12-letter DNAString object
## seq: GTTGGTGCGCGG

translate(myseq)

## 4-letter AAString object
## seq: PRTN

subseq(myseq, 3,5)

## 3-letter DNAString object
## seq: GCG

myseq[3:5]

## 3-letter DNAString object
## seq: GCG

as.character(myseq)

## [1] "CCGCGCACCAAC"

You can see a complete set of functions that work with DNAString with:

methods(class="DNAString")

You can get help on the DNAString class with:

?"DNAString-class"

1.2 DNAStringSet

Often we want to work with a list of sequences, such as chromosomes.

myset <- DNAStringSet( list(chrI=myseq, chrII=DNAString("ACGTACGT")) )
myset

## DNAStringSet object of length 2:
##     width seq                                               names               
## [1]    12 CCGCGCACCAAC                                      chrI
## [2]     8 ACGTACGT                                          chrII

# A DNAStringSet is list-like
myset$chrII

## 8-letter DNAString object
## seq: ACGTACGT

# or myset[["chrII"]]
# or myset[[2]]

1.3 Challenge

Reverse complement the following DNA sequence and then translate to an amino acid sequence:

GCTTTCGTTTTCGCC

1.4 GRanges

We may then wish to refer to regions of these sequences, often with an associated strand. This is done with the GRanges type. GRanges builds on IRanges, “integer ranges”. An IRanges has starts and ends. A GRanges additionally has sequence names and strand information.

range1 <- GRanges("chrI", IRanges(start=3,end=5), "+")
range1

## GRanges object with 1 range and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chrI       3-5      +
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

getSeq(myset, range1)

## DNAStringSet object of length 1:
##     width seq
## [1]     3 GCG

range2 <- GRanges("chrI", IRanges(start=3,end=5), "-")
getSeq(myset, range2)

## DNAStringSet object of length 1:
##     width seq
## [1]     3 CGC

Accessing GRanges data:

seqnames(range1)

## factor-Rle of length 1 with 1 run
##   Lengths:    1
##   Values : chrI
## Levels(1): chrI

start(range1)

## [1] 3

end(range1)

## [1] 5

width(range1)

## [1] 3

strand(range1)

## factor-Rle of length 1 with 1 run
##   Lengths: 1
##   Values : +
## Levels(3): + - *

as.data.frame(range1)

##   seqnames start end width strand
## 1     chrI     3   5     3      +

Internals of an “S4” object such as a GRanges can be accessed using @, but this is discouraged. It is better to use the accessor functions above. Observe the completions RStudio offers when you type range1@.

# Look at completions for
# range1@

Further manipulations:

# GRanges are like vectors:
c(range1, range2)

## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chrI       3-5      +
##   [2]     chrI       3-5      -
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

# GRanges can have metadata columns, and are often used like data frames:
mcols(range1)$wobble <- 10
range1

## GRanges object with 1 range and 1 metadata column:
##       seqnames    ranges strand |    wobble
##          <Rle> <IRanges>  <Rle> | <numeric>
##   [1]     chrI       3-5      + |        10
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

mcols(range1)

## DataFrame with 1 row and 1 column
##      wobble
##   <numeric>
## 1        10

range1$wobble

## [1] 10

# A handy way to create a GRanges
as("chrI:3-5:+", "GRanges")

## GRanges object with 1 range and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chrI       3-5      +
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

1.5 Question

Based on what we saw for DNAString, how can we learn more about using GRanges and IRanges objects?

2 Loading files

2.1 Loading sequences

DNA sequences are generally stored in FASTA format, a simple text format. These can be loaded with readDNAStringSet from Biostrings. Let’s load the genome of E. coli strain K-12, obtained from the Ensembl FTP site.

### The start of the .fa file looks like this:
# >Chromosome dna:chromosome chromosome:GCA_000800765.1:Chromosome:1:4558660:1
# AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTC
# TGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGG
# TCACTAAATACTTTAACCAATATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTAC
# ACAACATCCATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGT
# AACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAGCCCGCACCTGACAGTGCGGG
# CTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAAGTTCGGCGGT
# ...

seqs <- readDNAStringSet("r-bioc-files/Escherichia_coli_k_12.GCA_000800765.1.29.dna.genome.fa")
seqs

## DNAStringSet object of length 1:
##       width seq                                             names               
## [1] 4558660 AGCTTTTCATTCTGACTGCAAC...AACGCCTTAGTAAGTATTTTTC Chromosome dna:ch...

# Our chromosome name is too verbose.
# Remove everything from the name after the first space.
names(seqs)

## [1] "Chromosome dna:chromosome chromosome:GCA_000800765.1:Chromosome:1:4558660:1"

names(seqs) <- sub(" .*","",names(seqs))
names(seqs)

## [1] "Chromosome"

Conversely, a DNAStringSet can also be written to a file with writeXStringSet.

2.2 Loading features

Genome annotations are available in a variety of text formats such as GFF3 and GTF. They can be loaded with the import function from rtracklayer. This GTF file is also from Ensembl, and gives the locations of the genes in the genome, and features within them.

### The start of the .gtf file looks like this:
# #!genome-build ASM80076v1
# #!genome-version GCA_000800765.1
# #!genome-date 2014-12
# #!genome-build-accession GCA_000800765.1
# #!genebuild-last-updated 2014-12
# Chromosome      ena     gene    190     255     .       +       .       gene_id "ER3413_4519"; gene_version "1"; gene_name "thrL"; gene_source "ena"; gene_biotype "protein_coding";
# Chromosome      ena     transcript      190     255     .       +       .       gene_id "ER3413_4519"; gene_version "1"; transcript_id "AIZ54182"; transcript_version "1"; gene_name "thrL"; gene_source "ena"; gene_biotype "protein_coding"; transcript_name "thrL-1"; transcript_source "ena"; transcript_biotype "protein_coding";
# Chromosome      ena     exon    190     255     .       +       .       gene_id "ER3413_4519"; gene_version "1"; transcript_id "AIZ54182"; transcript_version "1"; exon_number "1"; gene_name "thrL"; gene_source "ena"; gene_biotype "protein_coding"; transcript_name "thrL-1"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AIZ54182-1"; exon_version "1";
# Chromosome      ena     CDS     190     252     .       +       0       gene_id "ER3413_4519"; gene_version "1"; transcript_id "AIZ54182"; transcript_version "1"; exon_number "1"; gene_name "thrL"; gene_source "ena"; gene_biotype "protein_coding"; transcript_name "thrL-1"; transcript_source "ena"; transcript_biotype "protein_coding"; protein_id "AIZ54182"; protein_version "1";
# ...

features <- import("r-bioc-files/Escherichia_coli_k_12.GCA_000800765.1.29.gtf")

# Optional: just retain the columns of metadata we need
mcols(features) <- mcols(features)[,c("type","gene_name","gene_id","transcript_id","exon_id")]

features

## GRanges object with 24926 ranges and 5 metadata columns:
##             seqnames          ranges strand |        type   gene_name
##                <Rle>       <IRanges>  <Rle> |    <factor> <character>
##       [1] Chromosome         190-255      + | gene               thrL
##       [2] Chromosome         190-255      + | transcript         thrL
##       [3] Chromosome         190-255      + | exon               thrL
##       [4] Chromosome         190-252      + | CDS                thrL
##       [5] Chromosome         190-192      + | start_codon        thrL
##       ...        ...             ...    ... .         ...         ...
##   [24922] Chromosome 4557950-4558636      + | transcript         yjtD
##   [24923] Chromosome 4557950-4558636      + | exon               yjtD
##   [24924] Chromosome 4557950-4558633      + | CDS                yjtD
##   [24925] Chromosome 4557950-4557952      + | start_codon        yjtD
##   [24926] Chromosome 4558634-4558636      + | stop_codon         yjtD
##               gene_id transcript_id     exon_id
##           <character>   <character> <character>
##       [1] ER3413_4519          <NA>        <NA>
##       [2] ER3413_4519      AIZ54182        <NA>
##       [3] ER3413_4519      AIZ54182  AIZ54182-1
##       [4] ER3413_4519      AIZ54182        <NA>
##       [5] ER3413_4519      AIZ54182        <NA>
##       ...         ...           ...         ...
##   [24922] ER3413_4514      AIZ54177        <NA>
##   [24923] ER3413_4514      AIZ54177  AIZ54177-1
##   [24924] ER3413_4514      AIZ54177        <NA>
##   [24925] ER3413_4514      AIZ54177        <NA>
##   [24926] ER3413_4514      AIZ54177        <NA>
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Conversely, a GRanges can be written to a file with export.

2.3 Seqinfo and genome versions

Various objects have associated Seqinfo information, listing the chromosomes they refer to and their lengths. This allows some extra sanity checking, and is also necessary for some tasks.

seqinfo(features)

## Seqinfo object with 1 sequence from an unspecified genome; no seqlengths:
##   seqnames   seqlengths isCircular genome
##   Chromosome         NA         NA   <NA>

seqinfo(seqs)

## Seqinfo object with 1 sequence from an unspecified genome:
##   seqnames   seqlengths isCircular genome
##   Chromosome    4558660         NA   <NA>

myseqinfo <- seqinfo(seqs)
isCircular(myseqinfo) <- c(TRUE)

seqinfo(features) <- myseqinfo

One thing you need to care about is the genome version. We often refer to genome versions by their name in the UCSC genome browser. For example the two most recent versions of the human genome are “hg19” from 2009 and “hg38” from 2013.

Seqinfo(genome="hg19")

## Seqinfo object with 298 sequences (2 circular) from hg19 genome:
##   seqnames           seqlengths isCircular genome
##   chr1                249250621      FALSE   hg19
##   chr2                243199373      FALSE   hg19
##   chr3                198022430      FALSE   hg19
##   chr4                191154276      FALSE   hg19
##   chr5                180915260      FALSE   hg19
##   ...                       ...        ...    ...
##   chr21_gl383580_alt      74652      FALSE   hg19
##   chr21_gl383581_alt     116690      FALSE   hg19
##   chr22_gl383582_alt     162811      FALSE   hg19
##   chr22_gl383583_alt      96924      FALSE   hg19
##   chr22_kb663609_alt      74013      FALSE   hg19

Seqinfo(genome="hg38")

## Seqinfo object with 595 sequences (1 circular) from hg38 genome:
##   seqnames             seqlengths isCircular genome
##   chr1                  248956422      FALSE   hg38
##   chr2                  242193529      FALSE   hg38
##   chr3                  198295559      FALSE   hg38
##   chr4                  190214555      FALSE   hg38
##   chr5                  181538259      FALSE   hg38
##   ...                         ...        ...    ...
##   chr22_KN196486v1_alt     153027      FALSE   hg38
##   chr22_KQ458387v1_alt     155930      FALSE   hg38
##   chr22_KQ458388v1_alt     174749      FALSE   hg38
##   chr22_KQ759761v1_alt     145162      FALSE   hg38
##   chrX_KV766199v1_alt      188004      FALSE   hg38

Notice the slightly different chromosome lengths! Features from one genome version will not be correctly located on a different genome version.

2.4 Getting sequences

We can use these annotations to grab sequences from the genome.

feat <- features[4]
feat

## GRanges object with 1 range and 5 metadata columns:
##         seqnames    ranges strand |     type   gene_name     gene_id
##            <Rle> <IRanges>  <Rle> | <factor> <character> <character>
##   [1] Chromosome   190-252      + |      CDS        thrL ER3413_4519
##       transcript_id     exon_id
##         <character> <character>
##   [1]      AIZ54182        <NA>
##   -------
##   seqinfo: 1 sequence (1 circular) from an unspecified genome

feat_seq <- getSeq(seqs, feat)
feat_seq

## DNAStringSet object of length 1:
##     width seq
## [1]    63 ATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGC

translate(feat_seq)

## AAStringSet object of length 1:
##     width seq
## [1]    21 MKRISTTITTTITITTGNGAG

The metadata columns let us query the GRanges, for example for a particular gene.

subset(features, gene_name == "lacA")
# Equivalently:
#   features[features$gene_name == "lacA" & !is.na(features$gene_name)]

## GRanges object with 6 ranges and 5 metadata columns:
##         seqnames        ranges strand |        type   gene_name     gene_id
##            <Rle>     <IRanges>  <Rle> |    <factor> <character> <character>
##   [1] Chromosome 363147-363758      - | gene               lacA  ER3413_350
##   [2] Chromosome 363147-363758      - | transcript         lacA  ER3413_350
##   [3] Chromosome 363147-363758      - | exon               lacA  ER3413_350
##   [4] Chromosome 363150-363758      - | CDS                lacA  ER3413_350
##   [5] Chromosome 363756-363758      - | start_codon        lacA  ER3413_350
##   [6] Chromosome 363147-363149      - | stop_codon         lacA  ER3413_350
##       transcript_id     exon_id
##         <character> <character>
##   [1]          <NA>        <NA>
##   [2]      AIZ53204        <NA>
##   [3]      AIZ53204  AIZ53204-1
##   [4]      AIZ53204        <NA>
##   [5]      AIZ53204        <NA>
##   [6]      AIZ53204        <NA>
##   -------
##   seqinfo: 1 sequence (1 circular) from an unspecified genome

Note: subset is a generic R function. It is also similar to dplyr’s filter. The second argument is special, in it you can refer to columns of the GRanges directly.

We could also get all features of a particular type.

cds <- subset(features, type == "CDS")
cds
# Equivalently:
#   features[features$type == "CDS"]

## GRanges object with 4052 ranges and 5 metadata columns:
##            seqnames          ranges strand |     type   gene_name     gene_id
##               <Rle>       <IRanges>  <Rle> | <factor> <character> <character>
##      [1] Chromosome         190-252      + |      CDS        thrL ER3413_4519
##      [2] Chromosome        337-2796      + |      CDS        thrA    ER3413_1
##      [3] Chromosome       2801-3730      + |      CDS        thrB    ER3413_2
##      [4] Chromosome       3734-5017      + |      CDS        thrC    ER3413_3
##      [5] Chromosome       5234-5527      + |      CDS        yaaX    ER3413_4
##      ...        ...             ...    ... .      ...         ...         ...
##   [4048] Chromosome 4553704-4555125      + |      CDS        creC ER3413_4511
##   [4049] Chromosome 4555186-4556535      + |      CDS        creD ER3413_4512
##   [4050] Chromosome 4556601-4557314      - |      CDS        arcA ER3413_4513
##   [4051] Chromosome 4557410-4557547      + |      CDS        yjjY ER3413_4541
##   [4052] Chromosome 4557950-4558633      + |      CDS        yjtD ER3413_4514
##          transcript_id     exon_id
##            <character> <character>
##      [1]      AIZ54182        <NA>
##      [2]      AIZ50783        <NA>
##      [3]      AIZ51761        <NA>
##      [4]      AIZ52710        <NA>
##      [5]      AIZ53685        <NA>
##      ...           ...         ...
##   [4048]      AIZ54174        <NA>
##   [4049]      AIZ54175        <NA>
##   [4050]      AIZ54176        <NA>
##   [4051]      AIZ54207        <NA>
##   [4052]      AIZ54177        <NA>
##   -------
##   seqinfo: 1 sequence (1 circular) from an unspecified genome

3 Further operations on GRanges

3.1 Intra-range

Various useful manipulations of individual ranges are defined.

?"intra-range-methods"

Note: How these make use of the strand is a little haphazard. For example flank() and resize() respect strand but shift() does not.

Earlier we translated a coding sequence. Coding sequences are terminated by a stop codon. Let’s extend the CDS feature to include this.

feat <- features[4]
feat_stop <- resize(feat, width(feat)+3)
seq_stop <- getSeq(seqs, feat_stop)
translate(seq_stop)

## AAStringSet object of length 1:
##     width seq
## [1]    22 MKRISTTITTTITITTGNGAG*

resize can fix either the fix="start" or fix="end" of the sequence.

flank can be either flank the start (start=TRUE) or end (start=FALSE).

#                                  5'     3'
# input                            ------->
#                                  .      .
# resize(input, n, fix="start")    -->    .
# resize(input, n, fix="end")      .    -->
#                                  .      .
# flank(input, n, start=TRUE)   -->.      .
# flank(input, n, start=FALSE)     .      .-->
#                                  .      .
# promoters(input, n1,n2)     -------->   .
#

3.2 Inter-range

?"inter-range-methods"
?"nearest-methods"
?"setops-methods"

One compelling feature of GenomicRanges is that it is able to find overlapping ranges very quickly.

query <- as("Chromosome:9500-10000:+", "GRanges")
hits <- findOverlaps(query, features, ignore.strand=TRUE)
hits

## Hits object with 10 hits and 0 metadata columns:
##        queryHits subjectHits
##        <integer>   <integer>
##    [1]         1          49
##    [2]         1          50
##    [3]         1          51
##    [4]         1          52
##    [5]         1          54
##    [6]         1          55
##    [7]         1          56
##    [8]         1          57
##    [9]         1          58
##   [10]         1          60
##   -------
##   queryLength: 1 / subjectLength: 24926

subjectHits(hits)

##  [1] 49 50 51 52 54 55 56 57 58 60

features[subjectHits(hits)]

## GRanges object with 10 ranges and 5 metadata columns:
##          seqnames     ranges strand |       type   gene_name     gene_id
##             <Rle>  <IRanges>  <Rle> |   <factor> <character> <character>
##    [1] Chromosome  9306-9893      + | gene               mog    ER3413_8
##    [2] Chromosome  9306-9893      + | transcript         mog    ER3413_8
##    [3] Chromosome  9306-9893      + | exon               mog    ER3413_8
##    [4] Chromosome  9306-9890      + | CDS                mog    ER3413_8
##    [5] Chromosome  9891-9893      + | stop_codon         mog    ER3413_8
##    [6] Chromosome 9928-10494      - | gene              yaaH    ER3413_9
##    [7] Chromosome 9928-10494      - | transcript        yaaH    ER3413_9
##    [8] Chromosome 9928-10494      - | exon              yaaH    ER3413_9
##    [9] Chromosome 9931-10494      - | CDS               yaaH    ER3413_9
##   [10] Chromosome  9928-9930      - | stop_codon        yaaH    ER3413_9
##        transcript_id     exon_id
##          <character> <character>
##    [1]          <NA>        <NA>
##    [2]      AIZ54621        <NA>
##    [3]      AIZ54621  AIZ54621-1
##    [4]      AIZ54621        <NA>
##    [5]      AIZ54621        <NA>
##    [6]          <NA>        <NA>
##    [7]      AIZ54730        <NA>
##    [8]      AIZ54730  AIZ54730-1
##    [9]      AIZ54730        <NA>
##   [10]      AIZ54730        <NA>
##   -------
##   seqinfo: 1 sequence (1 circular) from an unspecified genome

findOverlaps(query, features, ignore.strand=FALSE)

## Hits object with 5 hits and 0 metadata columns:
##       queryHits subjectHits
##       <integer>   <integer>
##   [1]         1          49
##   [2]         1          50
##   [3]         1          51
##   [4]         1          52
##   [5]         1          54
##   -------
##   queryLength: 1 / subjectLength: 24926

With findOverlaps, we can use genomic location as the key when joining disparate types of data, so this is an important tool for integrative analysis. See also the related functions nearest, precede, follow, and distance.

GenomicRanges also provides:

range - get a feature that spans from the start to the end of all features in a GRanges.
reduce - merge overlapping features, so that the same bases are covered by a reduced collection of features.
disjoin - as with reduce, but broken at each start and end of the input features.
setdiff - subtracts one set of features from another, could be used with range on a set of exons to get introns. Might need to use GenomicRanges::setdiff if also using dplyr.

# input
#         --------->
#              ----------------->
#                      ------>
#                                     ---->
#                                     
# range   -------------------------------->
# 
# reduce  ----------------------->    ---->
# 
# disjoin ---->---->-->------>--->    ---->
# 
# setdiff(range(input),input)     --->

3.3 Challenge

What are E. coli’s most common start and stop codons?

The start codon is the first three bases of the CDS, and the stop codon is the three bases following the end of the CDS.

Hint: Recall that we could get all CDS ranges with:

cds <- subset(features, type == "CDS")

Hint: Use resize(..., ..., fix="start") and flank(..., ..., start=FALSE) to manipulate these ranges.

Hint: Use table() to count occurrences of different strings.

3.4 Transcript database objects

We’ve been using our genomic features as one big unstructured GRanges. This is messy. Furthermore, eukaryote genes contain exons and introns, introducing complications not seen in bacteria.

Let’s look at the features associated with the gene “lacY”.

subset(features, gene_name == "lacY")

## GRanges object with 6 ranges and 5 metadata columns:
##         seqnames        ranges strand |        type   gene_name     gene_id
##            <Rle>     <IRanges>  <Rle> |    <factor> <character> <character>
##   [1] Chromosome 363824-365077      - | gene               lacY  ER3413_351
##   [2] Chromosome 363824-365077      - | transcript         lacY  ER3413_351
##   [3] Chromosome 363824-365077      - | exon               lacY  ER3413_351
##   [4] Chromosome 363827-365077      - | CDS                lacY  ER3413_351
##   [5] Chromosome 365075-365077      - | start_codon        lacY  ER3413_351
##   [6] Chromosome 363824-363826      - | stop_codon         lacY  ER3413_351
##       transcript_id     exon_id
##         <character> <character>
##   [1]          <NA>        <NA>
##   [2]      AIZ53215        <NA>
##   [3]      AIZ53215  AIZ53215-1
##   [4]      AIZ53215        <NA>
##   [5]      AIZ53215        <NA>
##   [6]      AIZ53215        <NA>
##   -------
##   seqinfo: 1 sequence (1 circular) from an unspecified genome

Look at the different types in the “type” column. Each “gene” may have multiple “transcript” features (isoforms). Each transcript in turn has a set of “exon” features, and if it is a protein coding gene, a set of “CDS” (coding sequence) features. The CDS features cover a subset of the bases covered by the exon features.

# --------------------------------------------------> gene
# 
# -------------------------------------------->       transcript
# ---------->         --->    ---------------->       exon
#       ---->         --->    ---------->             CDS
# 
# 
#                -----------------------------------> transcript
#                -------->       ---->    ----------> exon
#                     --->       ---->    -->         CDS

txdb <- makeTxDbFromGRanges(features)
txdb

## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Genome: NA
## # Nb of transcripts: 4257
## # Db created by: GenomicFeatures package from Bioconductor
## # Creation time: 2021-03-24 14:05:15 +1100 (Wed, 24 Mar 2021)
## # GenomicFeatures version at creation time: 1.42.2
## # RSQLite version at creation time: 2.2.4
## # DBSCHEMAVERSION: 1.2

txdb is a TxDb object.

genes(txdb)

## GRanges object with 4257 ranges and 1 metadata column:
##                 seqnames          ranges strand |     gene_id
##                    <Rle>       <IRanges>  <Rle> | <character>
##      ER3413_1 Chromosome        337-2799      + |    ER3413_1
##     ER3413_10 Chromosome     10643-11356      - |   ER3413_10
##    ER3413_100 Chromosome   113444-114487      + |  ER3413_100
##   ER3413_1000 Chromosome 1033581-1034438      + | ER3413_1000
##   ER3413_1001 Chromosome 1034572-1036116      + | ER3413_1001
##           ...        ...             ...    ... .         ...
##    ER3413_995 Chromosome 1028971-1030089      + |  ER3413_995
##    ER3413_996 Chromosome 1030086-1031879      + |  ER3413_996
##    ER3413_997 Chromosome 1031898-1032605      + |  ER3413_997
##    ER3413_998 Chromosome 1032602-1033189      + |  ER3413_998
##    ER3413_999 Chromosome 1033186-1033584      + |  ER3413_999
##   -------
##   seqinfo: 1 sequence (1 circular) from an unspecified genome

transcriptsBy(txdb, by="gene")

## GRangesList object of length 4257:
## $ER3413_1
## GRanges object with 1 range and 2 metadata columns:
##         seqnames    ranges strand |     tx_id     tx_name
##            <Rle> <IRanges>  <Rle> | <integer> <character>
##   [1] Chromosome  337-2799      + |         2    AIZ50783
##   -------
##   seqinfo: 1 sequence (1 circular) from an unspecified genome
## 
## $ER3413_10
## GRanges object with 1 range and 2 metadata columns:
##         seqnames      ranges strand |     tx_id     tx_name
##            <Rle>   <IRanges>  <Rle> | <integer> <character>
##   [1] Chromosome 10643-11356      - |      2087    AIZ50784
##   -------
##   seqinfo: 1 sequence (1 circular) from an unspecified genome
## 
## $ER3413_100
## GRanges object with 1 range and 2 metadata columns:
##         seqnames        ranges strand |     tx_id     tx_name
##            <Rle>     <IRanges>  <Rle> | <integer> <character>
##   [1] Chromosome 113444-114487      + |        62    AIZ50785
##   -------
##   seqinfo: 1 sequence (1 circular) from an unspecified genome
## 
## ...
## <4254 more elements>

exonsBy(txdb, use.names=TRUE)

## GRangesList object of length 4257:
## $AIZ54182
## GRanges object with 1 range and 3 metadata columns:
##         seqnames    ranges strand |   exon_id   exon_name exon_rank
##            <Rle> <IRanges>  <Rle> | <integer> <character> <integer>
##   [1] Chromosome   190-255      + |         1  AIZ54182-1         1
##   -------
##   seqinfo: 1 sequence (1 circular) from an unspecified genome
## 
## $AIZ50783
## GRanges object with 1 range and 3 metadata columns:
##         seqnames    ranges strand |   exon_id   exon_name exon_rank
##            <Rle> <IRanges>  <Rle> | <integer> <character> <integer>
##   [1] Chromosome  337-2799      + |         2  AIZ50783-1         1
##   -------
##   seqinfo: 1 sequence (1 circular) from an unspecified genome
## 
## $AIZ51761
## GRanges object with 1 range and 3 metadata columns:
##         seqnames    ranges strand |   exon_id   exon_name exon_rank
##            <Rle> <IRanges>  <Rle> | <integer> <character> <integer>
##   [1] Chromosome 2801-3733      + |         3  AIZ51761-1         1
##   -------
##   seqinfo: 1 sequence (1 circular) from an unspecified genome
## 
## ...
## <4254 more elements>

cdsBy(txdb, use.names=TRUE)

## GRangesList object of length 4051:
## $AIZ54182
## GRanges object with 1 range and 3 metadata columns:
##         seqnames    ranges strand |    cds_id    cds_name exon_rank
##            <Rle> <IRanges>  <Rle> | <integer> <character> <integer>
##   [1] Chromosome   190-255      + |         1        <NA>         1
##   -------
##   seqinfo: 1 sequence (1 circular) from an unspecified genome
## 
## $AIZ50783
## GRanges object with 1 range and 3 metadata columns:
##         seqnames    ranges strand |    cds_id    cds_name exon_rank
##            <Rle> <IRanges>  <Rle> | <integer> <character> <integer>
##   [1] Chromosome  337-2799      + |         2        <NA>         1
##   -------
##   seqinfo: 1 sequence (1 circular) from an unspecified genome
## 
## $AIZ51761
## GRanges object with 1 range and 3 metadata columns:
##         seqnames    ranges strand |    cds_id    cds_name exon_rank
##            <Rle> <IRanges>  <Rle> | <integer> <character> <integer>
##   [1] Chromosome 2801-3733      + |         3        <NA>         1
##   -------
##   seqinfo: 1 sequence (1 circular) from an unspecified genome
## 
## ...
## <4048 more elements>

cds_ranges <- cdsBy(txdb, use.names=TRUE)
cds_ranges$AIZ54182

## GRanges object with 1 range and 3 metadata columns:
##         seqnames    ranges strand |    cds_id    cds_name exon_rank
##            <Rle> <IRanges>  <Rle> | <integer> <character> <integer>
##   [1] Chromosome   190-255      + |         1        <NA>         1
##   -------
##   seqinfo: 1 sequence (1 circular) from an unspecified genome

cds_ranges[[1]]

## GRanges object with 1 range and 3 metadata columns:
##         seqnames    ranges strand |    cds_id    cds_name exon_rank
##            <Rle> <IRanges>  <Rle> | <integer> <character> <integer>
##   [1] Chromosome   190-255      + |         1        <NA>         1
##   -------
##   seqinfo: 1 sequence (1 circular) from an unspecified genome

unlist(cds_ranges)

## GRanges object with 4052 ranges and 3 metadata columns:
##              seqnames          ranges strand |    cds_id    cds_name exon_rank
##                 <Rle>       <IRanges>  <Rle> | <integer> <character> <integer>
##   AIZ54182 Chromosome         190-255      + |         1        <NA>         1
##   AIZ50783 Chromosome        337-2799      + |         2        <NA>         1
##   AIZ51761 Chromosome       2801-3733      + |         3        <NA>         1
##   AIZ52710 Chromosome       3734-5020      + |         4        <NA>         1
##   AIZ53685 Chromosome       5234-5530      + |         5        <NA>         1
##        ...        ...             ...    ... .       ...         ...       ...
##   AIZ54159 Chromosome 4541153-4541797      - |      4048        <NA>         1
##   AIZ54165 Chromosome 4545863-4547530      - |      4049        <NA>         1
##   AIZ54168 Chromosome 4550241-4550753      - |      4050        <NA>         1
##   AIZ54169 Chromosome 4551449-4552318      - |      4051        <NA>         1
##   AIZ54176 Chromosome 4556598-4557314      - |      4052        <NA>         1
##   -------
##   seqinfo: 1 sequence (1 circular) from an unspecified genome

cds_ranges is a GRangesList. That is, a list containing GRanges objects. As this is a bacteria, each transcript only has a single range, but in a eukaryote each transcript could consist of many ranges. To get the transcript sequence or the coding sequence, these each need to be retrieved and then concatenated together. extractTranscriptSeqs can do this for us.

extractTranscriptSeqs(seqs, cds_ranges)

## DNAStringSet object of length 4051:
##        width seq                                            names               
##    [1]    66 ATGAAACGCATTAGCACCACCA...ACAGGTAACGGTGCGGGCTGA AIZ54182
##    [2]  2463 ATGCGAGTGTTGAAGTTCGGCG...TCATGGAAGTTAGGAGTCTGA AIZ50783
##    [3]   933 ATGGTTAAAGTTTATGCCCCGG...GCACGAGTACTGGAAAACTAA AIZ51761
##    [4]  1287 ATGAAACTCTACAATCTGAAAG...TTGATGATGAATCATCAGTAA AIZ52710
##    [5]   297 GTGAAAAAGATGCAATCTATCG...CCAGGCAAACATCACCGCTAA AIZ53685
##    ...   ... ...
## [4047]   645 ATGGCTCGCACAAAACTGAAAT...GAAAGCGAGAAAAAAGAGTGA AIZ54159
## [4048]  1668 GTGGCTCAATTCGTTTATACCA...TACAAGCGTATTGCGAAGTAA AIZ54165
## [4049]   513 ATGCACCAAGTTGTCTGTGCGA...TTTCATAATGCCGTGTATTAA AIZ54168
## [4050]   870 ATGGATCAGGCCGGCATTATTC...CTGATTCCGATCCGTCGTTAA AIZ54169
## [4051]   717 ATGCAGACCCCGCACATTCTTA...TGCGGTGATCTGGAAGATTAA AIZ54176

There’s much more to explore here. Have a look at the documentation for the GenomicFeatures and ensembldb packages. To annotate a set of genomic features such as peaks from a ChIP-seq experiment, see for example the ChIPseeker package.

4 Genome and annotation resources

Besides software, Bioconductor includes various types of data. An important type of data is data describing model organisms. This is either supplied through data packages or through the newer AnnotationHub system. It is generally derived from these central repositories:

The NCBI’s RefSeq
The EBI’s Ensembl genome browser
The UCSC genome browser

UCSC was the original web-based genome browser. UCSC’s “KnownGene” gene annotations used to be the cutting edge gene annotation source, but UCSC now relies on other sources for gene annotations. Many file types that remain important today were developed for the UCSC genome browser, such as “bed”, “bigWig”, and “2bit”.

These organizations will generally obtain genome assemblies from the same ultimate sources. For example, all of the above use the Genome Reference Consortium’s GRCh38 DNA sequence for homo sapiens. UCSC calls this “hg38” but it is the same DNA sequence. These DNA sequences serve as a common frame of reference. However the three organizations above will differ on their exact set of gene and transcript annotations, and use different gene and transcript ID systems.

Genome assemblies are released infrequently. GRCh38 (hg38) was released in 2013. The previous assembly, GRCh37 (hg19) was released in 2009. Some people haven’t updated yet, you will find plenty of data using “hg19” positions! Gene and transcript annotations are updated far more frequently.

As well as the chromosomes in the “primary assembly” a genome assembly may have further sequences, which may have been added after the initial release:

patch sequences: fixes that would change the sizes of chromosomes
alt loci: a way to represent alleles, genetic diversity in the species

4.1 Example packages

BSgenome.Hsapiens.UCSC.hg38 Biostrings genome, Homo sapiens, from the UCSC browser, version hg38. DNA for chromosomes, usable in the same way as the DNAStringSet used above.
TxDb.Hsapiens.UCSC.hg38.knownGene Transcript database, Homo sapiens, from UCSC browser, genome verison hg38, “knownGene” gene annotations. GRanges information for genes and transcripts, much as we loaded from a GTF file above.
org.Hs.eg.db Organism Homo sapiens, primary key is Entrez Gene, database. Translation of gene ids from various databases, assignment to GO terms, KEGG pathways, etc. Entrez Gene ids are used as the primary key.
GO.db GO Gene Ontology term descriptions and relationships. However to find which terms apply to specific genes, use the GOALL column in the relevant species’ organism database.

4.2 biomaRt

BioMart servers accessed using the biomaRt package are another way to get information such as translation of gene ids, gene sets, and gene information.

Beware replicability. BioMart servers are not guaranteed to be available forever, and BioMart doesn’t have a clear story around versioning.

4.3 AnnotationHub

AnnotationHub is a way to retrieve data from a more comprehensive set of organisms and data providers than are provided as individual Bioconductor packages. The retrieved data is returned in an appropriate Bioconductor type. If data is being updated over time (eg improved annotation of a genome), each version receives a unique ID in AnnotationHub, making it possible to write reproducable analyses.

Files are cached, so they will only be downloaded once.

In the example below, the yeast genome and Ensembl annotations are retrieved:

library(AnnotationHub)
ah <- AnnotationHub()

# ah contains a large collection of records that can be retrieved
ah
length(ah)
colnames( mcols(ah) )
table( ah$rdataclass )

# There is an interactive Shiny search interface
display(ah)

# query() searches for terms in an unstructured way
records <- query(ah, c("Ensembl", "96", "Saccharomyces cerevisiae"))
records

mcols(records)
mcols(records)[,c("title","rdataclass")]

# Having located records of interest,
# your R script can refer to the specific AH... record,
# so it always uses the same version of the data.
sc_genome <- ah[["AH70449"]]
sc_granges <- ah[["AH69700"]]
sc_txdb <- ah[["AH69265"]]

# sc_genome is a TwoBitFile
# Can use getSeq on it without loading everything into memory
seqinfo(sc_genome)
getSeq(sc_genome, as("IV:1-10","GRanges"))
import(sc_genome)

# An OrgDb contains information about genes in an organism, and lets you map between different identifiers
query(ah, c("OrgDb", "Saccharomyces cerevisiae"))
sc_orgdb <- ah[["AH84129"]]
columns(sc_orgdb)
head( keys(sc_orgdb, "ENSEMBL") )
select(sc_orgdb, "YFL039C", keytype="ENSEMBL", columns=c("GENENAME", "DESCRIPTION"))

# As well as IDs, genes have short, easy to remember "symbols" (also often called "names")
# We can use the OrgDb to look up gene IDs from symbols
# Notice a problem here!
select(sc_orgdb, c("ACT1", "COS2"), keytype="GENENAME",  columns=c("ENSEMBL"))

5 Example: finding and discovering motifs

5.1 Finding a known motif

AGGAGGU is the Shine-Dalgarno sequence, which assists binding of the ribosome to a transcript.

vmatchPattern("AGGAGGT", seqs)

## MIndex object of length 1
## $Chromosome
## IRanges object with 63 ranges and 0 metadata columns:
##            start       end     width
##        <integer> <integer> <integer>
##    [1]     56593     56599         7
##    [2]     67347     67353         7
##    [3]    226876    226882         7
##    [4]    229408    229414         7
##    [5]    241665    241671         7
##    ...       ...       ...       ...
##   [59]   4312631   4312637         7
##   [60]   4371930   4371936         7
##   [61]   4410503   4410509         7
##   [62]   4420666   4420672         7
##   [63]   4484025   4484031         7

vmatchPattern is strand specific. If we want matches on the reverse strand we need to also:

vmatchPattern(reverseComplement(DNAString("AGGAGGT")), seqs)

## MIndex object of length 1
## $Chromosome
## IRanges object with 76 ranges and 0 metadata columns:
##            start       end     width
##        <integer> <integer> <integer>
##    [1]     59133     59139         7
##    [2]    125294    125300         7
##    [3]    136473    136479         7
##    [4]    226640    226646         7
##    [5]    266770    266776         7
##    ...       ...       ...       ...
##   [72]   4139844   4139850         7
##   [73]   4181244   4181250         7
##   [74]   4241083   4241089         7
##   [75]   4397026   4397032         7
##   [76]   4473495   4473501         7

Demanding an exact match here is overly strict. vmatchPattern has arguments allowing inexact matches. Alternatively, there is a similar function for searching for a Position Weight Matrix pattern, matchPWM.

The following will search both strands, allowing one mismatch, and produce the result in convenient GRanges form:

query <- DNAString("AGGAGGT")
max.mismatch <- 1

fwd <- vmatchPattern(query, seqs, max.mismatch=max.mismatch)
fwd <- as(fwd, "GRanges")
strand(fwd) <- "+"
rev <- vmatchPattern(reverseComplement(query), seqs, max.mismatch=max.mismatch)
rev <- as(rev, "GRanges")
strand(rev) <- "-"

complete <- c(fwd, rev)
complete

## GRanges object with 7534 ranges and 0 metadata columns:
##            seqnames          ranges strand
##               <Rle>       <IRanges>  <Rle>
##      [1] Chromosome         323-329      +
##      [2] Chromosome       3540-3546      +
##      [3] Chromosome       3765-3771      +
##      [4] Chromosome       5374-5380      +
##      [5] Chromosome       7641-7647      +
##      ...        ...             ...    ...
##   [7530] Chromosome 4550281-4550287      -
##   [7531] Chromosome 4551603-4551609      -
##   [7532] Chromosome 4551732-4551738      -
##   [7533] Chromosome 4552223-4552229      -
##   [7534] Chromosome 4552751-4552757      -
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

# Write to GFF file
export(complete, "motif-matches.gff")

We might then view this in the IGV genome browser:

5.2 De novo motif finding

Let’s try to “discover” the Shine-Dalgarno sequence for ourselves.

# Note: bacteria do not have introns
# In a eukaryote we would need to work with transcript 
# sequences (extractTranscriptSeqs) and work out where 
# the CDSs start within them.

size <- 20

initiation_regions <- flank(cds, size, start=TRUE)
initiation_seqs <- getSeq(seqs, initiation_regions)
names(initiation_seqs) <- initiation_regions$gene_id

# Look for any composition bias
library(seqLogo)
letter_counts <- consensusMatrix(initiation_seqs)
probs <- prop.table(letter_counts[1:4,], 2)
seqLogo(probs, ic.scale=FALSE)

seqLogo(probs)

A popular suite of motif discovery programs is MEME. We can use MEME after writing the sequences to a file.

writeXStringSet(initiation_seqs, "init_seqs.fasta")

MEME is command-line software. Bioinformatics software, including MEME, often requires a UNIX environment such as Linux or Mac OS. Windows users may need to obtain access to a Linux server or run Linux in a “container”. To run MEME we need to type commands in a “terminal” window. This is a whole other workshop!

Bioinformatics software is often most easily installed using conda. To minimize the number of default packages installed, use “Miniconda”. Once you have installed Miniconda, you can activate conda and install MEME with:

### in a terminal ###
# conda activate
# conda install -c bioconda meme

Now we can run meme with:

### in a terminal ###
# meme -dna init_seqs.fasta

Running meme will produce this output.

5.3 Challenge

Which genes have close matches to AGGAGGT (as we found earlier) immediately upstrand of their CDS?

Check some of the genes you find using IGV. (Once the E. coli FASTA file is loaded as the genome and the E. coli GTF file is loaded, you can search for a gene by typing its name in the location box.)

sessionInfo()

## R version 4.0.3 (2020-10-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Mojave 10.14.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] GenomicFeatures_1.42.2 AnnotationDbi_1.52.0   Biobase_2.50.0        
##  [4] BSgenome_1.58.0        rtracklayer_1.50.0     GenomicRanges_1.42.0  
##  [7] GenomeInfoDb_1.26.4    Biostrings_2.58.0      XVector_0.30.0        
## [10] IRanges_2.24.1         S4Vectors_0.28.1       BiocGenerics_0.36.0   
## 
## loaded via a namespace (and not attached):
##  [1] MatrixGenerics_1.2.1        httr_1.4.2                 
##  [3] sass_0.3.1                  bit64_4.0.5                
##  [5] jsonlite_1.7.2              bslib_0.2.4                
##  [7] assertthat_0.2.1            askpass_1.1                
##  [9] BiocFileCache_1.14.0        blob_1.2.1                 
## [11] GenomeInfoDbData_1.2.4      Rsamtools_2.6.0            
## [13] yaml_2.2.1                  progress_1.2.2             
## [15] pillar_1.5.1                RSQLite_2.2.4              
## [17] lattice_0.20-41             glue_1.4.2                 
## [19] digest_0.6.27               htmltools_0.5.1.1          
## [21] Matrix_1.3-2                XML_3.99-0.5               
## [23] pkgconfig_2.0.3             biomaRt_2.46.3             
## [25] zlibbioc_1.36.0             purrr_0.3.4                
## [27] BiocParallel_1.24.1         tibble_3.1.0               
## [29] openssl_1.4.3               generics_0.1.0             
## [31] ellipsis_0.3.1              cachem_1.0.4               
## [33] SummarizedExperiment_1.20.0 magrittr_2.0.1             
## [35] crayon_1.4.1                memoise_2.0.0              
## [37] evaluate_0.14               fansi_0.4.2                
## [39] xml2_1.3.2                  tools_4.0.3                
## [41] prettyunits_1.1.1           hms_1.0.0                  
## [43] lifecycle_1.0.0             matrixStats_0.58.0         
## [45] stringr_1.4.0               DelayedArray_0.16.2        
## [47] compiler_4.0.3              jquerylib_0.1.3            
## [49] rlang_0.4.10                grid_4.0.3                 
## [51] RCurl_1.98-1.2              rappdirs_0.3.3             
## [53] bitops_1.0-6                rmarkdown_2.7              
## [55] DBI_1.1.1                   curl_4.3                   
## [57] R6_2.5.0                    GenomicAlignments_1.26.0   
## [59] knitr_1.31                  dplyr_1.0.5                
## [61] fastmap_1.1.0               bit_4.0.4                  
## [63] utf8_1.2.1                  stringi_1.5.3              
## [65] Rcpp_1.0.6                  vctrs_0.3.6                
## [67] dbplyr_2.1.0                tidyselect_1.1.0           
## [69] xfun_0.22

Working with sequences and genomic features