“Data frame” is R’s name for tabular data. We generally want each row in a data frame to represent a unit of observation, and each column to contain a different type of information about the units of observation. Tabular data in this form is called “tidy data”.
This section uses a collection of modern packages collectively known as the Tidyverse. R and its predecessor S have a history dating back to 1976. The Tidyverse fixes some dubious design decisions baked into “base R”, including having its own slightly improved form of data frame. Sticking to the Tidyverse where possible is generally safer, because Tidyverse packages are more willing to generate errors rather than ignore problems.
If the Tidyverse is not already installed, you will need to install it. However, on the server we are using today it is already installed.
# install.packages("tidyverse")
We need to load this package in order to use it.
library(tidyverse)
The tidyverse package loads various other packages, setting up a modern R environment. In this section we will be using functions from the readr and dplyr packages.
R is a language with mini-languages within it that solve specific problem domains. dplyr is such a mini-language, a set of “verbs” (functions) that work well together. dplyr, with the help of tidyr for some more complex operations, provides a way to perform most manipulations on a data frame that you might need.
We will use the read_csv function from readr. (See also read.csv in base R.)
gap <- read_csv("gapminder.csv")
## Parsed with column specification:
## cols(
## country = col_character(),
## continent = col_character(),
## year = col_integer(),
## lifeExp = col_double(),
## pop = col_integer(),
## gdpPercap = col_double()
## )
gap
## # A tibble: 1,704 x 6
## country continent year lifeExp pop gdpPercap
## <chr> <chr> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.801 8425333 779.4453
## 2 Afghanistan Asia 1957 30.332 9240934 820.8530
## 3 Afghanistan Asia 1962 31.997 10267083 853.1007
## 4 Afghanistan Asia 1967 34.020 11537966 836.1971
## 5 Afghanistan Asia 1972 36.088 13079460 739.9811
## 6 Afghanistan Asia 1977 38.438 14880372 786.1134
## 7 Afghanistan Asia 1982 39.854 12881816 978.0114
## 8 Afghanistan Asia 1987 40.822 13867957 852.3959
## 9 Afghanistan Asia 1992 41.674 16317921 649.3414
## 10 Afghanistan Asia 1997 41.763 22227415 635.3414
## # ... with 1,694 more rows
This is data from Gapminder on life expectancy over time in different countries. The “unit of observation” is a country in a particular year.
Note: “tibble” refers to the Tidyverse’s improved form of data frame.
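The column specification printed above is guessed by read_csv from the contents of the file. As a minimal sketch, assuming a small made-up CSV written to a temporary file (the file and its values are invented for illustration and are not the Gapminder data), the col_types argument lets us state the expected types ourselves:

```r
library(readr)

# Toy CSV written to a temporary file; values are invented for illustration.
path <- tempfile(fileext = ".csv")
writeLines(c(
    "country,year,lifeExp",
    "Atlantis,1952,50.5",
    "Atlantis,1957,52.1"), path)

# State the column types explicitly instead of letting read_csv guess them.
toy <- read_csv(path, col_types = cols(
    country = col_character(),
    year    = col_integer(),
    lifeExp = col_double()))

toy
```

Stating the types up front means a file that does not match them produces parsing problems we can see, rather than a silently different column type.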
The View function gives us a spreadsheet-like view of the data frame.
View(gap)
However, understanding this data frame in R should be less a matter of using a graphical interface, and more about using a variety of R functions to interrogate it.
nrow(gap)
## [1] 1704
ncol(gap)
## [1] 6
colnames(gap)
## [1] "country" "continent" "year" "lifeExp" "pop" "gdpPercap"
class(gap)
## [1] "tbl_df" "tbl" "data.frame"
typeof(gap)
## [1] "list"
summary(gap)
## country continent year lifeExp
## Length:1704 Length:1704 Min. :1952 Min. :23.60
## Class :character Class :character 1st Qu.:1966 1st Qu.:48.20
## Mode :character Mode :character Median :1980 Median :60.71
## Mean :1980 Mean :59.47
## 3rd Qu.:1993 3rd Qu.:70.85
## Max. :2007 Max. :82.60
## pop gdpPercap
## Min. :6.001e+04 Min. : 241.2
## 1st Qu.:2.794e+06 1st Qu.: 1202.1
## Median :7.024e+06 Median : 3531.8
## Mean :2.960e+07 Mean : 7215.3
## 3rd Qu.:1.959e+07 3rd Qu.: 9325.5
## Max. :1.319e+09 Max. :113523.1
A data frame can also be created from vectors, with the tibble function. (Older code uses data_frame, which is now deprecated in favour of tibble. See also data.frame in base R.) For example:
tibble(foo=c(10,20,30), bar=c("a","b","c"))
## # A tibble: 3 x 2
## foo bar
## <dbl> <chr>
## 1 10 a
## 2 20 b
## 3 30 c
A data frame has column names (colnames), and base R data frames can also have row names (rownames). However, the modern convention, which the Tidyverse enforces, is for a data frame to use column names but not row names. Typically a data frame contains a collection of items (rows), each having various properties (columns). If an item has an identifier such as a unique name, this would be given as just another column.
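To illustrate the convention, here is a small sketch using mtcars, a data frame that ships with base R and keeps the car models in its row names. The rownames_to_column function comes from the tibble package; the column name "model" is simply a name chosen for this example.

```r
library(tibble)

# mtcars is a base R data frame that stores car models as row names.
head(rownames(mtcars), 3)

# Move the row names into an ordinary column called "model".
cars <- as_tibble(rownames_to_column(mtcars, var = "model"))
cars[1:3, c("model", "mpg")]
```

After this step the identifier is just another column, so it can be counted, filtered and sorted like any other.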
The count function from dplyr can help us understand the structure of this data frame. (See also table in base R.) count is a little magical: we can refer to columns of the data frame directly in the arguments to count.
count(gap, year)
## # A tibble: 12 x 2
## year n
## <int> <int>
## 1 1952 142
## 2 1957 142
## 3 1962 142
## 4 1967 142
## 5 1972 142
## 6 1977 142
## 7 1982 142
## 8 1987 142
## 9 1992 142
## 10 1997 142
## 11 2002 142
## 12 2007 142
count(gap, country)
## # A tibble: 142 x 2
## country n
## <chr> <int>
## 1 Afghanistan 12
## 2 Albania 12
## 3 Algeria 12
## 4 Angola 12
## 5 Argentina 12
## 6 Australia 12
## 7 Austria 12
## 8 Bahrain 12
## 9 Bangladesh 12
## 10 Belgium 12
## # ... with 132 more rows
count(count(gap, country), n)
## # A tibble: 1 x 2
## n nn
## <int> <int>
## 1 12 142
There is data from 142 countries at 12 time points. The data is complete, with no missing values.
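The same kind of check can reveal gaps when data is not complete. The following is a sketch on a toy data frame invented for illustration, in which one country lacks a row for one of the years:

```r
library(tibble)
library(dplyr)

# Toy data, invented for illustration: country "B" lacks a 1957 row.
toy <- tibble(
    country = c("A", "A", "B"),
    year    = c(1952L, 1957L, 1952L),
    lifeExp = c(50.1, 52.3, 61.0))

per_country <- count(toy, country)
per_country

# Countries with fewer rows than the maximum are missing some years.
per_country[per_country$n < max(per_country$n), ]

# Missing values stored inside the columns would show up here instead.
anyNA(toy)
```

Note that a missing row and a missing value are different things: count finds the first, anyNA the second.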
Data frames can be subset using [row,column] syntax.
gap[3,4]
## # A tibble: 1 x 1
## lifeExp
## <dbl>
## 1 31.997
Note that this is still wrapped in a data frame. (This is a behaviour specific to Tidyverse data frames.)
Columns can be given by name.
gap[3, "lifeExp"]
## # A tibble: 1 x 1
## lifeExp
## <dbl>
## 1 31.997
Omitting the column index retrieves the full row, and omitting the row index retrieves the full column.
gap[3,]
## # A tibble: 1 x 6
## country continent year lifeExp pop gdpPercap
## <chr> <chr> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1962 31.997 10267083 853.1007
Multiple rows or columns may be retrieved using a vector.
rows_wanted <- c(1,3,5)
gap[rows_wanted,]
## # A tibble: 3 x 6
## country continent year lifeExp pop gdpPercap
## <chr> <chr> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.801 8425333 779.4453
## 2 Afghanistan Asia 1962 31.997 10267083 853.1007
## 3 Afghanistan Asia 1972 36.088 13079460 739.9811
Ok, so how do we actually get data out of a data frame?
Under the hood, a data frame is a list of column vectors. This is why typeof told us that gap was a list. We can use $ to retrieve columns, as in a list. (Occasionally it is also useful to use [[ ]] to retrieve columns, for example if the column name we want is stored in a variable.)
head( gap$lifeExp )
## [1] 28.801 30.332 31.997 34.020 36.088 38.438
head( gap[["lifeExp"]] )
## [1] 28.801 30.332 31.997 34.020 36.088 38.438
To get the lifeExp value of the third row as above, but unwrapped, we can use:
gap$lifeExp[3]
## [1] 31.997
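The case where the column name is stored in a variable can be sketched as follows. The toy data frame and the variable name wanted are invented for illustration.

```r
library(tibble)

# Toy data, invented for illustration.
toy <- tibble(lifeExp = c(50.1, 52.3, 61.0), pop = c(100L, 120L, 90L))

# Hypothetical variable holding the name of the column we want.
wanted <- "lifeExp"

toy[[wanted]]     # Looks up the column named by the variable
toy[[wanted]][3]  # Third value of that column, as a plain number

# By contrast, toy$wanted would look for a column literally called "wanted".
```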
All of these indexing and access methods can also be used with <- to modify values or add new columns. For example, suppose we wanted a GDP column. We can add this to the data frame with:
gap$gdp <- gap$gdpPercap * gap$pop
A method of indexing that we haven’t discussed yet is logical indexing. Instead of specifying the row number or numbers that we want, we can give a logical vector which is TRUE for the rows we want and FALSE otherwise. This can also be used with vectors.
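Logical indexing is easiest to see on a plain vector first. This is a small sketch with made-up numbers, using the “greater than” operator >:

```r
# Toy vector, invented for illustration.
life <- c(48.2, 76.3, 60.7, 81.2)

# A comparison produces one TRUE or FALSE per element.
is_high <- life > 75
is_high

# Indexing with the logical vector keeps the TRUE positions.
life[is_high]

# sum() counts the TRUE values, since TRUE counts as 1 and FALSE as 0.
sum(is_high)
```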
Suppose we want just the data for Australia. == is a comparison operator meaning “equal to”.
is_australia <- gap$country == "Australia"
head(is_australia)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
sum(is_australia)
## [1] 12
We can now grab just those rows of the data frame relating to Australia:
gap_australia <- gap[is_australia,]
gap_australia
## # A tibble: 12 x 7
## country continent year lifeExp pop gdpPercap gdp
## <chr> <chr> <int> <dbl> <int> <dbl> <dbl>
## 1 Australia Oceania 1952 69.120 8691212 10039.60 87256254102
## 2 Australia Oceania 1957 70.330 9712569 10949.65 106349227169
## 3 Australia Oceania 1962 70.930 10794968 12217.23 131884573002
## 4 Australia Oceania 1967 71.100 11872264 14526.12 172457986742
## 5 Australia Oceania 1972 71.930 13177000 16788.63 221223770658
## 6 Australia Oceania 1977 73.490 14074100 18334.20 258037329175
## 7 Australia Oceania 1982 74.740 15184200 19477.01 295742804309
## 8 Australia Oceania 1987 76.320 16257249 21888.89 355853119294
## 9 Australia Oceania 1992 77.560 17481977 23424.77 409511234952
## 10 Australia Oceania 1997 78.830 18565243 26997.94 501223252921
## 11 Australia Oceania 2002 80.370 19546792 30687.75 599847158654
## 12 Australia Oceania 2007 81.235 20434176 34435.37 703658358894
We might also want to know which rows our logical vector is TRUE for. This is achieved with the which function. The result of this can also be used to index the data frame, as we saw earlier.
which_australia <- which(is_australia)
which_australia
## [1] 61 62 63 64 65 66 67 68 69 70 71 72
gap[which_australia,]
## # A tibble: 12 x 7
## country continent year lifeExp pop gdpPercap gdp
## <chr> <chr> <int> <dbl> <int> <dbl> <dbl>
## 1 Australia Oceania 1952 69.120 8691212 10039.60 87256254102
## 2 Australia Oceania 1957 70.330 9712569 10949.65 106349227169
## 3 Australia Oceania 1962 70.930 10794968 12217.23 131884573002
## 4 Australia Oceania 1967 71.100 11872264 14526.12 172457986742
## 5 Australia Oceania 1972 71.930 13177000 16788.63 221223770658
## 6 Australia Oceania 1977 73.490 14074100 18334.20 258037329175
## 7 Australia Oceania 1982 74.740 15184200 19477.01 295742804309
## 8 Australia Oceania 1987 76.320 16257249 21888.89 355853119294
## 9 Australia Oceania 1992 77.560 17481977 23424.77 409511234952
## 10 Australia Oceania 1997 78.830 18565243 26997.94 501223252921
## 11 Australia Oceania 2002 80.370 19546792 30687.75 599847158654
## 12 Australia Oceania 2007 81.235 20434176 34435.37 703658358894
Comparison operators available are:
x == y – “equal to”
x != y – “not equal to”
x < y – “less than”
x > y – “greater than”
x <= y – “less than or equal to”
x >= y – “greater than or equal to”
More complicated conditions can be constructed using logical operators:
a & b – “and”, true only if both a and b are true.
a | b – “or”, true if either a or b or both are true.
! a – “not”, true if a is false, and false if a is true.
For example, suppose we wanted to know in which years the life expectancy in Australia was over 75.
over_75 <- gap$lifeExp > 75
is_australia_over_75 <- is_australia & over_75
sum(is_australia_over_75)
## [1] 5
gap[is_australia_over_75,]
## # A tibble: 5 x 7
## country continent year lifeExp pop gdpPercap gdp
## <chr> <chr> <int> <dbl> <int> <dbl> <dbl>
## 1 Australia Oceania 1987 76.320 16257249 21888.89 355853119294
## 2 Australia Oceania 1992 77.560 17481977 23424.77 409511234952
## 3 Australia Oceania 1997 78.830 18565243 26997.94 501223252921
## 4 Australia Oceania 2002 80.370 19546792 30687.75 599847158654
## 5 Australia Oceania 2007 81.235 20434176 34435.37 703658358894
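The “or” and “not” operators work the same way as “and”. Here is a sketch on a toy data frame invented for illustration:

```r
library(tibble)

# Toy data, invented for illustration.
toy <- tibble(
    country = c("A", "A", "B", "B"),
    year    = c(2002L, 2007L, 2002L, 2007L),
    lifeExp = c(74.0, 76.5, 79.0, 80.2))

is_a    <- toy$country == "A"
is_2007 <- toy$year == 2007

toy[is_a | is_2007, ]    # Rows for country A, or from 2007, or both
toy[!is_a, ]             # Rows for every country other than A
toy[!is_a & is_2007, ]   # Rows not for A and from 2007
```

In the last line, ! is applied before &, so the condition reads as (!is_a) & is_2007. Adding parentheses is a reasonable habit when in doubt.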
What continents are the countries divided into in this data?
Which countries in Asia had a life expectancy over 75 in 2007?
dplyr shorthand
The above method is a little laborious. We have to keep mentioning the name of the data frame, and there is a lot of punctuation to keep track of. dplyr provides a slightly magical function called filter which lets us write more concisely.
filter(gap, country == "Australia")
## # A tibble: 12 x 7
## country continent year lifeExp pop gdpPercap gdp
## <chr> <chr> <int> <dbl> <int> <dbl> <dbl>
## 1 Australia Oceania 1952 69.120 8691212 10039.60 87256254102
## 2 Australia Oceania 1957 70.330 9712569 10949.65 106349227169
## 3 Australia Oceania 1962 70.930 10794968 12217.23 131884573002
## 4 Australia Oceania 1967 71.100 11872264 14526.12 172457986742
## 5 Australia Oceania 1972 71.930 13177000 16788.63 221223770658
## 6 Australia Oceania 1977 73.490 14074100 18334.20 258037329175
## 7 Australia Oceania 1982 74.740 15184200 19477.01 295742804309
## 8 Australia Oceania 1987 76.320 16257249 21888.89 355853119294
## 9 Australia Oceania 1992 77.560 17481977 23424.77 409511234952
## 10 Australia Oceania 1997 78.830 18565243 26997.94 501223252921
## 11 Australia Oceania 2002 80.370 19546792 30687.75 599847158654
## 12 Australia Oceania 2007 81.235 20434176 34435.37 703658358894
In the second argument, we are able to refer to columns of the data frame as though they were variables.
We constructed gap[is_australia_over_75,] by creating several variables and then combining them. It is also perfectly possible to do all this in one line:
gap[gap$country == "Australia" & gap$lifeExp > 75,]
## # A tibble: 5 x 7
## country continent year lifeExp pop gdpPercap gdp
## <chr> <chr> <int> <dbl> <int> <dbl> <dbl>
## 1 Australia Oceania 1987 76.320 16257249 21888.89 355853119294
## 2 Australia Oceania 1992 77.560 17481977 23424.77 409511234952
## 3 Australia Oceania 1997 78.830 18565243 26997.94 501223252921
## 4 Australia Oceania 2002 80.370 19546792 30687.75 599847158654
## 5 Australia Oceania 2007 81.235 20434176 34435.37 703658358894
If you encounter R code that is too difficult to read, it can often be broken down into multiple steps, with intermediate results stored in variables.
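However, we can’t do this with the conditions inside a call to the filter function, since it is magic. More precisely, it uses something called “non-standard evaluation”: column names such as country are only understood inside the call. For example, this doesn’t work, because country does not exist as a variable outside filter:
is_australia <- country == "Australia"
filter(gap, is_australia)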
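A filter call can still be broken into steps in other ways. The sketch below, on a toy data frame invented for illustration, shows three equivalent approaches: several conditions in one call, an intermediate data frame, and a logical vector built outside filter with $.

```r
library(tibble)
library(dplyr)

# Toy data, invented for illustration.
toy <- tibble(
    country = c("A", "A", "B"),
    year    = c(2002L, 2007L, 2007L),
    lifeExp = c(74.0, 76.5, 80.2))

# Several conditions can be given at once; filter combines them with "and".
filter(toy, country == "A", lifeExp > 75)

# The same result in two steps, storing the intermediate data frame.
toy_a <- filter(toy, country == "A")
filter(toy_a, lifeExp > 75)

# Outside filter, columns must be reached through the data frame with $.
is_a <- toy$country == "A"
toy[is_a & toy$lifeExp > 75, ]
```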
Simple plots can be created using the plot function.
plot(gap_australia$year, gap_australia$lifeExp)
However, we will see a much more flexible way of plotting in the final section of this workshop.
Data frames can be sorted using the arrange function in dplyr.
arrange(gap, country)
## # A tibble: 1,704 x 7
## country continent year lifeExp pop gdpPercap gdp
## <chr> <chr> <int> <dbl> <int> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.801 8425333 779.4453 6567086330
## 2 Afghanistan Asia 1957 30.332 9240934 820.8530 7585448670
## 3 Afghanistan Asia 1962 31.997 10267083 853.1007 8758855797
## 4 Afghanistan Asia 1967 34.020 11537966 836.1971 9648014150
## 5 Afghanistan Asia 1972 36.088 13079460 739.9811 9678553274
## 6 Afghanistan Asia 1977 38.438 14880372 786.1134 11697659231
## 7 Afghanistan Asia 1982 39.854 12881816 978.0114 12598563401
## 8 Afghanistan Asia 1987 40.822 13867957 852.3959 11820990309
## 9 Afghanistan Asia 1992 41.674 16317921 649.3414 10595901589
## 10 Afghanistan Asia 1997 41.763 22227415 635.3414 14121995875
## # ... with 1,694 more rows
The desc helper function can be used to arrange in descending order.
arrange(gap, desc(country))
## # A tibble: 1,704 x 7
## country continent year lifeExp pop gdpPercap gdp
## <chr> <chr> <int> <dbl> <int> <dbl> <dbl>
## 1 Zimbabwe Africa 1952 48.451 3080907 406.8841 1253572117
## 2 Zimbabwe Africa 1957 50.469 3646340 518.7643 1891590901
## 3 Zimbabwe Africa 1962 52.358 4277736 527.2722 2255531194
## 4 Zimbabwe Africa 1967 53.995 4995432 569.7951 2846372532
## 5 Zimbabwe Africa 1972 55.635 5861135 799.3622 4685169626
## 6 Zimbabwe Africa 1977 57.674 6642107 685.5877 4553746742
## 7 Zimbabwe Africa 1982 60.363 7636524 788.8550 6024110454
## 8 Zimbabwe Africa 1987 62.351 9216418 706.1573 6508240905
## 9 Zimbabwe Africa 1992 60.377 10704340 693.4208 7422611852
## 10 Zimbabwe Africa 1997 46.809 11404948 792.4500 9037850590
## # ... with 1,694 more rows
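arrange also accepts more than one column, in which case later columns break ties in earlier ones. A sketch on a toy data frame invented for illustration:

```r
library(tibble)
library(dplyr)

# Toy data, invented for illustration.
toy <- tibble(
    continent = c("X", "Y", "X", "Y"),
    country   = c("A", "B", "C", "D"),
    lifeExp   = c(60.2, 71.5, 55.0, 78.3))

# Sort by continent, then by descending life expectancy within each continent.
sorted <- arrange(toy, continent, desc(lifeExp))
sorted
```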
Which country had the lowest life expectancy in 1952? Which had the highest?
R has a variety of functions for summarizing a vector, including: sum, mean, min, max, median, sd.
mean( c(1,2,3,4) )
## [1] 2.5
We can use this on the Gapminder data.
gap2007 <- filter(gap, year == 2007)
mean(gap2007$lifeExp)
## [1] 67.00742
(Possibly this should be a weighted.mean, as countries have different populations, but let’s skip this detail.)
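To see why the weighting matters, consider a sketch with two hypothetical countries whose numbers are invented for illustration:

```r
life <- c(80, 60)     # Life expectancy in two hypothetical countries
pop  <- c(1e6, 9e6)   # Their populations

mean(life)                 # 70: each country counts equally
weighted.mean(life, pop)   # 62: each person counts equally
```

The weighted figure is (80 × 1 million + 60 × 9 million) / 10 million = 62, pulled toward the more populous country.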
The summarize function in dplyr allows these to be applied to data frames.
summarize(gap2007, mean_lifeExp=mean(lifeExp))
## # A tibble: 1 x 1
## mean_lifeExp
## <dbl>
## 1 67.00742
So far unremarkable, but summarize comes into its own when the group_by “adjective” is used. (See also apply, tapply in base R.)
summarize(group_by(gap, year), mean_lifeExp=mean(lifeExp))
## # A tibble: 12 x 2
## year mean_lifeExp
## <int> <dbl>
## 1 1952 49.05762
## 2 1957 51.50740
## 3 1962 53.60925
## 4 1967 55.67829
## 5 1972 57.64739
## 6 1977 59.57016
## 7 1982 61.53320
## 8 1987 63.21261
## 9 1992 64.16034
## 10 1997 65.01468
## 11 2002 65.69492
## 12 2007 67.00742
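Several summaries can be computed per group in a single call. The following sketch uses a toy data frame invented for illustration, together with n() from dplyr, which gives the number of rows in each group:

```r
library(tibble)
library(dplyr)

# Toy data, invented for illustration.
toy <- tibble(
    country = c("A", "A", "B", "B"),
    year    = c(2002L, 2007L, 2002L, 2007L),
    lifeExp = c(74.0, 76.5, 79.0, 80.2))

# One output row per country, with one column per named summary.
by_country <- summarize(
    group_by(toy, country),
    n_years     = n(),
    min_lifeExp = min(lifeExp),
    max_lifeExp = max(lifeExp))
by_country
```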
What is the average of gdpPercap for each continent in 2007?
Advanced: What is the total GDP and total population for each continent in 2007? Therefore, what is the correct GDP per capita for each continent?
We will finish this section by demonstrating a t-test as an example of statistical tests available in R.
Has life expectancy increased from 2002 to 2007?
gap2002 <- filter(gap, year == 2002)
gap2007 <- filter(gap, year == 2007)
t.test(gap2007$lifeExp, gap2002$lifeExp)
##
## Welch Two Sample t-test
##
## data: gap2007$lifeExp and gap2002$lifeExp
## t = 0.90822, df = 281.92, p-value = 0.3645
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.53211 4.15711
## sample estimates:
## mean of x mean of y
## 67.00742 65.69492
Since the same countries were measured in both years, this can actually be considered a paired sample t-test. We can specify paired=TRUE to t.test to perform a paired sample t-test (check this by looking at the help page with ?t.test). It’s important to first check that both data frames are in the same order.
all(gap2002$country == gap2007$country)
## [1] TRUE
t.test(gap2007$lifeExp, gap2002$lifeExp, paired=TRUE)
##
## Paired t-test
##
## data: gap2007$lifeExp and gap2002$lifeExp
## t = 14.665, df = 141, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.135561 1.489439
## sample estimates:
## mean of the differences
## 1.3125
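One way to understand the paired test is that it is a one-sample t-test on the differences between the pairs, which is why the order of the rows matters. A sketch with made-up measurements:

```r
# Toy paired measurements, invented for illustration.
before <- c(60.1, 71.4, 55.2, 80.0, 67.3)
after  <- c(61.0, 72.9, 56.1, 80.4, 69.0)

paired     <- t.test(after, before, paired = TRUE)
one_sample <- t.test(after - before)

# Both calls test whether the mean difference is zero.
paired$statistic
one_sample$statistic
all.equal(unname(paired$statistic), unname(one_sample$statistic))
```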
When performing a statistical test, it’s good practice to visualize the data to make sure there is nothing funny going on.
plot(gap2002$lifeExp, gap2007$lifeExp)
abline(0,1)
The result of a t-test is actually a value we can manipulate further.
result <- t.test(gap2007$lifeExp, gap2002$lifeExp, paired=TRUE)
class(result)
## [1] "htest"
typeof(result)
## [1] "list"
names(result)
## [1] "statistic" "parameter" "p.value" "conf.int" "estimate"
## [6] "null.value" "alternative" "method" "data.name"
result$p.value
## [1] 3.738317e-30
In R, a t-test is just another function returning just another type of data, so it can also be a building block.
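As a sketch of the building-block idea, the following runs a paired t-test for each of two hypothetical groups and collects the p-values. The data, the group names and the helper paired_p are all invented for illustration.

```r
# Toy paired data for two hypothetical groups, invented for illustration.
groups <- list(
    north = list(before = c(60, 62, 65, 70), after = c(61, 64, 66, 73)),
    south = list(before = c(50, 55, 58, 61), after = c(50, 56, 57, 61)))

# Hypothetical helper: run a paired t-test and keep only the p-value.
paired_p <- function(g) t.test(g$after, g$before, paired = TRUE)$p.value

# Apply the helper to every group, giving a named vector of p-values.
p_values <- sapply(groups, paired_p)
p_values
```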
Missing data, which R represents as NA.
Factors are a type of vector for categorical data. Factors are similar to character vectors but with an associated ordered set of “levels”. See the factor function. It may be necessary to convert character vectors into factors, for example to adjust the order of levels when they are displayed in a plot.
Matrices are similar to data frames, but the columns all contain the same type of data. Conceptually, in a data frame each observation is a row, but in a matrix each observation is a cell in the matrix. See the matrix and as.matrix functions. These are used in bioinformatics, for example RNA-Seq results summarized as a matrix of read counts associated with genes (rows) and samples (columns).
Joining data frames together. See functions such as left_join and bind_rows.
These are covered in our full day introductory course.
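As a very brief taste of the first two topics listed above, here is a sketch using made-up values and only base R functions:

```r
# mean() returns NA if any value is missing, unless na.rm=TRUE is given.
x <- c(1, NA, 3)
mean(x)
mean(x, na.rm = TRUE)
is.na(x)

# A factor stores categories together with an ordered set of levels.
# Here the levels are given explicitly, rather than in alphabetical order.
size <- factor(c("low", "high", "low"), levels = c("low", "high"))
levels(size)
```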
We also haven’t covered programming topics such as writing functions and for-loops. These are covered in our “more R” course.
Finally, we have not yet told the full dplyr story. For example, we haven’t mentioned the pipe %>%, which is key to writing elegant dplyr code. The R for Data Science book provides a more complete introduction to dplyr.
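For a first impression of the pipe, the sketch below rewrites a nested summarize(group_by(...)) call in piped form, on a toy data frame invented for illustration. The expression x %>% f(y) means f(x, y).

```r
library(tibble)
library(dplyr)

# Toy data, invented for illustration.
toy <- tibble(
    year    = c(2002L, 2002L, 2007L, 2007L),
    lifeExp = c(74.0, 79.0, 76.5, 80.2))

# Nested form: read from the inside out.
nested <- summarize(group_by(toy, year), mean_lifeExp = mean(lifeExp))

# Piped form: the same steps, read from left to right.
piped <- toy %>%
    group_by(year) %>%
    summarize(mean_lifeExp = mean(lifeExp))

all.equal(nested, piped)
```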