“Data frame” is R’s name for tabular data. We generally want each row in a data frame to represent a unit of observation, and each column to contain a different type of information about the units of observation. Tabular data in this form is called “tidy data”.

This section uses a collection of modern packages collectively known as the Tidyverse. R and its predecessor S have a history dating back to 1976. The Tidyverse fixes some dubious design decisions baked into “base R”, and provides its own slightly improved form of data frame. Sticking to the Tidyverse where possible is generally safer, because Tidyverse packages are more willing to generate errors rather than ignore problems.

If the Tidyverse is not already installed, you will need to install it. However, it is already installed on the server we are using today.

# install.packages("tidyverse")

We need to load this package in order to use it.

library(tidyverse)

The tidyverse package loads various other packages, setting up a modern R environment. In this section we will be using functions from the readr and dplyr packages.

R is a language with mini-languages within it, each addressing a specific problem domain. dplyr is such a mini-language: a set of “verbs” (functions) that work well together. dplyr, with the help of tidyr for some more complex operations, provides most of the manipulations you might need to perform on a data frame.

Loading data

We will use the read_csv function from readr. (See also read.csv in base R.)

gap <- read_csv("gapminder.csv")
## Parsed with column specification:
## cols(
##   country = col_character(),
##   continent = col_character(),
##   year = col_integer(),
##   lifeExp = col_double(),
##   pop = col_integer(),
##   gdpPercap = col_double()
## )
gap
## # A tibble: 1,704 x 6
##        country continent  year lifeExp      pop gdpPercap
##          <chr>     <chr> <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan      Asia  1952  28.801  8425333  779.4453
##  2 Afghanistan      Asia  1957  30.332  9240934  820.8530
##  3 Afghanistan      Asia  1962  31.997 10267083  853.1007
##  4 Afghanistan      Asia  1967  34.020 11537966  836.1971
##  5 Afghanistan      Asia  1972  36.088 13079460  739.9811
##  6 Afghanistan      Asia  1977  38.438 14880372  786.1134
##  7 Afghanistan      Asia  1982  39.854 12881816  978.0114
##  8 Afghanistan      Asia  1987  40.822 13867957  852.3959
##  9 Afghanistan      Asia  1992  41.674 16317921  649.3414
## 10 Afghanistan      Asia  1997  41.763 22227415  635.3414
## # ... with 1,694 more rows

This is data from Gapminder on life expectancy over time in different countries. The “unit of observation” is a country in a particular year.

Note: “tibble” refers to the Tidyverse’s improved form of data frame.
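The column types shown above were guessed by read_csv, and depending on the readr version, whole-number columns such as year may be guessed as double rather than integer. As a minimal sketch, the types can instead be stated explicitly with the col_types argument. The small file written here is made up so that the example is self-contained. It is not gapminder.csv.

```r
library(readr)

# Write a tiny made-up CSV to a temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
    "country,year,lifeExp",
    "Exampleland,1952,50.5",
    "Exampleland,1957,52.25"
), csv_path)

# col_types states the type of each column instead of relying on guessing.
toy <- read_csv(csv_path, col_types = cols(
    country = col_character(),
    year = col_integer(),
    lifeExp = col_double()
))

toy
```

Stating the types this way also means read_csv has no guessed column specification to report.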

Exploring

The View function gives us a spreadsheet-like view of the data frame.

View(gap)

However, understanding this data frame in R should be less a matter of using a graphical interface, and more a matter of using a variety of R functions to interrogate it.

nrow(gap)
## [1] 1704
ncol(gap)
## [1] 6
colnames(gap)
## [1] "country"   "continent" "year"      "lifeExp"   "pop"       "gdpPercap"
class(gap)
## [1] "tbl_df"     "tbl"        "data.frame"
typeof(gap)
## [1] "list"
summary(gap)
##    country           continent              year         lifeExp     
##  Length:1704        Length:1704        Min.   :1952   Min.   :23.60  
##  Class :character   Class :character   1st Qu.:1966   1st Qu.:48.20  
##  Mode  :character   Mode  :character   Median :1980   Median :60.71  
##                                        Mean   :1980   Mean   :59.47  
##                                        3rd Qu.:1993   3rd Qu.:70.85  
##                                        Max.   :2007   Max.   :82.60  
##       pop              gdpPercap       
##  Min.   :6.001e+04   Min.   :   241.2  
##  1st Qu.:2.794e+06   1st Qu.:  1202.1  
##  Median :7.024e+06   Median :  3531.8  
##  Mean   :2.960e+07   Mean   :  7215.3  
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
##  Max.   :1.319e+09   Max.   :113523.1

Tip

A data frame can also be created from vectors, with the tibble function. (Older code uses the name data_frame for this, which is now deprecated. See also data.frame in base R.) For example:

tibble(foo=c(10,20,30), bar=c("a","b","c"))
## # A tibble: 3 x 2
##     foo   bar
##   <dbl> <chr>
## 1    10     a
## 2    20     b
## 3    30     c

Tip

A data frame has column names (colnames), and base R data frames can also have row names (rownames). However the modern convention, which the Tidyverse enforces, is for a data frame to use column names but not row names. Typically a data frame contains a collection of items (rows), each having various properties (columns). If an item has an identifier such as a unique name, this would be given as just another column.
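If you are handed a base R data frame that does use row names, one way to follow this convention is to move the row names into an ordinary column. This is a sketch using rownames_to_column from the tibble package. The data and the column name "name" are made up for illustration.

```r
library(tibble)

# A base R data frame with row names (made-up values for illustration).
base_df <- data.frame(height = c(1.6, 1.8), row.names = c("ann", "bob"))

# Move the row names into an ordinary column, here called "name",
# then convert the result to a tibble.
tidy_df <- as_tibble(rownames_to_column(base_df, var = "name"))

tidy_df
```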

The count function from dplyr can help us understand the structure of this data frame. (See also table in base R.) count is a little magical: we can refer to columns of the data frame directly in the arguments to count.

count(gap, year)
## # A tibble: 12 x 2
##     year     n
##    <int> <int>
##  1  1952   142
##  2  1957   142
##  3  1962   142
##  4  1967   142
##  5  1972   142
##  6  1977   142
##  7  1982   142
##  8  1987   142
##  9  1992   142
## 10  1997   142
## 11  2002   142
## 12  2007   142
count(gap, country)
## # A tibble: 142 x 2
##        country     n
##          <chr> <int>
##  1 Afghanistan    12
##  2     Albania    12
##  3     Algeria    12
##  4      Angola    12
##  5   Argentina    12
##  6   Australia    12
##  7     Austria    12
##  8     Bahrain    12
##  9  Bangladesh    12
## 10     Belgium    12
## # ... with 132 more rows
count(count(gap, country), n)
## # A tibble: 1 x 2
##       n    nn
##   <int> <int>
## 1    12   142

There is data from 142 countries at 12 time points. The data is complete, in the sense that every country has a row for each of the 12 years.

Indexing

Data frames can be subset using [row,column] syntax.

gap[3,4]
## # A tibble: 1 x 1
##   lifeExp
##     <dbl>
## 1  31.997

Note that this is still wrapped in a data frame. (This is a behaviour specific to Tidyverse data frames.)
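To make the difference concrete, here is a small sketch comparing a base R data frame with a tibble holding the same made-up data.

```r
library(tibble)

# The same made-up data as a base R data frame and as a tibble.
base_df <- data.frame(x = c(1, 2, 3), y = c("a", "b", "c"))
tidy_df <- as_tibble(base_df)

# Base R drops down to a plain vector when the result is a single column.
base_value <- base_df[3, 1]

# A tibble always returns another tibble from [row, column].
tidy_value <- tidy_df[3, 1]

class(base_value)
class(tidy_value)

# drop = FALSE asks base R to keep the data frame wrapper too.
class(base_df[3, 1, drop = FALSE])
```

The tibble behaviour is more predictable, since the type of the result does not depend on how many columns happen to be selected.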

Columns can be given by name.

gap[3, "lifeExp"]
## # A tibble: 1 x 1
##   lifeExp
##     <dbl>
## 1  31.997

Either the row or the column may be omitted. Omitting the column retrieves the full row, and omitting the row retrieves the full column.

gap[3,]
## # A tibble: 1 x 6
##       country continent  year lifeExp      pop gdpPercap
##         <chr>     <chr> <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan      Asia  1962  31.997 10267083  853.1007

Multiple rows or columns may be retrieved using a vector.

rows_wanted <- c(1,3,5)
gap[rows_wanted,]
## # A tibble: 3 x 6
##       country continent  year lifeExp      pop gdpPercap
##         <chr>     <chr> <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan      Asia  1952  28.801  8425333  779.4453
## 2 Afghanistan      Asia  1962  31.997 10267083  853.1007
## 3 Afghanistan      Asia  1972  36.088 13079460  739.9811

Ok, so how do we actually get data out of a data frame?

Under the hood, a data frame is a list of column vectors. This is why typeof told us that gap was a list. We can use $ to retrieve columns, as in a list. (Occasionally it is also useful to use [[ ]] to retrieve columns, for example if the column name we want is stored in a variable.)

head( gap$lifeExp )
## [1] 28.801 30.332 31.997 34.020 36.088 38.438
head( gap[["lifeExp"]] )
## [1] 28.801 30.332 31.997 34.020 36.088 38.438

To get the lifeExp value of the third row as above, but unwrapped, we can use:

gap$lifeExp[3]
## [1] 31.997

All of these indexing and access methods can also be used with <- to modify values or add new columns. For example, suppose we wanted a GDP column. We can add this to the data frame with:

gap$gdp <- gap$gdpPercap * gap$pop
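The following sketch gathers these access methods in one place, using a small made-up data frame rather than gap. The column names are hypothetical.

```r
library(tibble)

# Made-up data for illustration.
df <- tibble(name = c("a", "b", "c"), pop = c(10, 20, 30), income = c(5, 4, 3))

# [[ ]] is handy when the column name is held in a variable.
column_wanted <- "pop"
df[[column_wanted]]

# Add a new column computed from existing ones.
df$total <- df$pop * df$income

# Modify a single value: row 2 of the pop column.
# Note that df$total was computed earlier and is not updated automatically.
df$pop[2] <- 25

df
```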

Logical indexing

A method of indexing that we haven’t discussed yet is logical indexing. Instead of specifying the row number or numbers that we want, we can give a logical vector which is TRUE for the rows we want and FALSE otherwise. This can also be used with vectors.
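As a small sketch of logical indexing on a plain vector (the values are made up for illustration):

```r
# Made-up values for illustration.
ages <- c(31, 12, 67, 45)

# A comparison produces a logical vector with one TRUE/FALSE per element.
is_adult <- ages >= 18
is_adult

# Keep only the elements where the logical vector is TRUE.
ages[is_adult]

# sum counts the TRUE values, since TRUE is treated as 1 and FALSE as 0.
sum(is_adult)
```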

Suppose we want just the data for Australia. == is a comparison operator meaning “equal to”.

is_australia <- gap$country == "Australia"

head(is_australia)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
sum(is_australia)
## [1] 12

We can now grab just those rows of the data frame relating to Australia:

gap_australia <- gap[is_australia,]

gap_australia
## # A tibble: 12 x 7
##      country continent  year lifeExp      pop gdpPercap          gdp
##        <chr>     <chr> <int>   <dbl>    <int>     <dbl>        <dbl>
##  1 Australia   Oceania  1952  69.120  8691212  10039.60  87256254102
##  2 Australia   Oceania  1957  70.330  9712569  10949.65 106349227169
##  3 Australia   Oceania  1962  70.930 10794968  12217.23 131884573002
##  4 Australia   Oceania  1967  71.100 11872264  14526.12 172457986742
##  5 Australia   Oceania  1972  71.930 13177000  16788.63 221223770658
##  6 Australia   Oceania  1977  73.490 14074100  18334.20 258037329175
##  7 Australia   Oceania  1982  74.740 15184200  19477.01 295742804309
##  8 Australia   Oceania  1987  76.320 16257249  21888.89 355853119294
##  9 Australia   Oceania  1992  77.560 17481977  23424.77 409511234952
## 10 Australia   Oceania  1997  78.830 18565243  26997.94 501223252921
## 11 Australia   Oceania  2002  80.370 19546792  30687.75 599847158654
## 12 Australia   Oceania  2007  81.235 20434176  34435.37 703658358894

We might also want to know which rows our logical vector is TRUE for. This is achieved with the which function. The result of this can also be used to index the data frame, as we saw earlier.

which_australia <- which(is_australia)
which_australia
##  [1] 61 62 63 64 65 66 67 68 69 70 71 72
gap[which_australia,]
## # A tibble: 12 x 7
##      country continent  year lifeExp      pop gdpPercap          gdp
##        <chr>     <chr> <int>   <dbl>    <int>     <dbl>        <dbl>
##  1 Australia   Oceania  1952  69.120  8691212  10039.60  87256254102
##  2 Australia   Oceania  1957  70.330  9712569  10949.65 106349227169
##  3 Australia   Oceania  1962  70.930 10794968  12217.23 131884573002
##  4 Australia   Oceania  1967  71.100 11872264  14526.12 172457986742
##  5 Australia   Oceania  1972  71.930 13177000  16788.63 221223770658
##  6 Australia   Oceania  1977  73.490 14074100  18334.20 258037329175
##  7 Australia   Oceania  1982  74.740 15184200  19477.01 295742804309
##  8 Australia   Oceania  1987  76.320 16257249  21888.89 355853119294
##  9 Australia   Oceania  1992  77.560 17481977  23424.77 409511234952
## 10 Australia   Oceania  1997  78.830 18565243  26997.94 501223252921
## 11 Australia   Oceania  2002  80.370 19546792  30687.75 599847158654
## 12 Australia   Oceania  2007  81.235 20434176  34435.37 703658358894

Comparison operators available are:

  • x == y – “equal to”
  • x != y – “not equal to”
  • x < y – “less than”
  • x > y – “greater than”
  • x <= y – “less than or equal to”
  • x >= y – “greater than or equal to”

More complicated conditions can be constructed using logical operators:

  • a & b – “and”, true only if both a and b are true.
  • a | b – “or”, true if either a or b or both are true.
  • ! a – “not”, true if a is false, and false if a is true.

For example, suppose we wanted to know in which years the life expectancy in Australia was 75 or higher.

over_75 <- gap$lifeExp >= 75
is_australia_over_75 <- is_australia & over_75

sum(is_australia_over_75)
## [1] 5
gap[is_australia_over_75,]
## # A tibble: 5 x 7
##     country continent  year lifeExp      pop gdpPercap          gdp
##       <chr>     <chr> <int>   <dbl>    <int>     <dbl>        <dbl>
## 1 Australia   Oceania  1987  76.320 16257249  21888.89 355853119294
## 2 Australia   Oceania  1992  77.560 17481977  23424.77 409511234952
## 3 Australia   Oceania  1997  78.830 18565243  26997.94 501223252921
## 4 Australia   Oceania  2002  80.370 19546792  30687.75 599847158654
## 5 Australia   Oceania  2007  81.235 20434176  34435.37 703658358894

Challenge

What continents are the countries divided into in this data?

Which countries in Asia had a life expectancy over 75 in 2007?

A dplyr shorthand

The above method is a little laborious. We have to keep mentioning the name of the data frame, and there is a lot of punctuation to keep track of. dplyr provides a slightly magical function called filter which lets us write more concisely.

filter(gap, country == "Australia")
## # A tibble: 12 x 7
##      country continent  year lifeExp      pop gdpPercap          gdp
##        <chr>     <chr> <int>   <dbl>    <int>     <dbl>        <dbl>
##  1 Australia   Oceania  1952  69.120  8691212  10039.60  87256254102
##  2 Australia   Oceania  1957  70.330  9712569  10949.65 106349227169
##  3 Australia   Oceania  1962  70.930 10794968  12217.23 131884573002
##  4 Australia   Oceania  1967  71.100 11872264  14526.12 172457986742
##  5 Australia   Oceania  1972  71.930 13177000  16788.63 221223770658
##  6 Australia   Oceania  1977  73.490 14074100  18334.20 258037329175
##  7 Australia   Oceania  1982  74.740 15184200  19477.01 295742804309
##  8 Australia   Oceania  1987  76.320 16257249  21888.89 355853119294
##  9 Australia   Oceania  1992  77.560 17481977  23424.77 409511234952
## 10 Australia   Oceania  1997  78.830 18565243  26997.94 501223252921
## 11 Australia   Oceania  2002  80.370 19546792  30687.75 599847158654
## 12 Australia   Oceania  2007  81.235 20434176  34435.37 703658358894

In the second argument, we are able to refer to columns of the data frame as though they were variables.
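Several conditions can be used with filter as well. The sketch below, on a small made-up data frame, shows two spellings that should give the same result: combining conditions with &, or passing them as separate arguments.

```r
library(dplyr)

# Made-up data for illustration.
df <- tibble(
    country = c("A", "A", "B", "B"),
    year = c(2002, 2007, 2002, 2007),
    lifeExp = c(70, 76, 80, 81)
)

# Conditions can be combined with & ...
with_and <- filter(df, country == "A" & lifeExp >= 75)

# ... or given as separate arguments, which filter also combines with "and".
with_comma <- filter(df, country == "A", lifeExp >= 75)

with_and
with_comma
```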

Different ways to do the same thing

We constructed gap[is_australia_over_75,] by creating several variables and then combining them. It is also perfectly possible to do all this in one line:

gap[gap$country == "Australia" & gap$lifeExp >= 75,]
## # A tibble: 5 x 7
##     country continent  year lifeExp      pop gdpPercap          gdp
##       <chr>     <chr> <int>   <dbl>    <int>     <dbl>        <dbl>
## 1 Australia   Oceania  1987  76.320 16257249  21888.89 355853119294
## 2 Australia   Oceania  1992  77.560 17481977  23424.77 409511234952
## 3 Australia   Oceania  1997  78.830 18565243  26997.94 501223252921
## 4 Australia   Oceania  2002  80.370 19546792  30687.75 599847158654
## 5 Australia   Oceania  2007  81.235 20434176  34435.37 703658358894

If you encounter R code that is too difficult to read, it can often be broken down into multiple steps, with intermediate results stored in variables.

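However, we can’t do this with calls to the filter function, since it is magic. More precisely, it uses something called “non-standard evaluation”: the condition is evaluated inside the data frame, where column names such as country are available. Outside of filter, country is not a variable, so this doesn’t work:

# Fails: country only exists as a column of gap, not as a variable.
is_australia <- country == "Australia"
filter(gap, is_australia)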
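Code that uses filter can still be broken into steps, provided the intermediate results are data frames rather than bare conditions. A minimal sketch with made-up data:

```r
library(dplyr)

# Made-up data for illustration.
df <- tibble(
    country = c("A", "A", "B"),
    lifeExp = c(70, 76, 80)
)

# Each call to filter returns a data frame, which can be stored and filtered again.
df_a <- filter(df, country == "A")
df_a_over_75 <- filter(df_a, lifeExp >= 75)

df_a_over_75
```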

Basic plotting

Simple plots can be created using the plot function.

plot(gap_australia$year, gap_australia$lifeExp)

However, we will see a much more flexible way of plotting in the final section of this workshop.

Sorting

Data frames can be sorted using the arrange function in dplyr.

arrange(gap, country)
## # A tibble: 1,704 x 7
##        country continent  year lifeExp      pop gdpPercap         gdp
##          <chr>     <chr> <int>   <dbl>    <int>     <dbl>       <dbl>
##  1 Afghanistan      Asia  1952  28.801  8425333  779.4453  6567086330
##  2 Afghanistan      Asia  1957  30.332  9240934  820.8530  7585448670
##  3 Afghanistan      Asia  1962  31.997 10267083  853.1007  8758855797
##  4 Afghanistan      Asia  1967  34.020 11537966  836.1971  9648014150
##  5 Afghanistan      Asia  1972  36.088 13079460  739.9811  9678553274
##  6 Afghanistan      Asia  1977  38.438 14880372  786.1134 11697659231
##  7 Afghanistan      Asia  1982  39.854 12881816  978.0114 12598563401
##  8 Afghanistan      Asia  1987  40.822 13867957  852.3959 11820990309
##  9 Afghanistan      Asia  1992  41.674 16317921  649.3414 10595901589
## 10 Afghanistan      Asia  1997  41.763 22227415  635.3414 14121995875
## # ... with 1,694 more rows

The desc helper function can be used to arrange in descending order.

arrange(gap, desc(country))
## # A tibble: 1,704 x 7
##     country continent  year lifeExp      pop gdpPercap        gdp
##       <chr>     <chr> <int>   <dbl>    <int>     <dbl>      <dbl>
##  1 Zimbabwe    Africa  1952  48.451  3080907  406.8841 1253572117
##  2 Zimbabwe    Africa  1957  50.469  3646340  518.7643 1891590901
##  3 Zimbabwe    Africa  1962  52.358  4277736  527.2722 2255531194
##  4 Zimbabwe    Africa  1967  53.995  4995432  569.7951 2846372532
##  5 Zimbabwe    Africa  1972  55.635  5861135  799.3622 4685169626
##  6 Zimbabwe    Africa  1977  57.674  6642107  685.5877 4553746742
##  7 Zimbabwe    Africa  1982  60.363  7636524  788.8550 6024110454
##  8 Zimbabwe    Africa  1987  62.351  9216418  706.1573 6508240905
##  9 Zimbabwe    Africa  1992  60.377 10704340  693.4208 7422611852
## 10 Zimbabwe    Africa  1997  46.809 11404948  792.4500 9037850590
## # ... with 1,694 more rows
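arrange also accepts several columns, with later columns used to break ties in earlier ones. A small sketch on made-up data:

```r
library(dplyr)

# Made-up data for illustration.
df <- tibble(
    group = c("b", "a", "b", "a"),
    value = c(1, 4, 3, 2)
)

# Sort by group first, then by descending value within each group.
sorted <- arrange(df, group, desc(value))
sorted
```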

Challenge

Which country had the lowest life expectancy in 1952? Which had the highest?

Summaries

R has a variety of functions for summarizing a vector, including: sum, mean, min, max, median, sd.

mean( c(1,2,3,4) )
## [1] 2.5

We can use this on the Gapminder data.

gap2007 <- filter(gap, year == 2007)
mean(gap2007$lifeExp)
## [1] 67.00742

(Possibly this should be a weighted.mean, as countries have different populations, but let’s skip this detail.)
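To illustrate the difference, here is a sketch with two made-up “countries” of very different size. weighted.mean is a base R function taking the values and their weights.

```r
# Made-up values: two "countries" with very different populations.
life_exp <- c(60, 80)
pop <- c(1, 3)

# The unweighted mean treats each country equally.
mean(life_exp)

# The weighted mean counts each country in proportion to its population:
# (60*1 + 80*3) / (1 + 3)
weighted.mean(life_exp, pop)
```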

The summarize function in dplyr allows these to be applied to data frames.

summarize(gap2007, mean_lifeExp=mean(lifeExp))
## # A tibble: 1 x 1
##   mean_lifeExp
##          <dbl>
## 1     67.00742

So far unremarkable, but summarize comes into its own when the group_by “adjective” is used. (See also apply, tapply in base R.)

summarize(group_by(gap, year), mean_lifeExp=mean(lifeExp))
## # A tibble: 12 x 2
##     year mean_lifeExp
##    <int>        <dbl>
##  1  1952     49.05762
##  2  1957     51.50740
##  3  1962     53.60925
##  4  1967     55.67829
##  5  1972     57.64739
##  6  1977     59.57016
##  7  1982     61.53320
##  8  1987     63.21261
##  9  1992     64.16034
## 10  1997     65.01468
## 11  2002     65.69492
## 12  2007     67.00742
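Several summaries can be computed in one call. The sketch below uses a small made-up data frame. n() is a dplyr helper that counts the rows in each group.

```r
library(dplyr)

# Made-up data for illustration.
df <- tibble(
    group = c("a", "a", "b", "b", "b"),
    value = c(1, 3, 10, 20, 30)
)

# One output row per group, one output column per summary.
by_group <- summarize(
    group_by(df, group),
    n_rows = n(),
    mean_value = mean(value),
    max_value = max(value)
)
by_group
```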

Challenge

What is the average of gdpPercap for each continent in 2007?

Advanced: What is the total GDP and total population for each continent in 2007? Therefore, what is the correct GDP per capita for each continent?

t-test

We will finish this section by demonstrating a t-test as an example of statistical tests available in R.

Has life expectancy increased from 2002 to 2007?

gap2002 <- filter(gap, year == 2002)
gap2007 <- filter(gap, year == 2007)

t.test(gap2007$lifeExp, gap2002$lifeExp)
## 
##  Welch Two Sample t-test
## 
## data:  gap2007$lifeExp and gap2002$lifeExp
## t = 0.90822, df = 281.92, p-value = 0.3645
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.53211  4.15711
## sample estimates:
## mean of x mean of y 
##  67.00742  65.69492

Because the same countries were measured in both years, this can actually be treated as a paired sample t-test. We can specify paired=TRUE to t.test to perform a paired sample t-test (check this by looking at the help page with ?t.test). It’s important to first check that both data frames are in the same order, so that each pair of values refers to the same country.

all(gap2002$country == gap2007$country)
## [1] TRUE
t.test(gap2007$lifeExp, gap2002$lifeExp, paired=TRUE)
## 
##  Paired t-test
## 
## data:  gap2007$lifeExp and gap2002$lifeExp
## t = 14.665, df = 141, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.135561 1.489439
## sample estimates:
## mean of the differences 
##                  1.3125
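One way to see what the paired test is doing: it is equivalent to a one-sample t-test on the differences within each pair. A sketch with made-up before/after measurements:

```r
# Made-up before/after measurements on the same five subjects.
before <- c(10, 12, 9, 14, 11)
after <- c(12, 13, 10, 17, 12)

# A paired t-test ...
paired_result <- t.test(after, before, paired = TRUE)

# ... is equivalent to a one-sample t-test on the differences.
diff_result <- t.test(after - before)

paired_result$statistic
diff_result$statistic
```

This is why pairing can give a much smaller p-value: variation between subjects cancels out in the differences.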

When performing a statistical test, it’s good practice to visualize the data to make sure there is nothing funny going on.

plot(gap2002$lifeExp, gap2007$lifeExp)
abline(0,1)

The result of a t-test is actually a value we can manipulate further.

result <- t.test(gap2007$lifeExp, gap2002$lifeExp, paired=TRUE)

class(result)
## [1] "htest"
typeof(result)
## [1] "list"
names(result)
## [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
## [6] "null.value"  "alternative" "method"      "data.name"
result$p.value
## [1] 3.738317e-30

In R, a t-test is just another function returning just another type of data, so it can also be a building block.
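As a sketch of using a test result as a building block, the pieces of the returned list can be pulled out and used in further code. The data here are made up, and the 0.05 threshold is only a common convention.

```r
# Made-up data: two measurements on the same six subjects.
x <- c(5.1, 4.9, 6.2, 5.8, 6.0, 5.5)
y <- c(4.8, 4.7, 5.9, 5.9, 5.4, 5.1)

toy_result <- t.test(x, y, paired = TRUE)

# Pull out individual pieces of the result.
toy_result$estimate      # estimated mean of the differences
toy_result$conf.int      # confidence interval, a numeric vector of length 2

# Use the p-value in further code, here a simple decision at the 0.05 level.
is_significant <- toy_result$p.value < 0.05
is_significant
```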

Some topics not covered

These are covered in our full day introductory course.

We also haven’t covered programming topics such as writing functions and for-loops. These are covered in our “more R” course.

Finally, we have not yet told the full dplyr story. For example, we haven’t mentioned the pipe %>%, which is key to writing elegant dplyr code. The R for Data Science book provides a more complete introduction to dplyr.

