Chapter 2 Data Manipulation

In this chapter we load and transform data. That is a standard step and probably the step taking the most time, when analyzing data. Preparing data is necessary, since data analysis requires data in a certain format. Which format depends on the model. I will introduce you to dplyr, the standard package for manipulating data. At first, it seems a bit complicated, but I guarantee you, if you do data analyses in R, you will have to work with it a lot and you will become quite fluent in it. The goal of this chapter is to teach you how to manipulate data so that we can bring it into the format we need it to analyse it properly. Have fun!

2.1 Packages

This far, we’ve only covered so-called “base R functions” or “built-in functions”, but R has an active community and sometimes further operations are needed, so we use packages. These are including further functions, which we will use heavily in the following section.

2.1.1 The `tidyverse` package

One of the most influential and widely used package in R is the tidyverse package. This package includes several other packages, which are key for data manipulation e.g. dplyr, ggplot2, stringr, readr, tidyr.

2.2 Working with packages

2.2.1 Installing packages

To install packages you use the very creative install.packages() command in R. Note that it is necessary to directly install a package in R. This step is only required once:

Call the install.packages() command.
You put the name of the package in quotation marks into the function.

install.packages("pacman")

2.2.2 Loading your packages

While you only need to install a package once, you need to load it every time in your script, when you open it. You can do that with the library() function in R:

Call the library() function.
Put the name of

library(pacman)

It is always important to have an efficient workflow in R. Traditional R users, load all packages they need at the beginning of their page. Logically, so they just need to go back to the top of the script and need to load it every time they open the script. But there are way more elegant and pragmatic ways to do that.

One way is the pacman package:

First, if you use the name of package and put a “::” behind it you tell R to go into the package and to specifically get one command of the package, in our case the p_load command.
Second, the p_load command, loads the packages in the brackets and checks if they are installed, if not, it automatically installs and loads them.

pacman::p_load("tidyverse", "psych", "gapminder") #loading packages

2.3 The Data we will work with: The European Social Survey (ESS)

For the following Data Manipulation Part, we will use the European Social Survey Round 10 with the topic “Democracy, Digital social contacts”. It is a high-quality survey conducted in 31 European countries. Round 10 was conducted in 2020 and is the most recent ESS. We will use it, since survey data is quite popular among students and further at some point everyone needs to work with it. But do not worry if you do not like survey data, we will also cover other prominent Datasets.

You can freely download it via the website of the ESS. From there you need to go to the Data Portal and than you can download the Round you want, in the format you want. As already mentioned, use the .dta or .csv format.

Variable	Description	Scales
idnt	Respondent’s identification number	unique number from 1-9000
year	The year when the survey was conducted	only 2020
cntry	Country	BE, BG, CH, CZ, EE, FI, FR,GB, GR, HR, HU, IE, IS, IT, LT,NL, NO, PT, SI, SK
agea	Age of the Respondent, calculated	Number of Age = 15-90 999 = Not available
gndr	Gender	1 = Male; 2 = Female; 9 = No answer
happy	How happy are you	0 (Extremly unhappy) - 10 (Extremly happy); 77 = Refusal; 88 = Don’t Know; 99 = No answer
eisced	Highest level of education, ES - ISCED	0 = Not possible to harmonise into ES-ISCED; 1 (ES-ISCED I , less than lower secondary) - 7 (ES-ISCED V2, higher tertiary education, => MA level; 55 = Other; 77 = Refusal; 88 = Don’t know; 99 = No answer
netusoft	Internet use, how often	1 (Never) - 5 (Every day); 7 = Refusal; 8 = Don’t know; 9 = No answer
trstprl	Most people can be trusted or you can’t be too careful	0 (You can’t be too careful) - 10 (Most people can be trusted); 77 = Refusal; 88 = Don’t Know; 99 = No answer
lrscale	Left-Right Placement	0 (Left) - 10 (Right); 77 = Refusal; 88 = Don’t know; 99 = No answer
dscrgrp	Would you describe yourself as being a member of a group that is discriminated against in this country?	1 = Yes; 2 = No; 77 = Refusal; 88 = Don’t know; 99 = No answer

Note: In the following, I will simulate the ESS with the same names and same range values. Which means if you read in the actual ESS, you should be able to run the code as well without problems!

I do so for reproducibility reasons. When downloading the script, users should be able to run the whole script with one click. This is because users, who are new to R might have problems with setting working directories, since working directories are prone to errors, especially at the beginning.

2.3.1 Load the data

2.3.1.1 How to load data

The first thing to do, when cleaning data is to load the data into R. The first thing to ensure is to know, where R can take the data from and load it into its own environment. You can call data from your local devise or remote through e.g. an API:

Remote: This means the dataset is laying around on some external Cloud or Server and you have the possibility to load it into R, without downloading it on your local devise. Depending on the dataset there are different ways, I will show you later how to get data from the World Bank.
Local: To get data from your devise, you have to tell R, where to find it, meaning you tell him the path. Go to your file and click on it. The path will be shown in the address bar, just copy it and paste it in R with the responding command. The command is defined by the type of the file as you can see in the table below. Here is an example how it could look wit a .csv file

data <- read.csv("C:/Users/YourUsername/Documents/data.csv")

This table presents some of the most frequent data types, which you can download and load into R. There are of course more data types, and to find out if you can read them into R, you can just google the type and look for the command. Mostly you will find a package with a command. And mostly, those commands start with “read”.

Examples of Different Data Types and how to load them into R
File Stata	File Extension .dta	Package haven	Command `read_dta()`

CSV-Files	.csv	readr (is included in the tidyverse package)	`read_csv()`
Excel-Files	.xlsx; .xls	read_excel() \| `read.rds()`
RData (also RDS Data)	.RData;.rds	base R functions, no package required	`load()` , `read.rds()`

2.3.1.2 Working Directories and R-Projects

If you have more data and files you want to load into R, it is not recommended to copy always the path of every file. Instead it is common to work with working directories or R-Projects.

There is always a working directory you can find out, in which working directory you are currently at by using the command getwd(), just run it in your console and R will give you the path it currently is working at. Let us assume you saved all your data in a folder called “Intro_to_R_course”, then you can set the working directory separately with this command (do not forget to change the names in the paths):

setwd("C:/Users/YourUsername/Intro_to_R_course")
After you set the working directory you can run again getwd() and the path now has to be changed to the content of the setwd(). The advantage is now that if the working directory is set, you can load in the data without specifying the path. If your data is in the path of the working directory you can load like that:

data <- read.csv("data.csv")
You can also create an R-Project. If you click on file > New Project > New Directory > New Project you can determine the working directory and create a folder in it. In this folder, there will be an R-Project file, if you open this file, a blank R environment will appear. Now, you can create files in this folder. And every time you enter R through the R-Project file, you do not have to set any working directories. Because if the R-Projects opens R, it automatically sets the working directory to where you created the folder.
I would not recommend to work with R-Projects in the beginning, but when you work with collaborators and want to make your code reproducible for others, then you will have to work with R-Projects one day. Again, this also the reason, I just simulate all the data, so you can work with the code without worrying about any working directories.

2.3.2 One last thing: Pipelines

Sometimes codes have several dimensions, which could make it quite complicated

leave_house(get_dressed(get_out_of_the_bed(wake_up(me))))

Well, as you can see there are too many dimensions and with tidyverse you can basically split it up into so-called Pipelines, for them you use a Pipe %>% :

me %>% wake_up() %>% get_out_of_the_bed() %>% get_dressed() %>% leave_house

It is the same code, in R this code would do the same. But the advantage is that it is way more intuitive and makes the code clear. Here an example:

First, I create a vector with three random numbers named q.
Then I take the mean, then the exponential and lastly the square root. I could just wrap the codes around each other like this: sqrt(exp(mean(q))).
But I could also use pipes, where I clearly see that first the mean, then the exponential and then the square root is taken: q %>% mean() %>% exp() %>% sqrt(). Run both codes, they are producing the same result.

#Making a vector with three random numbers
q <- c(6,3,8)

#Taking first the mean, second the exponential and lastly the square root
sqrt(exp(mean(q)))

## [1] 17.00204

# With a Pipe 
q %>% 
  mean() %>%
  exp() %>%
  sqrt()

## [1] 17.00204

This was an easy and short example, but as you will see at the end of this chapter, the code can get really fast really messy and to have a clean code, we will use pipes. Furthermore, tidyverse users use pipes all the times, so even if you do not use them, you have to understand them.

2.4 Let’s wrangle the data: The `dplyr` package

The dplyr package is THE standard package, when it comes to data manipulation (next to Base R of course). It has essential functions, and helpful further functions. If you are able to understand the flexibility of these functions you can easily handle every data set.

2.4.1 The `filter()` command

The first function I introduce you is the filter() function. Within the filter() function we can define certain conditions to cut our Data Set to.

We need the filter() function and then a variable we want to filter based on. In my example, I only want to keep all observations, which have “HU” in their cntry - variable. Substantially this means, I cut down to all observations from Hungary. I use the == operator since I want to have all observations where the condition is true.

Here is a quick reminder of logical operators in R:

Logical Operator	Meaning
==	equals
<	smaller than
>	greater than
<=	less than or equal to
>=	greater than or equal to
! (e.g. !=; >!; <!…)	not equal, not greater than, not smaller than
&	element-wise AND operator. It returns TRUE if both elements are true
\|	element-wise OR operator. It returns TRUE if one of the statements is TRUE

2.4.1.1 Filtering for only one condition

We start by defining a new object, let us call it d1.
Then we take the data we want to filter, in our case ess.
We define a pipe, write down filter and define a condition, in our case that only cases where cntry is equal to the iso2c code of Hungary, "HU".
In the next code, we do the same and filter for cases that are equal or smaller than 40. Thus we get a dataset with observations who are 40 or younger.

Note: Since the cntry variable is a character variable, we have to put the condition in square brackets. The variable agea is a numeric variable, therefore we only need the number.

#filtering for cases only in Hungary
d1 <- ess %>% 
  filter(cntry == "HU")

#checking it
head(d1)

##   idnt year cntry agea gndr happy eisced netusoft trstprl
## 1 4501 2020    HU   21    2     3      6        4       3
## 2 4502 2020    HU   48    1     5      5        3       5
## 3 4503 2020    HU   80    1     6      3        4       4
## 4 4504 2020    HU   38    1     6      3        2       1
## 5 4505 2020    HU   40    2    10      3        3       9
## 6 4506 2020    HU   40    1     1      7        2       2
##   lrscale
## 1       2
## 2       3
## 3       1
## 4       3
## 5       7
## 6       3

#We only want participants younger than 40
d2 <- ess %>%  
  filter(agea <= 40)

#checking it
head(d2)

##   idnt year cntry agea gndr happy eisced netusoft trstprl
## 1    3 2020    BE   28    2     3      4        5       4
## 2    8 2020    BE   28    2     5      1        2       1
## 3    9 2020    BE   39    2     8      3        1       4
## 4   12 2020    BE   23    1    77      1        4       4
## 5   14 2020    BE   40    2     3      5        2       3
## 6   15 2020    BE   21    1     1      4        1       8
##   lrscale
## 1       7
## 2       1
## 3       6
## 4       7
## 5       2
## 6       5

2.4.1.2 Filtering for multiple condition

#filtering for cases in Hungary and France
d1 <- ess %>% 
  filter(cntry %in% c("HU", "FR"))

#Checking it
head(d1)

##   idnt year cntry agea gndr happy eisced netusoft trstprl
## 1 2701 2020    FR   41    9     7      7        8       8
## 2 2702 2020    FR   64    1     2      5        4       5
## 3 2703 2020    FR   90    1     5      6        5       1
## 4 2704 2020    FR   23    1     9      3        3       9
## 5 2705 2020    FR   30    2     8      7        2       7
## 6 2706 2020    FR   42    2    10     88        9       9
##   lrscale
## 1       6
## 2       4
## 3       8
## 4       3
## 5       9
## 6       5

#filtering for cases under 40 in Hungary and France
d2 <- ess %>%
  filter(cntry %in% c("HU", "FR") &
           agea <= 40)

#Checking it
head(d2)

##   idnt year cntry agea gndr happy eisced netusoft trstprl
## 1 2704 2020    FR   23    1     9      3        3       9
## 2 2705 2020    FR   30    2     8      7        2       7
## 3 2708 2020    FR   17    1     1      2        2       8
## 4 2709 2020    FR   26    2    10      1        4       9
## 5 2710 2020    FR   34    1     8      3        1       3
## 6 2711 2020    FR   30    1     5      3        5       9
##   lrscale
## 1       3
## 2       9
## 3       3
## 4       6
## 5       8
## 6      10

#d2 <- ess %>% #Our dataset
#  filter(cntry %in% c("HU", "FR"), 
#           agea <= 40) #filtering for cases under 40 in Hungary and France with a comma
#head(d2)

2.4.2 The `select()` function:

We obviously do not care about all variables a dataset can offer (mostly). To select the variables we need, we can use the select() function. This of course depends on our research question. Let us say we want to select the year, the country, the happy variable, age, gender, income and education.

2.4.2.1 Selecting and deleting single Rows

We only have to pass their column names to select(). That’s it.

#Selecting relevant variables
d1 <- ess %>%
   select(year, cntry, happy, agea, gndr, eisced) 

#Checking it
head(d1)

##   year cntry happy agea gndr eisced
## 1 2020    BE     1   45    1      6
## 2 2020    BE     3   65    1      5
## 3 2020    BE     3   28    2      4
## 4 2020    BE     1   81    1      7
## 5 2020    BE     4   56    2      7
## 6 2020    BE     3   64    1      5

2.4.2.2 Deleting Rows

To delete row, you just put a minus in front of the column, you want to delete. That’s it, the rest stays the same.

#We delete columns by simply putting a comma before it
d2 <- d1 %>% 
  select(-agea)

#Checking it
head(d2)

##   year cntry happy gndr eisced
## 1 2020    BE     1    1      6
## 2 2020    BE     3    1      5
## 3 2020    BE     3    2      4
## 4 2020    BE     1    1      7
## 5 2020    BE     4    2      7
## 6 2020    BE     3    1      5

2.4.2.3 Combining `select()` with the `filter()` function

One huge advantage of piping is, that we can clear our data in one step or at least big steps. The only thing you have to be aware of is what you put there. Remember the last command, is always the first to be executed.

In this example, we select the 7 variables and then filter it for all observations under 40.
The two commands are separated by a pipe.

#Combining codes
d1 <- ess %>%
  filter(agea < 40) %>%
  select(year, cntry, happy, agea, gndr, eisced)

#Checking it
head(d1)

##   year cntry happy agea gndr eisced
## 1 2020    BE     3   28    2      4
## 2 2020    BE     5   28    2      1
## 3 2020    BE     8   39    2      3
## 4 2020    BE    77   23    1      1
## 5 2020    BE     1   21    1      4
## 6 2020    BE     5   23    2      6

2.4.3 The `arrange()` function

If we want our data to be in a certain order, we arrange it with this function.

2.4.3.1 Arranging in ascending order

The arrange() function is called, and afterwards we call the variable we want to arrange based on. The default function of arrange, orders the data always in ascending order.

#Adding arrange()
d1 <- ess %>%
  filter(agea < 40) %>%
  select(year, cntry, happy, agea, gndr, eisced) %>% 
  arrange(agea) 

#Checking it
head(d1)

##   year cntry happy agea gndr eisced
## 1 2020    BE     6   15    2      4
## 2 2020    BE     2   15    1      1
## 3 2020    BG     3   15    9     99
## 4 2020    BG     4   15    1      6
## 5 2020    BG     2   15    2      2
## 6 2020    BG     2   15    1      3

2.4.3.2 Arranging in descending order

If we want to order them in ascending order, we have to call desc() inside the arrange() and put the name of the variable inside desc().

#arranging in descending order 
d1 <- ess %>%
  filter(agea < 40) %>%
  select(year, cntry, happy, agea, gndr, eisced) %>% 
  arrange(desc(agea))

#Checking it
head(d1)

##   year cntry happy agea gndr eisced
## 1 2020    BE     8   39    2      3
## 2 2020    BE     1   39    1      2
## 3 2020    BE    10   39    2      5
## 4 2020    BE     2   39    1      3
## 5 2020    BE     6   39    2      7
## 6 2020    BE     3   39    1      1

2.4.4 The `rename()` and `relocate()` function

Two functions to make our dataset structured more useful are rename() and relocate(), well they do what they basically named after:

2.4.4.1 Renaming Variables: `rename()`

Rename() follows a simple logic, you call the function and write down the new name, thus the name you want to assign, then you put an equal sign, and put in the old name, thus the current name of the column.

Note: If you have a variable, which is binary, thus has two discrete categories, you name it after the category which corresponds to the higher value. If male is 0 and female is 1, you name it after the higher category 1, therefore the variable is named female.

#Renaming variables
d1 <- ess %>%
  filter(agea < 40) %>%
  select(year, cntry, happy, agea, gndr, eisced) %>% 
  arrange(desc(agea)) %>%
  rename(country = cntry, 
         age = agea, 
         education = eisced, 
         female = gndr) 

#Checking it
head(d1)

##   year country happy age female education
## 1 2020      BE     8  39      2         3
## 2 2020      BE     1  39      1         2
## 3 2020      BE    10  39      2         5
## 4 2020      BE     2  39      1         3
## 5 2020      BE     6  39      2         7
## 6 2020      BE     3  39      1         1

2.4.4.2 Relocating Variables: `relocate()`

You call relocate and determine the order of the columns.
You can also rearrange single columns, you call the name of the column, which you want to rarrange. and then you either call .before and a column or .after and a column. And the column you want to rearrange is placed before or after the column you want to place. You separate both arguments with a comma.

#relocating variables
d1 <- ess %>%
  filter(agea < 40) %>%
  select(year, cntry, happy, agea, gndr, eisced) %>% 
  arrange(desc(agea)) %>%
  rename(country = cntry, 
         age = agea, 
         education = eisced, 
         female = gndr) %>% 
  relocate(education, age, female, country, happy, year) #determine the order

#Checking it
head(d1)

##   education age female country happy year
## 1         3  39      2      BE     8 2020
## 2         2  39      1      BE     1 2020
## 3         5  39      2      BE    10 2020
## 4         3  39      1      BE     2 2020
## 5         7  39      2      BE     6 2020
## 6         1  39      1      BE     3 2020

#relocate after 
d2 <- ess %>%
  filter(agea < 40) %>%
  select(year, cntry, happy, agea, gndr, eisced) %>% 
  arrange(desc(agea)) %>%
  rename(country = cntry, 
         age = agea, 
         education = eisced, 
         female = gndr) %>%
  relocate(country, .after = age) 

#Checking it
head(d2)

##   year happy age country female education
## 1 2020     8  39      BE      2         3
## 2 2020     1  39      BE      1         2
## 3 2020    10  39      BE      2         5
## 4 2020     2  39      BE      1         3
## 5 2020     6  39      BE      2         7
## 6 2020     3  39      BE      1         1

#relocating before 
d3 <- ess %>%
  filter(agea < 40) %>%
  select(year, cntry, happy, agea, gndr, eisced) %>% 
  arrange(desc(agea)) %>%
  rename(country = cntry, 
         age = agea, 
         education = eisced, 
         female = gndr) %>%
  relocate(country, .before = age) 

#Checking it
head(d3)

##   year happy country age female education
## 1 2020     8      BE  39      2         3
## 2 2020     1      BE  39      1         2
## 3 2020    10      BE  39      2         5
## 4 2020     2      BE  39      1         3
## 5 2020     6      BE  39      2         7
## 6 2020     3      BE  39      1         1

2.4.5 The `mutate()` function

The next function is the powerful mutate() command. For the start, just think about mutate as a Variable with which you can transform or mutate variables to other variables as you please.

You call mutate() and first you define the name of a new column with a name that is not existent in your dataset. You could also use the name of an existing column, but be careful! Then the new values are overwriting the old ones and we do not want that necessarily, therefore I recommend to always define new columns.
After defining the new name you put an equal sign after it and define the calculation. In our case we just multiply happy, so all values in the happy column by 10

#mutating variables
d1 <- ess %>%
  mutate(happy_10 = happy*10) 

#checking it
head(d1)

##   idnt year cntry agea gndr happy eisced netusoft trstprl
## 1    1 2020    BE   45    1     1      6        2       2
## 2    2 2020    BE   65    1     3      5        4       3
## 3    3 2020    BE   28    2     3      4        5       4
## 4    4 2020    BE   81    1     1      7        3       2
## 5    5 2020    BE   56    2     4      7        5       6
## 6    6 2020    BE   64    1     3      5        4       6
##   lrscale happy_10
## 1       3       10
## 2       6       30
## 3       7       30
## 4       6       10
## 5       4       40
## 6      10       30

We can also use mutate for more than calculations. We can also define new columns by mutating existing columns into different classes:

The variable new_variable is just a random mathematical operation including two variables.
The second call in mutate() changes the class of the gndr variable to a character and saves it as such in the dataset.

#more mutating
d2 <- ess %>% 
  mutate(new_variable = happy*10/eisced+67, 
         female_char = as.character(gndr)) %>% 
  select(female_char, new_variable)

#Checking it
head(d2)

##   female_char new_variable
## 1           1     68.66667
## 2           1     73.00000
## 3           2     74.50000
## 4           1     68.42857
## 5           2     72.71429
## 6           1     73.00000

What makes mutate() so powerful is that you can use other functions in it to define your variables as you want. Image you have the happy variable. Quick reminder, this variable contains the answers to the question “How happy are you?” in the questionnaire of the European Social Survey. It is scaled on a 0 (not happy at all) to 10(very happy).

Let us say we want to change that. We want to make a variable with only three categories (unhappy, neutral, happy). We decide that all values from 0-4 should be classifies as unhappy, 5 should be classified as neutral and everything above 5 as happy. How can we do that in R?

There are different ways and I will show you some of them:

2.4.5.1 Recoding with `mutate()` using `recode().`

You can use the recode() function:

First, we define a new column name, in our example happy_cat.
Then we call recode() and inside of it we call the variable we want to transform, in our case happy.
We put call the category we want to transform in these brackets ``. We put an equal sign after it and define the new value we want to assign to the category.

Note: We can assign different values, do not worry about the “NA_real_” value I will come back to that later

#Get an overview of the variable
table(ess$happy)

## 
##   1   2   3   4   5   6   7   8   9  10  77  88  99 
## 849 918 888 897 888 852 883 877 908 895  49  46  50

#recoding variables
d1 <- ess %>% 
  mutate(
    gndr_fac = as.factor(gndr), #always check the class
    happy_cat = dplyr::recode(happy,
                    `1` = 0,
                    `2` = 0,
                    `3` = 0,
                    `4` = 0,
                    `5` = 1,
                    `6` = 2,
                    `7` = 2,
                    `8` = 2,
                    `9` = 2,
                   `10` = 2,
                   `77` = NA_real_, 
                   `88` = NA_real_,
                   `99` = NA_real_),
    female = dplyr::recode(gndr_fac, 
                    `1` = "Male", 
                    `2` = "Female",
                    `9` = NA_character_))

#Let us check how it worked out 
table(d1$happy_cat)

## 
##    0    1    2 
## 3552  888 4415

table(d1$female)

## 
##   Male Female 
##   4381   4404

2.4.5.2 Recoding with `mutate()` using `case_when()`

As you see, the recode() command is quite extensive. The tidyverse offers a way more intuitive command, the case_when() function. The function case_when is a generalized ifelse function. Which means we can use logical operators. The recode() function is way too extensive, it gives you full control over the data, but we do not that much control:

We again define happy_cat.
We call case_when() and inside we call our variable we want to transform.
We define a logical statements, in our case that happy smaller than 5, meaning that we tell R that all values under 5 should be transformed.
We call the wave ~ and tell R what value should substitute all values which are TRUE for the logical statement.

Note: What is not explicitly stated in the case_when() function will be coded as NA.

#recoding with case_when
d1 <- ess %>% 
  mutate(gndr_fac = as.factor(gndr),
    happy_cat = case_when(
    happy < 5 ~ 0,
    happy == 5 ~ 1,
    happy > 5 ~ 2),
    female = case_when(
    gndr == 1 ~ "Male",
    gndr == 2 ~ "Female"
    ))

#Checking it
table(d1$female)

## 
## Female   Male 
##   4404   4381

table(d1$happy_cat)

## 
##    0    1    2 
## 3552  888 4560

2.4.5.3 Recoding with `mutate()` using `ifelse()`

Do you remember the ifelse() function? As already mentioned, the case_when() command is a generalized ifelse() function. If you want to keep it old school, we can also recode with the ifelse() function:

#recoding with ifelse function
d1 <- ess %>% 
  mutate(gndr_fac = as.factor(gndr),
         happy_cat = ifelse(happy < 5, 0, 
                            ifelse(happy == 5, 1, 
                                   ifelse(happy > 5, 2, NA
                                          ))),
         female = ifelse(gndr_fac == 1, "Male",
                         ifelse(gndr_fac == 2, "Female", NA))
  )

#Check it
table(d1$happy_cat)

## 
##    0    1    2 
## 3552  888 4560

table(d1$female)

## 
## Female   Male 
##   4404   4381

2.4.6 Handling Missing Values/Incomplete Data

As you saw right now, not all data in a dataset is complete. Of course not, there are several sources, which can lead to incomplete/missing data. Can you think of reasons why?

In Data Sciences we need to deal with missing values directly. If we look into the codebook, the ESS declares different types of missing values with high numbers: 7(7) means “Refusal”, so the respondent refused to answer, 8(8) means “dont know”, and 9(9) “No answer”.

Note that the ESS does so, since some researchers are interested in missing values, and why they happen, so they can investigate it.

For us, this is a problem, because we cannot run an analysis with missing values. There are two options:

Using statistics to artificially fill them out, this called multiple imputation techniques, but this requires advanced data science knowledge so I do not recommend that for beginners.
Just delete incomplete observations to have a dataset without missing values

The ESS assigns values to missing values. First, we have to tell R that we do want those values to be missing values otherwise it biases our analyses. In the following, we have to recode the variables and tell R, that the missing values are declared as such:

I already showed you how to do so. In the following I will use case_when()

We will do the second one, and I already showed how to recode useless values to NAs so this should be clear by now. Remember, the ifelse() and recode() explicitly need input to turn values into NAs.

#Creating missing values and showing a mutating workflow
d1 <- ess %>%
  filter(agea >=40) %>% 
  select(year, cntry, netusoft, agea, eisced, gndr, happy) %>% 
  arrange(desc(agea)) %>% 
  rename(
    internet_use = netusoft,
    age = agea, 
    education = eisced, 
    female = gndr) %>% 
  mutate( 
    internet_use = case_when( 
      internet_use > 5 ~ NA_real_, 
      TRUE ~ internet_use), 
    age = case_when(
      age == 999 ~ NA_real_,
      TRUE ~ age), 
    education = case_when(
      education %in% c(55, 77, 88, 99) ~ NA_real_,
      TRUE ~ education), 
    female = case_when(
      female == 1 ~ 0, 
      female == 2 ~ 1, 
      female == 9 ~ NA_real_, 
      TRUE ~ female),
    happy = case_when(
      happy %in% c(77, 88, 99) ~ NA_real_,
      TRUE ~ happy)
    ) %>%
  arrange(age)

#Checking it
head(d1)

##   year cntry internet_use age education female happy
## 1 2020    BE            2  40         5      1     3
## 2 2020    BE            2  40         6      0    10
## 3 2020    BE            4  40         1      0     9
## 4 2020    BE            2  40         5      0     6
## 5 2020    BE            3  40         6      1     9
## 6 2020    BE            2  40         6      0     4

When you print the dataset, you see that now there are some values named “NA”. That stands for not available.

If we want to delete NAs, we can use the tidyverse way by simply piping to drop_na()
The base R way would be to put the dataset name na.omit().

We assign both to a new data frame. We will see in the result that all rows are deleted which include NAs. This is one of the reasons it is important to cut down to variables we only need, so that only incomplete observations of our variables we need are deleted. So always remember, first transforming, than dropping.

#Checking it
head(d1)

##   year cntry internet_use age education female happy
## 1 2020    BE            2  40         5      1     3
## 2 2020    BE            2  40         6      0    10
## 3 2020    BE            4  40         1      0     9
## 4 2020    BE            2  40         5      0     6
## 5 2020    BE            3  40         6      1     9
## 6 2020    BE            2  40         6      0     4

#dropping NAs 
d2 <- d1 %>% 
  drop_na() 

#Checking if there are NAs
colSums(is.na(d2))

##         year        cntry internet_use          age 
##            0            0            0            0 
##    education       female        happy 
##            0            0            0

#dropping NAs
d3 <- na.omit(d1) 

#Checking if there are NAs
colSums(is.na(d2))

##         year        cntry internet_use          age 
##            0            0            0            0 
##    education       female        happy 
##            0            0            0

We see that there are no missing values left in our dataset, and our number of observations are reduced.

Now we have all ingredients to make our dataset. Do you remember our research question? We want to find out if people, who tend to not trust science are less willing to get vaccinated. We want to do that for all people over 40.

2.4.7 The `group_by()` and `summarize()` functions

Two of the most useful commands in R for summary statistics are group_by() and summarize().

group_by() helps us to, when we have categorical variables with several observations and we want to calculate a metric e.g. for this group. The dataset we have loaded, the ESS for example asked 9000 respondents about their level of happiness. Image you are interested in the average level of happiness of men and women. Here the group_by() functions defines the groups we want to aggregate e.g. gender.

2.4.7.1 With one grouping variable and one metric

summarize() defines the metric we want to search. For example we want to calculate the mean of the level of happiness of men and women. We also could just calculate the median for example. Let us have a look:

First, we call the group_by() function and define the group we want to aggregate, thus the group we are interested in. In our example, we want to aggregate based on the sexes, therefore we need the to define that. Before that we have to delete NAs or transform the variable to the categories we are interested in, therefore we first call mutate() and transform the variable.
Second, we call summarize() and define the name of the new column, let us call it average_happiness() and then we call the metric of the variable we are interested in. In our example, we were interested in the average happiness, so we have to call mean() and the happy variable:

#group_by and summarize
d1 <- ess %>% 
  mutate(
    gndr_fac = as.factor(gndr),
    female = case_when(
    gndr_fac == 1 ~ "Male", 
    gndr_fac == 2 ~ "Female",
    gndr_fac == 9 ~ NA_character_
  )) %>%
  drop_na() %>%
  group_by(female) %>% 
  dplyr::summarize(average_happiness = mean(happy))

#Checking it
head(d1)

## # A tibble: 2 × 2
##   female average_happiness
##   <chr>              <dbl>
## 1 Female              6.54
## 2 Male                7.17

Et voilà, we get a dataset with two observations, because we have only two groups. The second row is the average_happiness row we defined in the summarize() function.

2.4.7.2 With more grouping variables and metrics

The group_by() and summarize() functions are of course way more flexible, for example, we can define more groups. What about that, you are interested in the level of happiness of females, and males in the countries conducted by the ESS. You just put the countries and then the gender variable in the group_by(). Further we might be interested in more metrics, no problem, let us just define more columns with summarize(). Again, you have to first clean the data by transforming the class and deleting missing values.

#grouping and summarize
d1 <- ess %>% 
  mutate(
    country = cntry,
    gndr_fac = as.factor(gndr),
    female = case_when(
    gndr_fac == 1 ~ "Male", 
    gndr_fac == 2 ~ "Female",
    gndr_fac %in% c(77, 88, 99) ~ NA_character_),
    age = case_when(
      agea == 999 ~ NA_real_,
      TRUE ~ agea)
    ) %>%
  drop_na() %>%
  group_by(country, female) %>% 
  dplyr::summarize(average_happiness = mean(happy), 
            median_happiness = median(happy), 
            average_age = mean(age), 
            meadian_age = median(age)
          )

## `summarise()` has regrouped the output.
## ℹ Summaries were computed grouped by country and female.
## ℹ Output is grouped by country.
## ℹ Use `summarise(.groups = "drop_last")` to silence this
##   message.
## ℹ Use `summarise(.by = c(country, female))` for per-operation
##   grouping (`?dplyr::dplyr_by`) instead.

#Check it out
head(d1)

## # A tibble: 6 × 6
## # Groups:   country [3]
##   country female average_happiness median_happiness
##   <chr>   <chr>              <dbl>            <dbl>
## 1 BE      Female              6.14              6  
## 2 BE      Male                7.79              6  
## 3 BG      Female              6.26              6  
## 4 BG      Male                8.69              6  
## 5 CH      Female              7.17              5  
## 6 CH      Male                5.72              5.5
## # ℹ 2 more variables: average_age <dbl>, meadian_age <dbl>

2.5 Merging Datasets

2.5.1 Introduction to merging with `dplyr` and preparing data

Sometimes it could be the case that you need variables, which are not in one dataset a priori available, but in another dataset. For this case you load both datasets and merge them together. This only works if there is a similiar data structure, so know your data !

As an example, I will show how to do that with World Bank Data. From this data we can gather nearly all important economic indicators for countries since the 1970s. But mostly we need to merge them to datasets we are interested in. We will merge the World Bank Data with the ESS data. So we can analyze variables, which were not collected in the same dataset.

There are several ways of getting World Bank Data, but I will show you the most efficient. There is the package WDI with which you can get data through an API (Application Programming Interface). Long story short, we do not need to download anything and get the data directly with code:

First we define, which countries should be included:

countries <- c("BE", "BG", "CH", "EE", "FR","GB")

Afterwards we define, which variables we want. You do that by using the official indicator, thus the variable you want. You can find the indicators on the website of the world bank data. Click on the variables you want, then click on the details, and there you find the indicator. I will use GDP per capita, Fuel exports, CO2 emissions (kt).

indicator = c("NY.GDP.PCAP.CD", "TX.VAL.FUEL.ZS.UN", "EN.ATM.CO2E.KT")

Now we are ready to use the API.

To do so we call the WDI function.
We define the argument country to only get countries we are interested in.
We also define the indicators.
Lastly, with the “start =” argument we define the starting year, so data which goes back to that date is loaded and with “end =” is analagos to define where the time should stop. Thus, both arguments define the time span we want to inspect

#wb <- WDI( 
#country = countries, #We include our countries 
#indicator = indicator, #We include our variables 
#start = 2020, #start date 
#end = 2020) #end date 

#This takes some time, especially if you have more countries, more indicators and a longer time span.

#Checking it
#head(wb)

#Simulating the data 
wb <- data.frame(
  iso2c = c("BE", "BG", "CH", "EE", "FR", "GB"), 
  NY.GDP.PCAP.CD = c(45587.97, 10148.34, 85897.78, 23565.18, 39179.74, 40217.01), 
  TX.VAL.FUEL.ZS.UN = c(5.021, 4.644, 0.6111, 4.863, 1.886, 7.062), 
  EN.ATM.CO2E.KT = c(85364.10, 34138.10, 34916.10, 7097.52, 267154.70, 308650.30)
)

#Checking it
head(wb)

##   iso2c NY.GDP.PCAP.CD TX.VAL.FUEL.ZS.UN EN.ATM.CO2E.KT
## 1    BE       45587.97            5.0210       85364.10
## 2    BG       10148.34            4.6440       34138.10
## 3    CH       85897.78            0.6111       34916.10
## 4    EE       23565.18            4.8630        7097.52
## 5    FR       39179.74            1.8860      267154.70
## 6    GB       40217.01            7.0620      308650.30

Let us transform the dataset (Only variables we need, arranging it alphabetically, renaming it and rounding one variable to make the numbers more intuitive):

#Cleaning the wb data
wb <- wb %>%
  select(iso2c, NY.GDP.PCAP.CD, TX.VAL.FUEL.ZS.UN, EN.ATM.CO2E.KT) %>%
  arrange(iso2c) %>%
  rename(gdp_per_cap = NY.GDP.PCAP.CD,
         fuel_exp = TX.VAL.FUEL.ZS.UN,
         co2 = EN.ATM.CO2E.KT
         ) %>% 
  mutate(fuel_exp = round(fuel_exp, 2))

#Checking it
head(wb)

##   iso2c gdp_per_cap fuel_exp       co2
## 1    BE    45587.97     5.02  85364.10
## 2    BG    10148.34     4.64  34138.10
## 3    CH    85897.78     0.61  34916.10
## 4    EE    23565.18     4.86   7097.52
## 5    FR    39179.74     1.89 267154.70
## 6    GB    40217.01     7.06 308650.30

Now, we cut down and prepare our ESS data by selecting the countries we are interested in, renaming the country variable (I’ll explain later why), we group by the country (iso2c) and the year (year) to get the average happiness by country. Lastly, we round the value to get only two decimals.

#preparing ess
d1 <- ess %>%
  filter(cntry == c("BE", "BG", "CZ", "EE", "FI")) %>%
  rename(iso2c = cntry) %>%
  group_by(iso2c, year) %>% 
  summarise(happy_agg = round(mean(happy), 2))
  
#Checking it
head(d1)

## # A tibble: 5 × 3
## # Groups:   iso2c [5]
##   iso2c  year happy_agg
##   <chr> <dbl>     <dbl>
## 1 BE     2020      7.53
## 2 BG     2020      6.41
## 3 CZ     2020      7.08
## 4 EE     2020      8.52
## 5 FI     2020      7.51

2.5.2 `left_join()` and `right_join()` with one identifier

To merge data there are important functions from the dplyr package: The left_join() and the right_join() function. Since they are a bit complicated to understand, we will go through them and in the end, you can choose the one you prefer.

left_join(): You want to keep all observations in the first table, including matching observations in the second table. You merging the data from the right table to left table and get a datset with the same number of rows as the left table:

right_join(): You want to keep all observations in the second table, including matching observations in the first table. You join from the left table to the right table, this time the table takes on the number of observations of the table which is right joined.

Well, in the end of the day it is a matter of programming socialisation and taste, which one do you prefer. I will show you both.

To merge two datasets, you need at least one common variable. One variable will be your unique identifier. Since every country is unique in the dataset that is our unique identifier i.e. the variable we give R to tell him how to merge the datasets:

First, we define a new object, where we will save the dataset called merged_data.
Second, we call left_join()
Then we set our left table and our right table.
We define the by = argument and put the unique identifier in quotation marks, meaning the variable name

#left_join
merged_data <- left_join(d1, wb, 
                         by = "iso2c")

#Checking it
head(merged_data)

For right_join() you do the exact same, but remember you get a different result.

#right_join
merged_data2 <- right_join(d1, wb, 
                           by = "iso2c")

#Checking it
head(merged_data2)

## # A tibble: 6 × 6
## # Groups:   iso2c [6]
##   iso2c  year happy_agg gdp_per_cap fuel_exp     co2
##   <chr> <dbl>     <dbl>       <dbl>    <dbl>   <dbl>
## 1 BE     2020      7.53      45588.     5.02  85364.
## 2 BG     2020      6.41      10148.     4.64  34138.
## 3 EE     2020      8.52      23565.     4.86   7098.
## 4 CH       NA     NA         85898.     0.61  34916.
## 5 FR       NA     NA         39180.     1.89 267155.
## 6 GB       NA     NA         40217.     7.06 308650.

2.5.3 `left_join()` and `right_join()` with two identifiers

Sometimes, you have multiple dimensions. For example, what if we also include the year? Then every country-year observation is our unique identifier. Why? Because one observation was collected in country X to time point Y. That is why you should always know your data and your research goal, because accordingly you have to write your code.

Let us get again World Bank Data and clean it:

#Getting the Data
#wb <- WDI( 
#  country = c("BE", "BG"), #We include our countries 
#  indicator = indicators, #We include our variables 
#  start = 2019, #start date 
#  end = 2020) #end date 

#Simulating the data 
wb <- data.frame(
  iso3c = c("BEL", "BEL", "BGR", "BGR"),
  iso2c = c("BE", "BE","BG", "BG"), 
  year = c(2019, 2020, 2019, 2020),
  NY.GDP.PCAP.CD = c(46641.72, 45587.97, 9874.336, 10148.34), 
  TX.VAL.FUEL.ZS.UN = c(7.38, 5.02, 9.53, 4.64), 
  EN.ATM.CO2E.KT = c(92989.4, 85364.10, 39159.9, 34138.10)
)

#Cleaning the Data
wb <- wb %>% 
  select(-iso3c) %>%
  arrange(iso2c) %>%
  rename(gdp_per_cap = NY.GDP.PCAP.CD,
         fuel_exp = TX.VAL.FUEL.ZS.UN,
         co2 = EN.ATM.CO2E.KT
         ) %>% 
  mutate(fuel_exp = round(fuel_exp, 2))

#Checking the Data
head(wb)

##   iso2c year gdp_per_cap fuel_exp     co2
## 1    BE 2019   46641.720     7.38 92989.4
## 2    BE 2020   45587.970     5.02 85364.1
## 3    BG 2019    9874.336     9.53 39159.9
## 4    BG 2020   10148.340     4.64 34138.1

Now we just simulate some data we want to merge:

#Getting the Data
d1 <- data.frame(
  iso2c = c("BE", "BE", "BG", "BG", "CZ", "CZ"), 
  year = c(2019, 2020, 2019, 2020, 2019, 2020), 
  happy_agg = c(5.95, 6.76, 6.56, 7.54, 6.27, 6.88)
)

#Checking the Data
head(d1)

##   iso2c year happy_agg
## 1    BE 2019      5.95
## 2    BE 2020      6.76
## 3    BG 2019      6.56
## 4    BG 2020      7.54
## 5    CZ 2019      6.27
## 6    CZ 2020      6.88

How does it look like when we have two variables and the combination out of those is our unique identifier?

left_join() with two identifiers:

As you can see the dataset again takes on the number of observations of our left table
To implement it, we have to change by argument. We define a vector, where we put our two identifiers in quotation marks:

#Merging the Data with left_join()
merged_data3 <- left_join(d1, wb,
                          by = c("iso2c", "year"))
#Checking it
head(merged_data3)

##   iso2c year happy_agg gdp_per_cap fuel_exp     co2
## 1    BE 2019      5.95   46641.720     7.38 92989.4
## 2    BE 2020      6.76   45587.970     5.02 85364.1
## 3    BG 2019      6.56    9874.336     9.53 39159.9
## 4    BG 2020      7.54   10148.340     4.64 34138.1
## 5    CZ 2019      6.27          NA       NA      NA
## 6    CZ 2020      6.88          NA       NA      NA

right_join() with two identifiers:

Again we do the same with right_join():

merged_data4 <- right_join(d1, wb,
                          by = c("iso2c", "year"))
head(merged_data4)

##   iso2c year happy_agg gdp_per_cap fuel_exp     co2
## 1    BE 2019      5.95   46641.720     7.38 92989.4
## 2    BE 2020      6.76   45587.970     5.02 85364.1
## 3    BG 2019      6.56    9874.336     9.53 39159.9
## 4    BG 2020      7.54   10148.340     4.64 34138.1

Note: However, this chapter only touched the basics and for merging alone there are several further commands, like inner_join(), anti_join(), semi_join()…etc. But when you encounter problems with the two functions I showed you will run into them eventually.

2.6 Outlook Data Manipulation

This Chapter introduced you to the basic functions of the dplyr package. You are now able to transform variables according to your needs. Further, you learned how to use pipes to work efficient code. Loading data, transforming data, and preparing datasets for the analysis. There are a lot of techniques, and in the end of the day data wrangling is the most extensive part, because every analysis is individual and requires an individual preparation. The more individual the analysis, the more individual the preparation.

I recommend the standard book for data science in R in this chapter, since it has a strong emphasis on data manipulation: “R for Data Science” by Hadley Wickham & Garrett Grolemund.

2.7 Exercises Data Manipulation

2.7.1 Exercise 1: Let’s wrangle kid

You are interested in discrimination and the perception of the judicial. More specifically, you want to know if people, who fell discriminated evaluate courts differently. Below you see a table with all variables you want to include in your analysis:

Variable	Description	Scales
idnt	Respondent’s identification number	unique number from 1-9000
year	The year when the survey was conducted	only 2020
cntry	Country	BE, BG, CH, CZ, EE, FI, FR,GB, GR, HR, HU, IE, IS, IT, LT,NL, NO, PT, SI, SK
agea	Age of the Respondent, calculated	Number of Age = 15-90 999 = Not available
gndr	Gender	1 = Male; 2 = Female; 9 = No answer
happy	How happy are you	0 (Extremly unhappy) - 10 (Extremly happy); 77 = Refusal; 88 = Don’t Know; 99 = No answer
eisced	Highest level of education, ES - ISCED	0 = Not possible to harmonise into ES-ISCED; 1 (ES-ISCED I , less than lower secondary) - 7 (ES-ISCED V2, higher tertiary education, => MA level; 55 = Other; 77 = Refusal; 88 = Don’t know; 99 = No answer
netusoft	Internet use, how often	1 (Never) - 5 (Every day); 7 = Refusal; 8 = Don’t know; 9 = No answer
trstprl	Most people can be trusted or you can’t be too careful	0 (You can’t be too careful) - 10 (Most people can be trusted); 77 = Refusal; 88 = Don’t Know; 99 = No answer
lrscale	Left-Right Placement	0 (Left) - 10 (Right); 77 = Refusal; 88 = Don’t know; 99 = No answer

Wrangle the data, and assign it to an object called ess.
Select the variables you need
Filter for Austria, Belgium, Denmark, Georgia, Iceland and the Russian Federation
Have a look at the codebook and code all irrelevant values as missing. If you have binary variables recode them from 1, 2 to 0 to 1
You want to build an extremism variable: You do so by subtracting 5 from the from the variable and squaring it afterwards. Call it extremism
Rename the variables to more intuitive names, don’t forget to name binary varaibles after the category which is on 1
drop all missing values
Check out your new dataset

#Assign new data
ess2 <-

2.7.2 Exercise 2: Merging Datasets

The gapminder package in R loads automatically the gapminder dataset. The gapminder project is an independent educational non-profit fighting global misconceptions, check out their website: https://www.gapminder.org/ The gapminder dataset is already loaded.

Get an overview of the gapminder dataset. There are different ways to do so, you can choose by yourself.

#Getting an overview

Load World Bank Data from 1972 to 2007 and load the variable “Exports and Goods (% of GDP)”.
Merge the World Bank data to the gapminder data, so a dataset evolves with the numbess2er of observations of the gapminder data.

#merge world bank and gapminder dataset

d. Clean the data by dropping all missing values

#Cleaning Process

Chapter 2 Data Manipulation

2.1 Packages

2.1.1 The tidyverse package

2.2 Working with packages

2.2.1 Installing packages

2.2.2 Loading your packages

2.3 The Data we will work with: The European Social Survey (ESS)

2.3.1 Load the data

2.3.1.1 How to load data

File

File Extension

Package

Command

2.3.1.2 Working Directories and R-Projects

2.3.2 One last thing: Pipelines

2.4 Let’s wrangle the data: The dplyr package

2.4.1 The filter() command

2.4.1.1 Filtering for only one condition

2.4.1.2 Filtering for multiple condition

2.4.2 The select() function:

2.4.2.1 Selecting and deleting single Rows

2.4.2.2 Deleting Rows

2.4.2.3 Combining select() with the filter() function

2.4.3 The arrange() function