Chapter 8 Solutions Exercises
8.1 Chapter 1: Fundamentals
8.1.1 Exercise 1: Making your first Vector
Create a vector called my_vector with the values 1,2,3 and check is class.
8.1.2 Exercise 2: Making your first matrix
Create a Matrix called student. This should contain information about the name, age and major. Make three vectors with three entries and bind them together to a the matrix student. Print the matrix. Choose the three names, age, and major by yourself:
8.1.3 Exercise 3: ifelse function
Write an ifelse statement that checks if a given number is positive or negative. If the number is positive or 0, print “Number is positive”, otherwise print “Number is negative”. Feel free to decide if you want to use the ifelse function or the ifelse condition.
8.1.4 Exercise 4: ifelse ladders
Write an if-else ladder that categorizes a student’s grade based on their score. The grading criteria are as follows:
Score >= 90: “A” Score >= 80 and < 90: “B” Score >= 70 and < 80: “C” Score >= 60 and < 70: “D” Score < 60: “F”.
#Defining the vector score
Score <- 90
#Defing ifelse ladder with ifelse() function
ifelse(Score >= 90, "A",
ifelse(Score >= 80 & Score < 90, "B",
ifelse(Score >= 70 & Score < 80, "C",
ifelse(Score >= 60 & Score < 70, "D",
ifelse(Score < 60, "F")))))
#Defining it with a ifelse condition
if (score >= 90) {
print("A")
} else if (score >= 80 & score < 90) {
print("B")
} else if (score >= 70 & score < 80) {
print("C")
} else if (score >= 60 & score < 70) {
print("D")
} else if (score < 60) {
print("F")
}8.2 Chapter 2: Data Manipulation
8.2.1 Exercise 1: Let’s wrangle kid
You are interested in discrimination and the perception of the judicial. More specifically, you want to know if people, who fell discriminated evaluate courts differently. Below you see a table with all variables you want to include in your analysis:
| Variable | Description | Scales |
|---|---|---|
| idnt | Respondent’s identification number | unique number from 1-9000 |
| year | The year when the survey was conducted | only 2020 |
| cntry | Country | BE, BG, CH, CZ, EE, FI, FR,GB, GR, HR, HU, IE, IS, IT, LT,NL, NO, PT, SI, SK |
| agea | Age of the Respondent, calculated | Number of Age = 15-90 999 = Not available |
| gndr | Gender | 1 = Male; 2 = Female; 9 = No answer |
| happy | How happy are you | 0 (Extremly unhappy) - 10 (Extremly happy); 77 = Refusal; 88 = Don’t Know; 99 = No answer |
| eisced | Highest level of education, ES - ISCED | 0 = Not possible to harmonise into ES-ISCED; 1 (ES-ISCED I , less than lower secondary) - 7 (ES-ISCED V2, higher tertiary education, => MA level; 55 = Other; 77 = Refusal; 88 = Don’t know; 99 = No answer |
| netusoft | Internet use, how often | 1 (Never) - 5 (Every day); 7 = Refusal; 8 = Don’t know; 9 = No answer |
| trstprl | Most people can be trusted or you can’t be too careful | 0 (You can’t be too careful) - 10 (Most people can be trusted); 77 = Refusal; 88 = Don’t Know; 99 = No answer |
| lrscale | Left-Right Placement | 0 (Left) - 10 (Right); 77 = Refusal; 88 = Don’t know; 99 = No answer |
- Wrangle the data, and assign it to an object called ess.
- Select the variables you need
- Filter for Austria, Belgium, Denmark, Georgia, Iceland and the Russian Federation
- Have a look at the codebook and code all irrelevant values as missing. If you have binary variables recode them from 1, 2 to 0 to 1
- You want to build an extremism variable: You do so by subtracting 5 from the from the variable and squaring it afterwards. Call it extremism
- Rename the variables to more intuitive names, don’t forget to name binary varaibles after the category which is on 1
- drop all missing values
- Check out your new dataset
ess <- d1 %>%
select(cntry, dscrgrp, agea, gndr, eisced, lrscale) %>% #a
filter(cntry %in% c("AT", "BE", "DK", "GE", "IS","RU")) %>% #b
mutate(dscrgrp = case_when( #c
dscrgrp == 1 ~ 0,
dscrgrp == 2 ~ 1,
dscrgrp %in% c(7, 8, 9) ~ NA_real_,
TRUE ~ dscrgrp),
agea = case_when(
agea == 999 ~ NA_real_,
TRUE ~ agea),
gndr = case_when(
gndr == 1 ~ 0,
gndr == 2 ~ 1,
gndr == 9 ~ NA_real_
),
eisced = case_when(
eisced %in% c(55, 77, 88, 99) ~ NA_real_,
TRUE ~ eisced),
lrscale = case_when(
lrscale %in% c(77, 88, 99) ~ NA_real_,
TRUE ~ lrscale) #d, you could do this step also in a separate mutate function if you think that is more intuitive
) %>%
mutate(extremism = (lrscale - 5)^2) %>%
rename(discriminated = dscrgrp,
court = cttresa,
age = agea,
female = gndr,
education = eisced,
lrscale = lrscale,
extremism = extremism
) %>%
drop_na()
#Checking the dataset
head(ess)8.3 Chapter 3: Data Visualisation
In this exercise Section, we will work with the babynames and the iris package. This is a classic built-in package in R, which contains data from the Ronald Fisher’s 1936 Study “The use of multiple measurements in taxonomic problems”. It contains three plant species and four measured features for each species. Let us get an overview of the package:
## Sepal.Length Sepal.Width Petal.Length
## Min. :4.300 Min. :2.000 Min. :1.000
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600
## Median :5.800 Median :3.000 Median :4.350
## Mean :5.843 Mean :3.057 Mean :3.758
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100
## Max. :7.900 Max. :4.400 Max. :6.900
## Petal.Width Species
## Min. :0.100 setosa :50
## 1st Qu.:0.300 versicolor:50
## Median :1.300 virginica :50
## Mean :1.199
## 3rd Qu.:1.800
## Max. :2.500
8.3.1 Exercise 1: Distributions
a. Plot a Chart, which shows the distribution of Sepal.Length over the setosa Species. Choose the type of distribution chart for yourself. HINT: Prepare the data first and then plot it.
#Filtering for setosa
d1 <- iris %>%
filter(Species == "setosa")
#Histogram
ggplot(iris, aes(x = Sepal.Length)) +
geom_histogram()## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
# EDIT: You could also do everything in one step as shown below
#Histogram
#iris %>%
#filter(Species == "setosa") %>%
#ggplot(aes(x = Sepal.Length)) +
#geom_histogram()
#Density Plot
#iris %>%
#filter(Species == "setosa") %>%
#ggplot(aes(x = Sepal.Length)) +
#geom_density()
#Boxplot
#iris %>%
#filter(Species == "setosa") %>%
#ggplot(aes(x = Sepal.Length)) +
#geom_boxplot()b. Now I want you to add the two other Species to the Plot. Make Sure, that every Species has a unique color.
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
#Ridgeline Plot
ggplot(iris, aes(x = Sepal.Length, y = Species, fill = Species)) +
geom_density_ridges()## Picking joint bandwidth of 0.181
c. Make a nice Plot! Give the Plot a meaningful title, meaningful labels for the x-axis and the y-axis and play around with the colors.
#Histogram
ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
geom_histogram() +
labs(
x = "Species",
y = "Sepal Length",
title = "Distribution of Sepal Length across Species"
) +
theme_bw() +
theme(
legend.position = "none"
) +
scale_fill_brewer(palette = "Dark2")## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
#Density Plot
ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
geom_density() +
labs(
x = "Species",
y = "Sepal Length",
title = "Distribution of Sepal Length across Species"
) +
theme_bw() +
theme(
legend.position = "none"
) +
scale_fill_brewer(palette = "Dark2")#Boxplot
ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
geom_boxplot() +
labs(
x = "Species",
y = "Sepal Length",
title = "Distribution of Sepal Length across Species"
) +
theme_bw() +
theme(
legend.position = "none"
) +
scale_fill_brewer(palette = "Dark2")#Ridgeline Plot
ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
geom_violin() +
labs(
x = "Species",
y = "Sepal Length",
title = "Distribution of Sepal Length across Species"
) +
theme_bw() +
theme(
legend.position = "none"
) +
scale_fill_brewer(palette = "Dark2")#Ridgeline Plot
ggplot(iris, aes(x = Sepal.Length, y = Species, fill = Species)) +
geom_density_ridges() +
labs(
x = "Species",
y = "Sepal Length",
title = "Distribution of Sepal Length across Species"
) +
theme_bw() +
theme(
legend.position = "none"
) +
scale_fill_brewer(palette = "Dark2")## Picking joint bandwidth of 0.181
d. Interpret the Plot!
8.3.2 Exercise 2: Rankings
a. Calculate the average Petal.Length and plot it for every Species in a nice Barplot. HINT: You have to prepare the data again before you plot it. Decide for yourself if you want to do it vertically or horizontally.
#Prepare the data
d1 <- iris %>%
group_by(Species) %>%
dplyr::summarize(PL_average = mean(Petal.Length))
#Check it
head(d1) ## # A tibble: 3 × 2
## Species PL_average
## <fct> <dbl>
## 1 setosa 1.46
## 2 versicolor 4.26
## 3 virginica 5.55
#Plotting a horizontal barplot
ggplot(d1, aes(x = Species, y = PL_average, fill = Species)) +
geom_bar(stat = "identity")#Plotting a vertical barplot
ggplot(d1, aes(x = PL_average, y = Species, fill = Species)) +
geom_bar(stat = "identity") b. Make a nice Plot! Give the Plot a meaningful title, meaningful labels for the x-axis and the y-axis and play around with the colors.
#Plotting a horizontal barplot
ggplot(d1, aes(x = Species, y = PL_average, fill = Species)) +
geom_bar(stat = "identity") +
labs(
x = "Species",
y = "Average Sepal Length",
title = "Average Sepal Length across Species"
) +
theme_bw() +
theme(
legend.position = "none"
) +
scale_fill_brewer(palette = "Set1")#Plotting a vertical barplot
ggplot(d1, aes(x = PL_average, y = Species, fill = Species)) +
geom_bar(stat = "identity") +
labs(
x = "Average Sepal Length",
y = "Species",
title = "Average Sepal Length across Species"
) +
theme_bw() +
theme(
legend.position = "none"
) +
scale_fill_brewer(palette = "Accent")8.3.3 Exercise 3: Correlation
a. Make a scatter plot where you plot Sepal.length on the x-axis and Sepal.width on the y-axis. Make the plot for the species virginica
#Preparing the data
d1 <- iris %>%
filter(Species == "virginica")
#Making the plot
ggplot(d1, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point()b. Now I want you to add the species versicolor to the plot. The dots of this species should have a different color AND a different form.
#Preparing the Data
d1 <- iris %>%
filter(Species %in% c("virginica", "versicolor"))
#Making the Plot
ggplot(d1, aes(x = Sepal.Length, y = Sepal.Width,
shape = Species, color = Species)) +
geom_point() c. Make a nice plot! Add a theme, labels, and a nice title, increase the size of the forms and make play around with the colors.
#Making the Plot
ggplot(d1, aes(x = Sepal.Length, y = Sepal.Width,
shape = Species, color = Species)) +
geom_point() +
labs(
title = "Relationship between Sepal Length and Sepal Width",
x = "Sepal Length",
y = "Sepal Width"
) +
scale_color_brewer(palette = "Set1") +
theme_classic()8.4 Chapter 4: Exploratory Data Analysis
8.4.1 Exercise 1: Standard Descriptive Statistics
In this exercise we will work with the built-in iris package in R:
a. Calculate the mode, mean and the median for the iris$Sepal.Length variable
## [1] 5.843333
## [1] 5.8
#Calculating the mode
uniq_vals <- unique(iris$Sepal.Length)
freqs <- tabulate(iris$Sepal.Length)
uniq_vals[which.max(freqs)]## [1] 5
# You could also define a function for the mode
#mode <- function (x) {
# uniq_vals <- unique(x, na.rm = TRUE)
# freqs <- tabulate(match(x, uniq_vals))
# uniq_vals[which.max(freqs)]
#}
#mode(iris$Sepal.Length)b. Calculate the interquartile range, variance and the standard deviation for iris$Sepal.Length
## [1] 1.3
## [1] 0.6856935
## [1] 0.8280661
c. Calculate all five measures at once by using a function that does so (Choose by yourself, which one you want to use)
## vars n mean sd median trimmed mad min max range skew
## X1 1 150 5.84 0.83 5.8 5.81 1.04 4.3 7.9 3.6 0.31
## kurtosis se
## X1 -0.61 0.07
| Name | iris$Sepal.Length |
| Number of rows | 150 |
| Number of columns | 1 |
| _______________________ | |
| Column type frequency: | |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| data | 0 | 1 | 5.84 | 0.83 | 4.3 | 5.1 | 5.8 | 6.4 | 7.9 | ▆▇▇▅▂ |
## Descriptive Statistics
## iris$Sepal.Length
## N: 150
##
## Sepal.Length
## ----------------- --------------
## Mean 5.84
## Std.Dev 0.83
## Min 4.30
## Q1 5.10
## Median 5.80
## Q3 6.40
## Max 7.90
## MAD 1.04
## IQR 1.30
## CV 0.14
## Skewness 0.31
## SE.Skewness 0.20
## Kurtosis -0.61
## N.Valid 150.00
## N 150.00
## Pct.Valid 100.00
8.4.2 Exercise 2: Contingency Tables and Correlations
a. Make a Contingency Table for esoph$agegp and esoph$alcgp
##
## 0-39g/day 40-79 80-119 120+
## 25-34 4 4 3 4
## 35-44 4 4 4 3
## 45-54 4 4 4 4
## 55-64 4 4 4 4
## 65-74 4 3 4 4
## 75+ 3 4 2 2
## Cross-Tabulation, Row Proportions
## agegp * alcgp
## Data Frame: esoph
##
## ------- ------- ------------ ------------ ------------ ------------ -------------
## alcgp 0-39g/day 40-79 80-119 120+ Total
## agegp
## 25-34 4 (26.7%) 4 (26.7%) 3 (20.0%) 4 (26.7%) 15 (100.0%)
## 35-44 4 (26.7%) 4 (26.7%) 4 (26.7%) 3 (20.0%) 15 (100.0%)
## 45-54 4 (25.0%) 4 (25.0%) 4 (25.0%) 4 (25.0%) 16 (100.0%)
## 55-64 4 (25.0%) 4 (25.0%) 4 (25.0%) 4 (25.0%) 16 (100.0%)
## 65-74 4 (26.7%) 3 (20.0%) 4 (26.7%) 4 (26.7%) 15 (100.0%)
## 75+ 3 (27.3%) 4 (36.4%) 2 (18.2%) 2 (18.2%) 11 (100.0%)
## Total 23 (26.1%) 23 (26.1%) 21 (23.9%) 21 (23.9%) 88 (100.0%)
## ------- ------- ------------ ------------ ------------ ------------ -------------
|
alcgp
|
Total | ||||
|---|---|---|---|---|---|
| 0-39g/day | 40-79 | 80-119 | 120+ | ||
| agegp | |||||
| 25-34 | 4 | 4 | 3 | 4 | 15 |
| 35-44 | 4 | 4 | 4 | 3 | 15 |
| 45-54 | 4 | 4 | 4 | 4 | 16 |
| 55-64 | 4 | 4 | 4 | 4 | 16 |
| 65-74 | 4 | 3 | 4 | 4 | 15 |
| 75+ | 3 | 4 | 2 | 2 | 11 |
| Total | 23 | 23 | 21 | 21 | 88 |
b. Cut down the iris dataset to Sepal.Length, Sepal.Width, Petal.Length and Petal.Width and save it in an object called iris_numeric.
c. Make a correlation matrix with iris_numeric
d. Make the correlation matrix pretty
# With Corrplot
corrplot::corrplot(cor_iris, method = "circle", type = "upper",
tl.col = "black", tl.srt = 45)# With DataExplorer
DataExplorer::plot_correlation(
iris_numeric,
ggtheme = theme_minimal(base_size = 14),
title = "Correlation Matrix of Iris Numeric Variables"
)8.4.3 Exercise 3: Working with packages
a. Use a function to get an overview of the dataset mtcars
| Name | mtcars |
| Number of rows | 32 |
| Number of columns | 11 |
| _______________________ | |
| Column type frequency: | |
| numeric | 11 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| mpg | 0 | 1 | 20.09 | 6.03 | 10.40 | 15.43 | 19.20 | 22.80 | 33.90 | ▃▇▅▁▂ |
| cyl | 0 | 1 | 6.19 | 1.79 | 4.00 | 4.00 | 6.00 | 8.00 | 8.00 | ▆▁▃▁▇ |
| disp | 0 | 1 | 230.72 | 123.94 | 71.10 | 120.83 | 196.30 | 326.00 | 472.00 | ▇▃▃▃▂ |
| hp | 0 | 1 | 146.69 | 68.56 | 52.00 | 96.50 | 123.00 | 180.00 | 335.00 | ▇▇▆▃▁ |
| drat | 0 | 1 | 3.60 | 0.53 | 2.76 | 3.08 | 3.70 | 3.92 | 4.93 | ▇▃▇▅▁ |
| wt | 0 | 1 | 3.22 | 0.98 | 1.51 | 2.58 | 3.33 | 3.61 | 5.42 | ▃▃▇▁▂ |
| qsec | 0 | 1 | 17.85 | 1.79 | 14.50 | 16.89 | 17.71 | 18.90 | 22.90 | ▃▇▇▂▁ |
| vs | 0 | 1 | 0.44 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▆ |
| am | 0 | 1 | 0.41 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▆ |
| gear | 0 | 1 | 3.69 | 0.74 | 3.00 | 3.00 | 4.00 | 4.00 | 5.00 | ▇▁▆▁▂ |
| carb | 0 | 1 | 2.81 | 1.62 | 1.00 | 2.00 | 2.00 | 4.00 | 8.00 | ▇▂▅▁▁ |
## Data Frame Summary
## mtcars
## Dimensions: 32 x 11
## Duplicates: 0
##
## ----------------------------------------------------------------------------------------------------------
## No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
## ---- ----------- --------------------------- -------------------- ------------------- ---------- ---------
## 1 mpg Mean (sd) : 20.1 (6) 25 distinct values : 32 0
## [numeric] min < med < max: : . (100.0%) (0.0%)
## 10.4 < 19.2 < 33.9 . : :
## IQR (CV) : 7.4 (0.3) : : : .
## : : : : :
##
## 2 cyl Mean (sd) : 6.2 (1.8) 4 : 11 (34.4%) IIIIII 32 0
## [numeric] min < med < max: 6 : 7 (21.9%) IIII (100.0%) (0.0%)
## 4 < 6 < 8 8 : 14 (43.8%) IIIIIIII
## IQR (CV) : 4 (0.3)
##
## 3 disp Mean (sd) : 230.7 (123.9) 27 distinct values : 32 0
## [numeric] min < med < max: . : (100.0%) (0.0%)
## 71.1 < 196.3 < 472 : : : : : :
## IQR (CV) : 205.2 (0.5) : : : : : : .
## : : : . : : : . :
##
## 4 hp Mean (sd) : 146.7 (68.6) 22 distinct values . : 32 0
## [numeric] min < med < max: : : (100.0%) (0.0%)
## 52 < 123 < 335 : : : .
## IQR (CV) : 83.5 (0.5) : : : :
## : : : : . .
##
## 5 drat Mean (sd) : 3.6 (0.5) 22 distinct values : 32 0
## [numeric] min < med < max: : : (100.0%) (0.0%)
## 2.8 < 3.7 < 4.9 : : .
## IQR (CV) : 0.8 (0.1) . : : :
## : : : : .
##
## 6 wt Mean (sd) : 3.2 (1) 29 distinct values : 32 0
## [numeric] min < med < max: : : (100.0%) (0.0%)
## 1.5 < 3.3 < 5.4 : :
## IQR (CV) : 1 (0.3) : : : : : .
## : : : : : . :
##
## 7 qsec Mean (sd) : 17.8 (1.8) 30 distinct values : 32 0
## [numeric] min < med < max: : (100.0%) (0.0%)
## 14.5 < 17.7 < 22.9 : :
## IQR (CV) : 2 (0.1) . : : : :
## : : : : : : : .
##
## 8 vs Min : 0 0 : 18 (56.2%) IIIIIIIIIII 32 0
## [numeric] Mean : 0.4 1 : 14 (43.8%) IIIIIIII (100.0%) (0.0%)
## Max : 1
##
## 9 am Min : 0 0 : 19 (59.4%) IIIIIIIIIII 32 0
## [numeric] Mean : 0.4 1 : 13 (40.6%) IIIIIIII (100.0%) (0.0%)
## Max : 1
##
## 10 gear Mean (sd) : 3.7 (0.7) 3 : 15 (46.9%) IIIIIIIII 32 0
## [numeric] min < med < max: 4 : 12 (37.5%) IIIIIII (100.0%) (0.0%)
## 3 < 4 < 5 5 : 5 (15.6%) III
## IQR (CV) : 1 (0.2)
##
## 11 carb Mean (sd) : 2.8 (1.6) 1 : 7 (21.9%) IIII 32 0
## [numeric] min < med < max: 2 : 10 (31.2%) IIIIII (100.0%) (0.0%)
## 1 < 2 < 8 3 : 3 ( 9.4%) I
## IQR (CV) : 2 (0.6) 4 : 10 (31.2%) IIIIII
## 6 : 1 ( 3.1%)
## 8 : 1 ( 3.1%)
## ----------------------------------------------------------------------------------------------------------
| Characteristic | N = 321 |
|---|---|
| mpg | 19.2 (15.4, 22.8) |
| cyl | |
| 4 | 11 (34%) |
| 6 | 7 (22%) |
| 8 | 14 (44%) |
| disp | 196 (121, 334) |
| hp | 123 (96, 180) |
| drat | 3.70 (3.08, 3.92) |
| wt | 3.33 (2.54, 3.65) |
| qsec | 17.71 (16.89, 18.90) |
| vs | 14 (44%) |
| am | 13 (41%) |
| gear | |
| 3 | 15 (47%) |
| 4 | 12 (38%) |
| 5 | 5 (16%) |
| carb | |
| 1 | 7 (22%) |
| 2 | 10 (31%) |
| 3 | 3 (9.4%) |
| 4 | 10 (31%) |
| 6 | 1 (3.1%) |
| 8 | 1 (3.1%) |
| 1 Median (Q1, Q3); n (%) | |
b. Have a look at the structure of the missing values in mtcars
## # A tibble: 11 × 3
## variable n_miss pct_miss
## <chr> <int> <num>
## 1 mpg 0 0
## 2 cyl 0 0
## 3 disp 0 0
## 4 hp 0 0
## 5 drat 0 0
## 6 wt 0 0
## 7 qsec 0 0
## 8 vs 0 0
## 9 am 0 0
## 10 gear 0 0
## 11 carb 0 0
## rows columns discrete_columns continuous_columns
## 1 32 11 0 11
## all_missing_columns total_missing_values complete_rows
## 1 0 0 32
## total_observations memory_usage
## 1 352 5928
## Descriptions
## 1 Sample size (nrow)
## 2 No. of variables (ncol)
## 3 No. of numeric/interger variables
## 4 No. of factor variables
## 5 No. of text variables
## 6 No. of logical variables
## 7 No. of identifier variables
## 8 No. of date variables
## 9 No. of zero variance variables (uniform)
## 10 %. of variables having complete cases
## 11 %. of variables having >0% and <50% missing cases
## 12 %. of variables having >=50% and <90% missing cases
## 13 %. of variables having >=90% missing cases
## Value
## 1 32
## 2 11
## 3 11
## 4 0
## 5 0
## 6 0
## 7 0
## 8 0
## 9 0
## 10 100% (11)
## 11 0% (0)
## 12 0% (0)
## 13 0% (0)
c. Make an automized EDA report for mtcars!
dlookr::describe(mtcars)
DataExplorer::create_report(mtcars)
8.5 Chapter 5: Data Analysis
To understand linear regression, we do not have to load any complicated data. Let us assume, you are a market analyst and your customer is the production company of a series called “Breaking Thrones”. The production company wants to know how the viewers of the series judge the final episode. You conduct a survey and ask people on how satisfied they were with the season final and some social demographics. Here is your codebook:
| Variable | Description |
|---|---|
| id | The id of the respondent |
| satisfaction | The answer to the question “How satisfied were you with the final episode of Breaking Throne?”, where 0 is completely dissatisfied and 10 completely satisfied |
| age | The age of the respondent |
| female | The Gender of the respondent, where 0 = Male, 1 = Female |
Let us generate the data:
#setting seed for reproduciability
set.seed(123)
#Generating the data
final_BT <- data.frame(
id = c(1:10),
satisfaction = round(rnorm(10, mean = 6, sd = 2.5)),
age = round(rnorm(10, mean = 25, sd = 5)),
female = rbinom(10, 1, 0.5)
)
#Print the Data Frame
print(final_BT)## id satisfaction age female
## 1 1 5 31 0
## 2 2 5 27 0
## 3 3 10 27 0
## 4 4 6 26 0
## 5 5 6 22 0
## 6 6 10 34 0
## 7 7 7 27 0
## 8 8 3 15 0
## 9 9 4 29 0
## 10 10 5 23 1
8.5.1 Exercise 1: Linear Regression with two variables
You want to know if age has an impact on the satisfaction with the last episode. You want to conduct a linear regression.
a. Calculate \(\beta_0\) and \(\beta_1\) by hand
#Calculating Covariance
cov <- sum((final_BT$age - mean(final_BT$age)) * (final_BT$satisfaction - mean(final_BT$satisfaction)))
#Calculating Variance
var <- (sum((final_BT$age - mean(final_BT$age))^2))
#Calculating beta_1
beta_1 <- cov/var
#printing beta_1
print(beta_1)## [1] 0.2466586
#calculating beta 1
beta_0 <- mean(final_BT$satisfaction) - (beta_1 * mean(final_BT$age))
#Print both
print(beta_0)## [1] -0.3377886
## [1] 0.2466586
b. Calculate \(\beta_0\) and \(\beta_1\) automatically with R
##
## Call:
## lm(formula = satisfaction ~ age, data = final_BT)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8153 -1.0820 -0.2053 0.8530 3.6780
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.3378 3.4796 -0.097 0.9251
## age 0.2467 0.1310 1.883 0.0964 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.058 on 8 degrees of freedom
## Multiple R-squared: 0.3072, Adjusted R-squared: 0.2206
## F-statistic: 3.547 on 1 and 8 DF, p-value: 0.0964
c. Interpret all quantities of your result: Standard Error, t-statistic, p-value, confidence intervals and the \(R^2\).
d. Check for influential outliers
#Cooks Distance can be calculated with a built-in function
final_BT$cooks_distance <- cooks.distance(m1)
#Plotting it
ggplot(final_BT, aes(x = age, y = cooks_distance)) +
geom_point(colour = "darkgreen", size = 3, alpha = 0.5) +
labs(y = "Cook's Distance", x = "Independent Variable") +
geom_hline(yintercept = 1, linetype = "dashed") +
theme_bw()8.5.2 Exercise 2: Multivariate Regression
a. Add the variable female to your regression
##
## Call:
## lm(formula = satisfaction ~ age + female, data = final_BT)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8401 -1.1312 -0.0574 0.8001 3.6435
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.1712 3.8481 -0.044 0.966
## age 0.2418 0.1429 1.692 0.134
## female -0.3895 2.3662 -0.165 0.874
##
## Residual standard error: 2.196 on 7 degrees of freedom
## Multiple R-squared: 0.3099, Adjusted R-squared: 0.1127
## F-statistic: 1.571 on 2 and 7 DF, p-value: 0.2731
b. Interpret the Output. What has changed? What stood the same?
c. Make an interaction effect between age and female and interpret it!
##
## Call:
## lm(formula = satisfaction ~ age + female + age:female, data = final_BT)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8401 -1.1312 -0.0574 0.8001 3.6435
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.1712 3.8481 -0.044 0.966
## age 0.2418 0.1429 1.692 0.134
## female -0.3895 2.3662 -0.165 0.874
## age:female NA NA NA NA
##
## Residual standard error: 2.196 on 7 degrees of freedom
## Multiple R-squared: 0.3099, Adjusted R-squared: 0.1127
## F-statistic: 1.571 on 2 and 7 DF, p-value: 0.2731
d. Plot the interaction and make the plot nice
plot_model(m2, type = "int") +
scale_x_continuous(breaks = seq(0,60, 1)) +
labs(title = "Relationship of Age and Gender on Satisfaction with BT.",
x = "Age",
y = "Satisfaction") +
scale_color_manual(
values = c("red", "blue"),
labels = c("Female", "Male")
) +
theme_sjplot() +
theme(legend.position = "bottom",
legend.title = element_blank())## Warning in predict.lm(model, newdata = data_grid, type =
## "response", se.fit = se, : prediction from rank-deficient
## fit; attr(*, "non-estim") has doubtful cases
## Scale for colour is already present.
## Adding another scale for colour, which will replace the
## existing scale.
8.6 Chapter 6: Loops and Functions Exercises
8.6.1 Exercise 1: Writing a loop
Write a for loop that prints the square of each number from 1 to 10
#Assigning an object for a better workflow
number <- 10
#correct loop
for (i in 1:number) {
print(i^2)
}## [1] 1
## [1] 4
## [1] 9
## [1] 16
## [1] 25
## [1] 36
## [1] 49
## [1] 64
## [1] 81
## [1] 100
8.6.3 Exercise 3: The midnight Formula
This is the midnight formula separated in two equations:
\(x_{1,2} = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}\)
Make one function for the midnight formula, so the output are \(x_1\) a d \(x_2\). Test it with a = 2, b = -6, c = -8
Hint: You need two split up the formula into two equations with two outputs.