Chapter 5 Data Analysis
In this chapter you I introduce you to the basic ideas of statistical analysis and hypothesis testing. For me, it is important that you get an intuition about what is going on rather than terrorizing you with complicated math. Although, I cannot and will not leave out central formulas, you will be fine with the maths. There extensively commented and in the end of the day not that hard to understand. You will notice that the programming in this chapter is quite easy compared to before. In the end, you only need formulas, who do everything you need to analyse data, that is also a strength of R. Please concentrate more on the concepts and get an idea what is going on. The goal of this chapter is that you get to know linear regression and how to check if the model fits and the results are robust and significant.
5.1 Linear Regression
5.1.1 Terminology
The classical (bivariate) linear regression can be expressed in the following equation (systematic component):
\[ Y_i = \beta_0 + \beta_1X_i + e_i \]
where
the index i runs over the observations (Respondents, Countries,…), i = 1,…,n
\(Y_i\) is the dependent variable, the variable we want to explain
\(X_i\) is the independent variable or explanatory variable.
\(\beta_0\) is the intercept of the regression line
\(\beta_1\) is the slope of the regression line
\(\epsilon_i\) is the error term, thus how our observed data differs from actual population data (e.g. Measurement Error).
5.1.2 Estimating the Ordinary Least Squares Estimator
To get the idea of linear regression, let us look at an example. To do so, let us simulate some data and plot it.You now should have the data frame df in your environment. It contains a variable X, which is our independent variable. Y is also included, which is your dependent variable. You want to explain Y with your X. Let us plot the variables with a scatterplot:
ggplot(df, aes(x, y)) +
geom_point() +
theme_bw() +
scale_x_continuous(breaks = seq(0, 10, by = 1)) +
scale_y_continuous(breaks = seq(0, 20, by = 2))We can see that there has to be some relationship between both those variables: The higher x gets the higher y gets. Linear regression can help us investigate the relationship. We just have to take the formula and estimated \(\beta_0\) and \(\beta_1\) . To do so, we estimate the ordinary least square (OLS) estimator. To understand what the OLS estimator does, look at the scatter plot again: The OLS estimator fits a line through all the dots, that minimizes the distance to the dots as much as possible. Afterward, we only extract the intercept \(\beta_0\) (that is the point, where the line crosses the y-axis), and \(\beta_1\) (that is the slope of the line). However, since these are estimated values for our model we have to call them by convention \(\hat{\beta_0}\) and \(\hat{\beta_1}\) . We will denote them in the following as such.
5.1.2.1 Visualization
The visual estimation is good to get an intuition with the data, but we will see later, that it does not work with multiple independent variables. Further, it does not give much information, at least not as much as we would like to have.
ggplot(df, aes(x, y)) +
geom_point() +
theme_bw() +
scale_x_continuous(breaks = seq(0, 10, by = 1)) +
scale_y_continuous(breaks = seq(0, 30, by = 5)) +
geom_smooth(method = "lm", se = FALSE) +
geom_segment(aes(x = x, y = y, xend = x, yend = predict(lm(y ~ x, data = df))), linewidth = 0.5) ## `geom_smooth()` using formula = 'y ~ x'
5.1.2.2 Calculation per Hand
We could also just calculate \(\hat{\beta_0}\) and \(\hat{\beta_1}\) by hand. The formula for both are as follows:
\[ \hat{\beta_1} = \frac{\sum_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n(x_i - \bar{x})^2} \]
where \(x_i\) is the answer of respondent i for our independent variable x, \(\bar{x}\) is the average answer of the respondents. Those two values are subtracted, thus the deviation from the mean is calculated. The same goes with our dependent variable y, where \(y_i\) is the answer of respondent i, and \(\bar{y}\) is the average answer of all respondents. his is done for every respondent i, thus n - times. This is displayed with the sum symbol. We just calculated the so-called covariance.
The covariance is then divided by the squared deviation to the average answer of our independent variable x. This will get us the estimated coefficient for our model \(\hat{\beta_1}\) . Sounds complicated but lets do it in R:
#Let us first get the covariance
cov <- sum((df$x - mean(df$x)) * (df$y - mean(df$y)))
#Now we get the variance of x
x_sq <- sum((df$x - mean(df$x))^2)
x_sq## [1] 246.1869
## [1] 1.539357
To get the intercept \(\hat{\beta_0}\) we have to take the result of the slope and implement it into this formula:
\[ \hat{\beta_0} = \bar{y} - \hat{\beta_1}*\bar{x} \]
We multiply \(\hat{\beta_1}\) with the average of our independent variable X. The result is subtracted from the average of our dependent variable y.
## [1] 1.682133
Now we estimated our parameters and can display the for our model by simply putting it into the systematic component:
\[ Y_i = 1.68 + 1.54 * X_i \]
5.1.2.3 Automated Calculation
This procedure by hand is way to time-wasting. R has a built in function to calculate the parameter for us called lm() :
#running a linear regression
model1 <- lm(y ~ x,
data = df)
#Printing a summary of the model results
summary(model1)##
## Call:
## lm(formula = y ~ x, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1985 -1.4215 -0.4309 1.5505 5.9720
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.6821 1.0376 1.621 0.116
## x 1.5394 0.1621 9.496 2.97e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.543 on 28 degrees of freedom
## Multiple R-squared: 0.7631, Adjusted R-squared: 0.7546
## F-statistic: 90.18 on 1 and 28 DF, p-value: 2.974e-10
When we check the results, we see that we did everything right and get the exact same values. The interpretation of the coefficient is with a one-unit increase in the independent variable X, the dependent increases on average about 0.68 units, holding all else constant. But you noticed that you get more information on the model, than just the coefficients. We will get to that later.
5.1.3 Predictions with Linear Regression
We could calculated predictions based on our OLS calculations, reconsider the systematic component we calculated:
\[ \hat{y_i} = 1.68 + 1.54x_i \]
We just have to put in x-values and according to our model we would get a prediction of the respondents y-value \(\hat{y_i}\). We can do that for all x-values in our dataset and get a column with all the predicted values, let us call it (y_hat) :
#First, we calculate the predictions for y
df$y_hat <- 1.6821 + 1.5394*df$x
#We could also do it automatically via the predict() function
df$auto_y_hat <- predict(model1)
#Checking it
head(df)## x y categorical_variable y_hat
## 1 2.875775 6.680633 0 6.109068
## 2 7.883051 12.527668 1 13.817269
## 3 4.089769 10.029008 0 7.977891
## 4 8.830174 17.562679 1 15.275270
## 5 9.404673 18.312220 1 16.159653
## 6 0.455565 3.594825 0 2.383397
## auto_y_hat
## 1 6.108979
## 2 13.816967
## 3 7.977750
## 4 15.274927
## 5 16.159286
## 6 2.383411
5.2 Hypothesis Testing in R
We are not only interested if our model fits or not. We want to investigate real world phenomena. Well, and if our model does not fit well, we have to make adjustments so it does. What comes next is that we want to know more about the world. Researcher do so by formulating hypotheses. These are nothing more than assumptions you make theoretically about the world. Let us say you think you assume that in our world education has an impact on income. This would be your Alternative Hypothesis \(H_A\) . What you know want to do is to test it against the so-called Null Hypothesis \(H_0\). That is nothing more than the opposite of our alternative hypothesis, thus that education has no impact on income. Let us formulate both to have an overview:
\(H_0\) = Education does not impact Income.
\(H_A\) = The more educated a person is, the higher the income.
The advantage of this approach is obvious, one of those will be true. Therefore statistical testing is needed. But before introducing it to you, we have to decide how we want to test our alternative hypothesis. More specifically, how do we want to measure our variables. For our example, we decide to conduct a survey and to get data by asking a random sample of 1000 people over 18 living in Germany (N=1000). We decide to measure our independent variable Education, by asking the respondents about the years they invested in their education. To get their income, you just ask them about it to fill it in.
Can you think about possible critics about our data collection strategy?
5.2.1 Standard Error
5.2.1.1 Root Mean Square Error (RMSE)
Before moving on to Standard Errors, I will introduce another metric, the root mean square error. It is the average difference between the actual values \(y_i\) and our predicted values \(\hat{y_i}\). To calculate it, we square the residuals, to get only positive values. Then we take the mean, and lastly we take the square root:
\[ \text{RMSE} = \sqrt{\frac{\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}{n}} \]
#Getting the residuals
residuals <- df$y_hat - df$y
#Getting the squared residuals
squared_residuals <- residuals^2
#Getting the mean of the squared residuals
mean_sq_resid <- mean(squared_residuals)
#Getting the RSME
rmse <- sqrt(mean_sq_resid)
#Or in one line
#rmse <- sqrt(mean((df$y_hat - df$y)^2))The rule of thumb is that the lower the RSME, the better. It is a non-standardized goodness of fit measure. Its counterpart is the R-squared measure, which is a standardized measure. You can use both, and should use both to get a metric about how close the predicted values are distributed around the actual values.
5.2.1.2 Standard Error of the Estimate
The standard error of a coefficient estimate is the metric directly presented next to the coefficient in the regression output. It is the square root of the variance of the regression coefficient.
#Getting the residuals
residuals <- df$y - df$y_hat
#getting sigma squared
sigma_sq <- sum(residuals^2) / (nrow(df) - 2)
#getting standard error of beta 1
se_beta1 <- sqrt(sigma_sq / sum((df$x - mean(df$x))^2))
#print it
se_beta1## [1] 0.1620985
If \(\hat{\beta}\) is bigger than \(2 \cdot SE(\hat{\beta})\) the estimate is statistically significant at roughly the 5% level. This is a quick mental check — the exact critical value is 1.96 for large samples and depends on the degrees of freedom for small ones. For a precise test, look at the p-value.
5.2.2 T-Value or T-Statistic
The formula for calculating the t-value is simple:
\[ t_i = \frac{\beta_i}{SE(\beta_i)} \]
where \(\beta_i\) is the coefficient calculated by the linear regression, and \(SE(\beta_i)\) is the standard error. Let us calculate it manually by hand for our example:
#t value by hand
t_value_intercept <- -0.32773/0.30271
t_value_x <- 0.67949/0.04813
#printing it
print(t_value_intercept) #-1.082653## [1] -1.082653
## [1] 14.11781
Well, that is the t-value on its own does not tell us about statistical significance. It is used to calculate the p-value, which tells us about the significance of the value. But before we calculate it, we have to understands some key concepts before.
Degrees of Freedom is the first concept. Let us say, we have a sample of me and my sister (N = 2). We collected the numbers of books each of us has, and calculated a mean of 30. I have 20 books. How many books does my sister have? Obviously, she has 40, if we have a mean of 30. Given my value, and the mean of the sample, the value of my sister had to be 20. That is what the degrees of freedom tells us. Given the mean of a sample, the values, which can freely be chosen. Or to put it more technically, the maximum number of logically independent values.
The rule is that the larger the sample, the higher the degrees of freedom and the lower the sample, the lower the degrees of freedom.
T-distributions are the distribution, we will use to calculate the p-value. I will talk about that in detail in a minute. But they are connected to the degrees of freedom, because degrees of freedom determine the tail behavior and shape of the curve until the point, where the t-distribution looks like a standard normal distribution. Let us visualize that:
#Generate data x <- seq(-5, 5, length.out = 100) #Calculate densities densities <- data.frame( x = rep(x, 4), density = c(dt(x, df = 1), dt(x, df = 2), dt(x, df = 10), dnorm(x, mean = 0, sd = 1)), distribution = rep(c("t(df=1)", "t(df=2)", "t(df=10)", "Normal"), each = 100) ) densities$distribution <- factor(densities$distribution, levels = c("Normal", "t(df=10)", "t(df=2)", "t(df=1)")) #Plot ggplot(densities, aes(x = x, y = density, color = distribution)) + geom_line() + theme_minimal() + labs(x = "x", y = "Density", title = "t-distributions with different degrees of freedom") + scale_color_manual(values = c("black", "red", "green", "blue")) + scale_x_continuous("X", seq(-5,5,1), limits = c(-5,5))Alright, before we find out, how to determine if a value allows us to reject the null hypothesis, we have to determine a so-called critical value. This is no math or anything, it is a probability we decide about. More precisely, the probability that our t-value and thus our coefficient occured by chance alone. If we say the probability that it occured by chance alone is 10%, then this is our critical value. The rule of thumb is, that a probability lower than 5% that the coefficient occured by chance alone indicates statistical significance. Keep in mind, that the critical value can vary depending on the sample size, the degrees of freedom, the field you are working in etc.
#setting seed
set.seed(42)
#Generate data
x <- seq(-5, 5, length.out = 100)
t_density <- function(x) dt(x, df = 28)
#Calculate densities
t_value_data <- data.frame(
x = rep(x, 1),
density = dt(x, df = 28),
distribution = rep("t(df=28)", 100)
)
#Plot
ggplot(t_value_data, aes(x = x, y = density)) +
geom_line(lineend = "round") +
stat_function(fun = t_density, geom = "area", fill = "gray",
alpha = 0.75, xlim = c(-5, -1.701), n = 10000) +
stat_function(fun = t_density, geom = "area", fill = "gray",
alpha = 0.75, xlim = c(5, 1.701), n = 10000) +
geom_vline(xintercept = -1.701, linetype = "dashed",
colour = "red") +
geom_vline(xintercept = 1.701, linetype = "dashed",
colour = "red") +
ggtitle("t-distribution with 28 df", subtitle = "The pink area marks the interval of significant values on a 95% level") +
geom_segment(x = -1.082653,
xend = -1.082653,
yend = dt(-1.082653, df = 28),
y = -1,
color = "pink",
linetype = "dashed",
linewidth = 0.2) +
annotate("point", x = -1.082653, y = dt(-1.082653, df = 28),
color = "pink") +
scale_x_continuous("X", seq(-5,5,1), limits = c(-5,5)) +
theme_classic() +
theme(legend.position = "none")- As we can see, the blue line representing the value of the intercept is not in the area it would have to be, for us to reject the null hypothesis. However, the t-value of our coefficient from variable is with about 14 far away from the threshold, so we can reject the null hypothesis. This is how the t-value works. The problem is, that the threshold varies with the degrees of freedom, and you will not have the threshold value in your head for every degree of freedom. You could look it up every time or you use p-values, which directly tell you the probability of the coefficient occuring by chance alone.
5.2.3 p-values
The p-value is a statistics that shows us the probability that a statistical measure (in our example 0) is greater/less than an observed value (in our example, the estimated coefficient). The null hypothesis would tell us that our independent variable X has no impact on Y. Statistically speaking, that would mean that our coefficient has a high probability to be zero, because zero means no effect, thus we cannot reject the null hypothesis. But if the probability that our estimated coefficient is 0 is low, we may reject the null hypothesis and would found an effect.
Before showing you two ways to find the p-value, we have to determine a critical value. This is no math or anything, it is a probability we decide about. The rule of thumb is that if the p-value is smaller than 5%, we can reject the null hypothesis. However, depending on various factors, it could also be different, that depends on your data, sample size, degrees of freedom etc.
For our example, we follow the rule of thumb and set the significance level to 0.05, thus if the coefficient has a smaller probability (p-value < 0.05) than 5% to be zero we can reject the null hypothesis, otherwise we cannot reject it.
Since the math is complicated and thus not help to understand the intuition, I directly show you how to calculate the p-value by hand:
#calculating the p values by hand
p_value_1 <- 2 * pt(-abs(t_value_intercept), 28)
p_value_2 <- 2 * pt(-abs(t_value_x), 28)
#printing it
print(p_value_1) ## [1] 0.2881981
## [1] 2.941061e-14
We get the same values as in our regression, and we decided before that our critical value (denoted as \(\alpha\)) should be less than 0.05 to reject the null hypothesis. Therefore we can reject the null hypothesis for the coefficient of our variable x. Since the coefficient is positive, the interpretation would be that x has on average a positive effect on y, since the probability that the value is different than 0 is lower than the critical value of 5%.
5.2.4 Confidence Interval
The last way of testing hypothesis are confidence intervals. I recommend to use them, since they are intuitive and easier to interpret than p-values. To understand confidence intervals, we must remember the difference between a population and a sample. The population are all people we want to infer to, for example in an election, the population are all citizens eligible to vote in that election. Let us assume, all else equal, that the the vote share in the population for Party A is 24%. This is what we call the true population parameter. And our goal is to estimate this parameter with statistical methods, since it is more efficient. Think about Germany, we cannot ask 80 million people before the election to give us their thoughts, so what we do instead, is to draw a representative sample of less citizens from that sample to infer to the population. The problem is that even if the sample is representative, it could be that our sample today gives us a vote share of Party A of 23%, but tomorrow we would get 25% (This could have several reasons, can you think of some?).
In the following, let us assume we draw 1000 samples before the election, to get the vote share of Party A. We know that our true population parameter is 24%. But before we start, we have to set a rule again. This rule is basically the definition of confidence intervals:
- In 95% of all samples, that could be drawn, the confidence intervals will cover the true population parameter.
So if we draw 1000 samples, in 950 the confidence intervals have to cover the true population parameter:
#Since this is a simulation we need to set a seed
set.seed(187)
#We will need to have vectors for the upper confidence interval and the lower one
lower_ci <- numeric(100)
upper_ci <- numeric(100)
estimates <- numeric(100)
#This loop represents
for(i in 1:length(lower_ci)) {
Y <- rnorm(100, mean = 24, sd = 2)
estimates[i] <- Y[i]
lower_ci[i] <- Y[i] - 1.96 * 24 / 10
upper_ci[i] <- Y[i] + 1.96 * 24 / 10
}
#Let us bind both vectors together
CIs <- data.frame(estimates, lower_ci, upper_ci)
#Print it
head(CIs)## estimates lower_ci upper_ci
## 1 22.86723 18.16323 27.57123
## 2 25.36001 20.65601 30.06401
## 3 26.32276 21.61876 31.02676
## 4 23.37235 18.66835 28.07635
## 5 22.71026 18.00626 27.41426
## 6 21.89276 17.18876 26.59676
Now we have drawn our 1000 samples and computed our estimates as well as their corresponding intervals. Thus, 1000 samples about the vote share of Party A. Let us check, if the true population parameter is 95% of times within our computed confidence intervals:
#Getting the true mean
true_mean <- 24
#First, we identify those who are not including 24 our true population parameter
CIs$missed <- ifelse(CIs$lower_ci > true_mean | CIs$upper_ci < true_mean, "Out", "In")
#Let us give every sample an identification number
CIs$id <- 1:nrow(CIs)
#Plotting it
ggplot(data = CIs) +
geom_pointrange(
aes(
x = estimates, #point value
xmin = lower_ci, #lower CI
xmax = upper_ci, #upper CI
y = id, #y axis-just observation number
color = missed
) #color varies by missed variable
) +
geom_vline(
aes(xintercept = true_mean), #add vertical line at true_mean
) +
scale_color_manual(values = c("azure4", "red")) +
theme_minimal() +
labs(
title = "Confidence Interval for Mean",
subtitle = "Population mean equals 24",
x = "Estimates",
y = "Sample",
color = "Is true population parameter inside the CI?"
) +
theme(legend.position = "top") + #switch the legend to the top
scale_x_continuous(breaks = c(seq(15, 30, by = 1)))Well, we can see that the majority of the computed intervals include the true population parameter. But there are three, which are not. That is within our definition.
What is now with confidence intervals for coefficients of linear regression? Well, it is the same story, we compute an estimate, in this case a coefficient. The true population parameter is unknown, but we know that the true population parameter is within 95% of the intervals.
Its calculation is fairly easy and you already saw it:
\[ CI_{lower} = \beta_i - 1.96 * SE(\beta_i) \\\ CI_{upper} = \beta_i + 1.96 * SE(\beta_i) \]
## 2.5 % 97.5 %
## (Intercept) -0.4432173 3.807484
## x 1.2073136 1.871401
#Let us compute the confidence values by hand for the intercept and x
ci_lower_int <- model1$coefficients[1] - 1.96 * summary(model1)$coef[, "Std. Error"][1]
ci_upper_int <- model1$coefficients[1] + 1.96 * summary(model1)$coef[, "Std. Error"][1]
#Print it
print(ci_lower_int)## (Intercept)
## -0.3514894
## (Intercept)
## 3.715756
#Estimate X
ci_lower_est <- model1$coefficients[2] - 1.96 * summary(model1)$coef[, "Std. Error"][2]
ci_upper_est <- model1$coefficients[2] + 1.96 * summary(model1)$coef[, "Std. Error"][2]
#Print it
print(ci_lower_est)## x
## 1.221644
## x
## 1.857071
5.3 Multivariate Regression
The world is a complex place and of course if we have a dependent variable Y, let us say for example income, we cannot explain it exclusively by the years of education or exclusively by the profession or any other single factor. Rather it is plausible that all these factors matter. The Multivariate Linear Regression Model allows us that we explain the variation in our dependent variable Y with multiple independent variables X. Mathematically, the systematic component changes like this:
\[ Y_i = \beta_0 + \beta_1X_{1i} + \beta_2X_{2i} + ... + \beta_kX_{ki} + \epsilon_i \]
where
the index i runs over the observations (Respondents, Countries,…), i = 1,…,n
\(Y_i\) is the dependent variable, the variable we want to explain
\(X_{1i}\) is the first independent variable or explanatory variable.
\(X_{2i}\) is the second independent variable or explanatory variable.
\(X_{ki}\) is the k-th independent variable or explanatory variable, where k is the number of all our independent variables in our model.
\(\beta_0\) is the intercept of the regression line
\(\beta_1\) is the slope of the regression line of the first explanatory variable.
\(\beta_2\) is the slope of the regression line of the second explanatory variable.
\(\beta_k\) is the slope of the regression line of the k-th explanatory variable, where k is the number of all our independent variables in our model.
\(\epsilon_i\) is the error term, thus how our observed data differs from actual population data (e.g. Measurement Error).
In R, we can implement a multivariate model very easy with the already known lm() function. Let us run a multiple regression with our independent Variable X and our categorical Variable Z, the systematic component for this model looks like this:
\[ Y_i = \beta_0 + \beta_1X_i + \beta_2Z_i + \epsilon_i \]
Let us compute the slopes:
##
## Call:
## lm(formula = y ~ x + categorical_variable, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5183 -1.3840 -0.5014 1.2393 5.5808
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.9799 1.0679 1.854 0.0747
## x 1.5557 0.1621 9.596 3.42e-10
## categorical_variable1 -1.0667 0.9637 -1.107 0.2781
##
## (Intercept) .
## x ***
## categorical_variable1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.533 on 27 degrees of freedom
## Multiple R-squared: 0.7734, Adjusted R-squared: 0.7566
## F-statistic: 46.07 on 2 and 27 DF, p-value: 1.982e-09
We can see, that this works fine, we only had to add the independent variable in the command such as in our systematic component. Regarding the interpretation it is analogous to the normal linear regression one. For a one-unit increase in our independent variable x, the dependent variable y increases 0.67 units on average, holding all else constant. For category 1 in comparison to category 0, the dependent variable y increases 0.52 units on average, holding all else equal. Now, we can put the calculated coefficients into our systematic component and make predictions:
\[ Y_i = -0.54 + 0.67X_i + 0.52Z_i + \epsilon_i \]
If we want to visualize this, we would need a three-dimensional coordinate system. Why? Because every independent variable is one-dimension if you want. I said that k is the number of our independent variables, technically it is the number of dimensions our model has. I am not a fan of plotting graphs more than three dimensions and I do not recommend it to you either.
5.4 Categorical Variables
As I mentioned in the first chapter, there are different types of variables, let us reconsider them:
| Numeric | Numbers | c(1, 2.4, 3.14, 4) |
| Character | Text | c("1", "blue", "fun", "monster") |
| Logical | True or false | c(TRUE, FALSE, TRUE, FALSE) |
| Factor | Category | c("Strongly disagree", "Agree", "Neutral") |
Dependent variables must be numeric, when conducting linear regression. However, independent variables can take be scaled differently. Numeric variables are the easiest case, the interpretation is as mentioned in the previous examples. But categorical variables are differently to interpret. Let us inspect the categorical variable in our dataset, called categorical_variable:
##
## 0 1
## 19 11
As we can see our data set now contains a categorical variable with two categories, named “A” and “B”. Those categories could be anything: Female and Male, bought a product or did not buy a product, vaccine or placebo and so on. What happens, when we now run a model?
#running a model with a categorical variable
model2 <- lm(y ~ categorical_variable,
data = df)
#Getting the summary
summary(model2) ##
## Call:
## lm(formula = y ~ categorical_variable, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.2770 -3.7683 0.1863 4.0996 10.2377
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.5765 1.1985 8.825 1.41e-09
## categorical_variable1 -0.2265 1.9792 -0.114 0.91
##
## (Intercept) ***
## categorical_variable1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.224 on 28 degrees of freedom
## Multiple R-squared: 0.0004673, Adjusted R-squared: -0.03523
## F-statistic: 0.01309 on 1 and 28 DF, p-value: 0.9097
We see a coefficient of 0.64 that is fine. But as you see the category “category_A” is not display, why? Categorical variables are calculated based on so-called “reference categories”. R determines the reference category based on alphabetical order. In this case the category “A” is the reference category. The reference category is named like this, since the computed coefficients refer to it. The reference category “A” takes on the value 0. The coefficient of the category B is “0.92”. That means that category “B” if the a respondent is part of category “B” than the dependent variable Y increases on average 0.64 units in comparison to category “A”, holding all else equal.
That sounds technocratic, let us get an intuition. We add a zero to our code and look at the output:
#running the model
model3 <- lm(y ~ categorical_variable + 0,
data = df)
#getting a summary
summary(model3)##
## Call:
## lm(formula = y ~ categorical_variable + 0, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.2770 -3.7683 0.1863 4.0996 10.2377
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## categorical_variable0 10.576 1.198 8.825 1.41e-09
## categorical_variable1 10.350 1.575 6.571 3.99e-07
##
## categorical_variable0 ***
## categorical_variable1 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.224 on 28 degrees of freedom
## Multiple R-squared: 0.8122, Adjusted R-squared: 0.7987
## F-statistic: 60.53 on 2 and 28 DF, p-value: 6.812e-11
The coefficients changed, but not really. Let us subtract the coefficient from category A from category B:
#results
result <- coefficients(model3)[2] - coefficients(model3)[1] #coefficents can be extracted this way
#printing it
result## categorical_variable1
## -0.2264615
We get the same coefficient as above and that is what R does automatically when computing coefficients of categorical variables. The reference category is scaled to 0 and the other categories are following.
5.4.1 Interaction Effects
One technique, which is important and widely used in statistics are interaction effects. To keep things simple concentrate on our independent variable X and our categorical variable Z. As mentioned, Z shows us the coefficients for our categories “A” and “B”. But what if we could further investigate the dynamics of the group? Interaction effects allow us to multiply our variable X by our categorical variable Z. Mathematically our systematic component looks like this:
\[ Y_i = \beta_0 + \beta_1X_i + \beta_2Z_i + \beta_3X_i*Z_i + e_i \]
where
the index i runs over the observations (Respondents, Countries,…), i = 1,…,n
\(Y_i\) is the dependent variable, the variable we want to explain
\(X_i\) is the independent variable or explanatory variable.
\(Z_i\) is our categorical variable
\(\beta_0\) is the intercept of the regression line
\(\beta_1\) is the slope of the regression line
\(\epsilon_i\) is the error term, thus how our observed data differs from actual population data (e.g. Measurement Error).
In the following, I will show you an example how to use interaction effects. Let’s say that I conducted a survey among 1000 respondents and asked them about how many hours they spent on R online courses such as this one. Then I asked about their coding ability. Lastly, I asked if they learned with this course or with other courses (0 = other courses, 1 = this course). Now we want to find out if spending more hours on a course lead to a higher coding ability in R. Furthermore, we want to find out if this course in comparison to other courses lead to a higher ability. To do so, we interact hours spent on courses with the variable indicating if the respondent worked through other courses or this course.
- Interactions effects can be implemented in R, by simply writing it explicitly into the function
lm(), either with an asterisks*or a double point:, let us have a look it:
#Setting seed for reproducibility
set.seed(123)
#Generate hours spent on a course
hours_spent <- runif(100, min = 0, max = 10)
#Generate the course dummy (0 = other courses, 1 = this course)
this_course = sample(c(0, 1), 100, replace = TRUE)
#Generate y with interaction effect
coding_ability <- 2 + 0.5 * hours_spent + 0 * this_course +
1.5 * hours_spent * this_course + rnorm(100)
#Create a data frame
df_int <- data.frame(hours_spent, this_course, coding_ability)
#Fit the interaction model
model_interaction <- lm(coding_ability ~ hours_spent * this_course, data = df_int)
#Summarizing models
summary(model_interaction)##
## Call:
## lm(formula = coding_ability ~ hours_spent * this_course, data = df_int)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9122 -0.7295 -0.0765 0.5901 3.3882
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.82617 0.32506 5.618 1.88e-07
## hours_spent 0.52184 0.05380 9.699 6.59e-16
## this_course 0.03336 0.41030 0.081 0.935
## hours_spent:this_course 1.47571 0.07068 20.877 < 2e-16
##
## (Intercept) ***
## hours_spent ***
## this_course
## hours_spent:this_course ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9799 on 96 degrees of freedom
## Multiple R-squared: 0.9696, Adjusted R-squared: 0.9687
## F-statistic: 1022 on 3 and 96 DF, p-value: < 2.2e-16
Interaction effects on their own are not intuitive. To get an intuition we have to graphically plot it. The sjPlot package offers the plot_model() command, which automatically plots so called predicted probabilities. It would be too much to go into detail about predicted probabilities. What is more important is, that we get a graph, which shows the effect of of the interaction:
We call
plot_model()and include our model with the interaction effectmodel_interactionThen we have to call the type and set it to
type="int", which explicitly plots interaction effects.
plot_model(model_interaction, type = "int") +
scale_x_continuous(breaks = seq(0,10, 1)) +
labs(title = "Coding Ability after this Course in Comparison",
x = "Hours spent",
y = "Coding Ability in R") +
scale_color_manual(
values = c("red", "blue"),
labels = c("Other Courses", "This Course")
) +
theme_sjplot() +
theme(legend.position = "bottom",
legend.title = element_blank())On the x-axis, we have our independent variable and on the y-axis we do have our dependent variable. So far, so normal. But we see two lines: One line for all observations in category “Other Courses” (red line) and one line for all observations with “This Course” (blue line). That is exactly what an interaction does. For every point in our independent variable a probability for y is computed in one category, and one line for every observation in the other category.
To interpret it we follow two steps: First, looking at the direction of the lines. We can see that X has a positive effect on Y. Both lines are increasing with higher X-values. Now, the interaction allows us to compare the effect from X on Y for all groups of the categorical variable. In our example that means, that observations in group 1 display a higher effect than observations in group 0. That could be for several reasons.
Let us fill this example with life: We are studying the effect of study hours (x) on exam scores (y) for two different groups of students: those who attend a preparatory course (group 1) and those who do not (group 0).
Interpretation:
- Direction of the Lines:
- Both lines (for group 0 and group 1) are increasing, indicating that more hours spent on courses lead to higher coding ability in R, on average.
- Effect Comparison:
The slope for group 1 ( attended this course) is steeper compared to group 0 (attended to other courses).
This means that for students who attended this course, each additional hour of study has a larger positive impact on their coding ability in R compared to those who did attend another course
By visualizing and analyzing the interaction effect, we can draw meaningful conclusions about how different factors (like attending a preparatory course) modify the relationship between study efforts and performance outcomes.
5.5 Outlook Data Analysis
This chapter was an introduction to hypothesis testing and basic statistical applications. Further, you were introduced in the most popular and important model, linear regression. The chapter showed not only how linear regression in R is computed, but also how to check if effects are robust and how the models fits. Also extensions to linear regression, namely multivariate regression and categorical variable were introduced.
Linear regression is the basis of nearly everything. Advanced modelling of different classes of dependent variables to even machine learning techniques are based on linear regression. This was the reason, I did not show just linear regression, but also how vulnerable the model is. Poor modelling will always lead to poor results, so we have to aim to check how good our model fits our data and thus research interest.
Further Links:
- A course I can recommend if you want to have a detailed deep dive into statistics is “Introduction to Econometrics with R” by Christoph Hanck, Martin Arnold, Alexander Gerber, and Martin Schmelzer.
5.6 Exercises Data Analysis
To understand linear regression, we do not have to load any complicated data. Let us assume, you are a market analyst and your customer is the production company of a series called “Breaking Thrones”. The production company wants to know how the viewers of the series judge the final episode. You conduct a survey and ask people on how satisfied they were with the season final and some social demographics. Here is your codebook:
| Variable | Description |
|---|---|
| id | The id of the respondent |
| satisfaction | The answer to the question “How satisfied were you with the final episode of Breaking Throne?”, where 0 is completely dissatisfied and 10 completely satisfied |
| age | The age of the respondent |
| female | The Gender of the respondent, where 0 = Male, 1 = Female |
Let us generate the data:
#setting seed for reproduciability
set.seed(123)
#Generating the data
final_BT <- data.frame(
id = c(1:10),
satisfaction = round(rnorm(10, mean = 6, sd = 2.5)),
age = round(rnorm(10, mean = 25, sd = 5)),
female = rbinom(10, 1, 0.5)
)
print(final_BT)## id satisfaction age female
## 1 1 5 31 0
## 2 2 5 27 0
## 3 3 10 27 0
## 4 4 6 26 0
## 5 5 6 22 0
## 6 6 10 34 0
## 7 7 7 27 0
## 8 8 3 15 0
## 9 9 4 29 0
## 10 10 5 23 1
5.6.1 Exercise 1: Linear Regression with two variables
You want to know if age has an impact on the satisfaction with the last episode. You want to conduct a linear regression.
a. Calculate \(\beta_0\) and \(\beta_1\) by hand
b. Calculate \(\beta_0\) and \(\beta_1\) automatically with R
c. Interpret all quantities of your result: Standard Error, t-statistic, p-value, confidence intervals and the \(R^2\).
d. Check for influential outliers