An explanation of simple linear regression with an R script


1. How are the regression coefficients calculated in simple linear regression?

  • In simple linear regression, a linear function is used to describe the relationship between one independent variable (x) and a dependent variable (y) as accurately as possible, and to predict unseen values of the dependent variable for given values of the independent variable. There is a single independent variable, and it is continuous.

  • The formula of simple linear regression is Y = β0 + β1X, where X is a given value of the independent variable, β0 (the intercept) and β1 (the slope) are the regression coefficients, and Y is the predicted value of the dependent variable.

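  • The most common method for estimating the regression coefficients is the least squares method, which chooses β0 and β1 so that the sum of squared differences between the observed and predicted values of Y is as small as possible. The estimates are expressed in terms of the following quantities:

    • X̄ is the mean of the Xi and Ȳ is the mean of the Yi

      rXY is the correlation coefficient between X and Y

      SX and SY are the standard deviations of X and Y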
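
Written with these symbols, the least squares estimates take the standard closed form below, which is also what the R script in section 2 computes. The first line states the quantity being minimized, and the index n (the number of observations) is introduced here only for the summation:

```latex
\begin{aligned}
(\beta_0, \beta_1) &= \arg\min_{b_0,\, b_1} \sum_{i=1}^{n} \left( Y_i - b_0 - b_1 X_i \right)^2 \\
\beta_1 &= r_{XY} \, \frac{S_Y}{S_X} \\
\beta_0 &= \bar{Y} - \beta_1 \bar{X}
\end{aligned}
```

The second and third lines correspond to `b1` and `b0` in the script.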

2. R script for understanding simple linear regression

  • Given a data set of 1,078 father (X) and son (Y) height pairs, the following code draws dashed reference lines at the means of X and Y, then finds a linear function by calculating the regression coefficients and uses it to make predictions.

(1) Importing data

heights = read.csv('data/heights.csv')
head(heights)

##     father      son
## 1 165.2232 151.8368
## 2 160.6574 160.5637
## 3 164.9865 160.8897
## 4 167.0113 159.4926
## 5 155.2886 163.2741
## 6 160.0773 163.1752

tail(heights)

##        father      son
## 1073 171.9747 151.9350
## 1074 170.1719 179.7109
## 1075 181.1828 173.4001
## 1076 182.3292 176.0370
## 1077 179.6755 176.0271
## 1078 178.5775 170.2181

(2) Drawing a scatterplot and mean lines

plot(heights, pch=16, col='#3377BB77')
abline(v=mean(heights$father), lty=2)
abline(h=mean(heights$son),lty=2)

(3) Calculating the regression coefficients

r_xy = cor(heights$father, heights$son)
r_xy

## [1] 0.5013383

sd_x = sd(heights$father)
sd_y = sd(heights$son)

b1 = r_xy * sd_y / sd_x  # slope: correlation times the ratio of standard deviations (SY / SX)
b1

## [1] 0.514093

b0 = mean(heights$son) - b1*mean(heights$father)
b0

## [1] 86.07198

plot(heights, pch=16, col='#3377BB77')
abline(v=mean(heights$father), lty=2)
abline(h=mean(heights$son),lty=2)
abline(a=b0, b=b1, col='red', lwd=2)

The red line represents the fitted linear relationship between fathers' and sons' heights. It passes through the point where the two dashed mean lines cross.
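
As a cross-check, the hand-calculated coefficients can be compared with R's built-in least squares fit, `lm()`. The sketch below is self-contained, so it uses synthetic data as a stand-in for `heights.csv`. The simulated values, the seed, and the sample size are assumptions made for illustration and are not the real data set.

```r
# Synthetic stand-in for the heights data (assumption: NOT the real data set)
set.seed(1)
father <- rnorm(200, mean = 170, sd = 7)
son    <- 86 + 0.5 * father + rnorm(200, sd = 6)

# Coefficients from the correlation and the standard deviations
b1 <- cor(father, son) * sd(son) / sd(father)
b0 <- mean(son) - b1 * mean(father)

# Coefficients from R's built-in least squares fit
fit <- lm(son ~ father)
coef(fit)

# The two approaches should agree up to floating-point error
all.equal(unname(coef(fit)), c(b0, b1))
```

With the real data, the same comparison would be `lm(son ~ father, data = heights)` against the `b0` and `b1` computed above.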

(4) Making predictions with the calculated regression coefficients

b0 + b1*175  # predicted son's height when the father's height is 175

## [1] 176.0383

b0 + b1*190  # predicted son's height when the father's height is 190

## [1] 183.7497
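
Because R arithmetic is vectorized, several predictions can be made in one call. The sketch below wraps the formula in a helper named `predict_son`, a hypothetical name introduced here, and hard-codes the rounded coefficients printed in step (3) so that it runs on its own.

```r
# Coefficients as printed in step (3) of this post (rounded values)
b0 <- 86.07198
b1 <- 0.514093

# Hypothetical helper: predicted son's height for one or more father's heights
predict_son <- function(father_height) {
  b0 + b1 * father_height
}

# Predict for both example heights at once
predict_son(c(175, 190))
```

The results should match the two values above up to rounding of the coefficients. If the model were fitted with `lm()` instead, `predict()` with a `newdata` data frame would serve the same purpose.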

Reference

  • Fast Campus (패스트캠퍼스), "Introduction to Data Analysis All-in-One Package" course (데이터 분석 입문 올인원 패키지 강의)
