[Linear Regression] Applying simple and multiple linear regression with R script

2 minute read

How to apply simple linear regression model and interpret results

(1) Opening the file

  heights = read.csv('data/heights.csv')
  head(heights)

##     father      son
## 1 165.2232 151.8368
## 2 160.6574 160.5637
## 3 164.9865 160.8897
## 4 167.0113 159.4926
## 5 155.2886 163.2741
## 6 160.0773 163.1752

There are two numerical variables: father’s height and son’s height. Let’s set the former as independent variable(x) while the latter is dependent variable(y).

(2) Drawing a scatterplot to visually display the relationship between variables

  plot(heights$father, heights$son, pch=16, col='#3377BB77')+
  abline(v=mean(heights$father), lty=2)+
  abline(h=mean(heights$son),lty=2)

(3) Using lm( ), fitting simple linear regression model to the data

  lm_heights = lm(son ~ father, data=heights)
  summary(lm_heights)

## 
## Call:
## lm(formula = son ~ father, data = heights)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.5480  -3.8466  -0.0201   4.1364  22.7799 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 86.07198    4.65418   18.49   <2e-16 ***
## father       0.51409    0.02705   19.01   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.189 on 1076 degrees of freedom
## Multiple R-squared:  0.2513, Adjusted R-squared:  0.2506 
## F-statistic: 361.2 on 1 and 1076 DF,  p-value: < 2.2e-16
  • According to the estimate part from the results, the value of β0 is 86.07 and β1 is 0.514 where the simple linear regression equation is Y = β0 + β1X
  • The coefficient of determination (R2) is 0.25. It means that 25 percent of the variance in the dependent variable(son’s height) can be attributed to the independent variable(dad’s height).


How to apply multiple linear regression model and interpret results

(1) Opening the file

  sales = read.csv('data/sales.csv')
  head(sales)

##   sales price advert
## 1  73.2  5.69    1.3
## 2  71.8  6.49    2.9
## 3  62.4  5.63    0.8
## 4  67.4  6.22    0.7
## 5  89.3  5.02    1.5
## 6  70.3  6.41    1.3

  nrow(sales) 

## [1] 75

There three numerical variables: sales, price of products and advertising price. Let’s set sales as dependent variable(y) while the other two are independent variable(x1 and x2).

(2) Drawing scatterplots to visually display the relationships between variables

  • The scatterplot of price and sales

    plot(sales$price, sales$sales, pch=16, col='#3377BB77')
    
  • The scatterplot of advert and sales

    plot(sales$advert, sales$sales, pch=16, col='#3377BB77')
    

(3) Using lm( ), fitting multiple linear regression model to the data

  lm_sales = lm(sales ~ price+advert, data=sales) # '+' mark for multiple independent variables 
  summary(lm_sales)

## 
## Call:
## lm(formula = sales ~ price + advert, data = sales)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.4825  -3.1434  -0.3456   2.8754  11.3049 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 118.9136     6.3516  18.722  < 2e-16 ***
## price        -7.9079     1.0960  -7.215 4.42e-10 ***
## advert        1.8626     0.6832   2.726  0.00804 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.886 on 72 degrees of freedom
## Multiple R-squared:  0.4483, Adjusted R-squared:  0.4329 
## F-statistic: 29.25 on 2 and 72 DF,  p-value: 5.041e-10
  • According to the estimate part from the results, the regression equation is as follows:

    sales = 118.9136 -7.9079sales +1.8626advert

  • The coefficient of determination (R2) is 0.4483. It means that about 45 percent of the variance in sales can be attributed to product price and advertising price.

Reference

  • 패스트 캠퍼스 데이터 분석 입문 올인원 패키지 강의

Updated: