[R] Practicing linear regression and decision tree with insurance cost data


Practicing linear regression and a decision tree with insurance cost data (from Kaggle).

(1) Downloading the data

Medical Cost Personal Datasets : https://www.kaggle.com/mirichoi0218/insurance

(2) Reading the downloaded CSV file

  insurance = read.csv('data/insurance.csv')
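
A quick sanity check after loading, not part of the original post, using base R's str() and head():

  str(insurance)   # 1338 obs. of 7 variables: 4 numeric and 3 character columns
  head(insurance)  # peek at the first six rows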

(3) Using summary() to get descriptive statistics

  summary(insurance)

##       age            sex                 bmi           children    
##  Min.   :18.00   Length:1338        Min.   :15.96   Min.   :0.000  
##  1st Qu.:27.00   Class :character   1st Qu.:26.30   1st Qu.:0.000  
##  Median :39.00   Mode  :character   Median :30.40   Median :1.000  
##  Mean   :39.21                      Mean   :30.66   Mean   :1.095  
##  3rd Qu.:51.00                      3rd Qu.:34.69   3rd Qu.:2.000  
##  Max.   :64.00                      Max.   :53.13   Max.   :5.000  
##     smoker             region             charges     
##  Length:1338        Length:1338        Min.   : 1122  
##  Class :character   Class :character   1st Qu.: 4740  
##  Mode  :character   Mode  :character   Median : 9382  
##                                        Mean   :13270  
##                                        3rd Qu.:16640  
##                                        Max.   :63770
  • There are four numerical variables: age, bmi, children, and charges.
  • There are three categorical variables: sex, smoker, and region.
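
Because the categorical columns are read in as character, summary() only reports their length and class. A minimal sketch (insurance_f is a name introduced here, not from the original post) that converts them to factors so summary() also shows per-level counts:

  insurance_f = insurance
  insurance_f$sex    = factor(insurance_f$sex)
  insurance_f$smoker = factor(insurance_f$smoker)
  insurance_f$region = factor(insurance_f$region)
  summary(insurance_f)   # factor columns now report counts per level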

(4) Making histograms of the numerical variables and tables of the categorical variables

  hist(insurance$age)
  hist(insurance$bmi)
  hist(insurance$children)
  table(insurance$sex)

## 
## female   male 
##    662    676

  table(insurance$smoker)

## 
##   no  yes 
## 1064  274

  table(insurance$region)

## 
## northeast northwest southeast southwest 
##       324       325       364       325
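
To read the categorical variables as shares instead of raw counts, prop.table() can be wrapped around the same tables (a small addition, not in the original post):

  prop.table(table(insurance$sex))     # female vs. male share
  prop.table(table(insurance$smoker))  # roughly 20% of patients smoke (274 of 1338)
  prop.table(table(insurance$region))  # the four regions are roughly balanced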

(5) Numerical data - making a scatterplot and calculating the correlation coefficient between BMI and charges

  plot(insurance$bmi, insurance$charges, pch=16)
  cor(insurance$bmi, insurance$charges)

## [1] 0.198341
  • Only a weak positive correlation (about 0.20) is found. On its own, this tells us little about the relationship between BMI and charges; a quick significance check is sketched below.
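
One way to go a step further is a correlation test; cor.test() reports a confidence interval and p-value for the Pearson correlation (a quick sketch, not part of the original analysis):

  cor.test(insurance$bmi, insurance$charges)  # Pearson correlation with 95% CI and p-value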

(6) Categorical data - making boxplots of charges by region and calculating average charges by region

  boxplot(charges ~ region, data=insurance)
  aggregate(charges ~ region, data=insurance, mean)

##      region  charges
## 1 northeast 13406.38
## 2 northwest 12417.58
## 3 southeast 14735.41
## 4 southwest 12346.94
  • The average charge in the southeast region is higher than in the other regions.
  • In the boxplot, the median charges look fairly similar across regions, yet the upper half of charges in the southeast is more expensive than in the other regions; the medians themselves can be checked directly, as sketched below.
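
The claim about similar medians can be checked by aggregating with median instead of mean (a small sketch, not in the original post):

  aggregate(charges ~ region, data=insurance, median)  # median charges per region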

(7) Applying multiple linear regression with charges as the response variable and the other columns as explanatory variables

  lm_ins = lm(charges ~ ., data=insurance)
  summary(lm_ins)

## 
## Call:
## lm(formula = charges ~ ., data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11304.9  -2848.1   -982.1   1393.9  29992.8 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11938.5      987.8 -12.086  < 2e-16 ***
## age                256.9       11.9  21.587  < 2e-16 ***
## sexmale           -131.3      332.9  -0.394 0.693348    
## bmi                339.2       28.6  11.860  < 2e-16 ***
## children           475.5      137.8   3.451 0.000577 ***
## smokeryes        23848.5      413.1  57.723  < 2e-16 ***
## regionnorthwest   -353.0      476.3  -0.741 0.458769    
## regionsoutheast  -1035.0      478.7  -2.162 0.030782 *  
## regionsouthwest   -960.0      477.9  -2.009 0.044765 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7494 
## F-statistic: 500.8 on 8 and 1329 DF,  p-value: < 2.2e-16
  • For example, for every one-year increase in age, the model predicts an increase of about 256.9 dollars in charges, holding the other variables fixed.
  • Under the same conditions, being male corresponds to a predicted decrease of about 131.3 dollars, although this coefficient is not statistically significant (p ≈ 0.69). A prediction for a hypothetical patient is sketched below.
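
To see how the coefficients translate into a prediction, the fitted model can be applied to a hypothetical new patient (the values below are made up purely for illustration):

  # hypothetical 40-year-old male non-smoker from the southwest, BMI 30, one child
  new_patient = data.frame(age=40, sex='male', bmi=30, children=1,
                           smoker='no', region='southwest')
  predict(lm_ins, newdata=new_patient)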

(8) Applying a decision tree with charges as the response variable and the other columns as explanatory variables

  library(rpart)
  tree_ins = rpart(charges ~ ., data=insurance)
  tree_ins

## n= 1338 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
## 1) root 1338 196074200000 13270.420  
##   2) smoker=no 1064  38188720000  8434.268  
##     4) age< 42.5 596  13198540000  5398.850 *
##     5) age>=42.5 468  12505450000 12299.890 *
##   3) smoker=yes 274  36365600000 32050.230  
##     6) bmi< 30.01 130   3286655000 21369.220 *
##     7) bmi>=30.01 144   4859010000 41692.810 *

  library(rpart.plot)
  rpart.plot(tree_ins)
  • The first split is on smoking status, dividing the data roughly 8:2 (1064 non-smokers vs. 274 smokers) with a large difference in insurance costs. Smokers pay far more than non-smokers, and smokers whose BMI is over 30 (within the obesity range) pay more than the other smokers.
  • Meanwhile, non-smokers' insurance costs differ depending on whether they are at least 42.5 years old. A rough in-sample comparison of the two models is sketched below.
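
For a rough comparison of the two models, both can predict on the training data and their errors can be compared (a sketch only; a fair comparison would use a train/test split or cross-validation, which the original post does not do):

  pred_lm   = predict(lm_ins, insurance)
  pred_tree = predict(tree_ins, insurance)
  # in-sample root mean squared error of each model
  sqrt(mean((insurance$charges - pred_lm)^2))
  sqrt(mean((insurance$charges - pred_tree)^2))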

Reference

  • Fast Campus, Introduction to Data Analysis All-in-One Package course
