[Chi-squared test] Testing the independence of two categorical variables with R script

1 minute read

Testing the independence of categorical data

(1) Opening the file

  raw = read.csv('data/raw.csv')
  head(raw)

##   PROD   AGE
## 1    B AGE20
## 2    C AGE30
## 3    B AGE30
## 4    A AGE20
## 5    B AGE20
## 6    C AGE30

  nrow(raw)

## [1] 100

There are two categorical variables: products(A,B and C) and age(20 and 30)

(2) Making a cross tabulation and drawing a heatmap

  table_raw = table(raw$PROD, raw$AGE)
  table_raw

##    
##     AGE20 AGE30
##   A    30     0
##   B    20    20
##   C     0    30


  heatmap(table_raw,
          Rowv=NA, Colv=NA, scale='none', col=colorRampPalette(c('ivory','red'))(100), revC = TRUE)

The heatmap shows that people in 20 tend to buy A product while the people in 30 are likely to buy product. Then, let’s do chi-squared test to figure out the two categorical variables are not independent.

(3) Using chisq.test( ), testing the independence of two variables

  chisq.test(table_raw)

## 
##  Pearson's Chi-squared test
## 
## data:  table_raw
## X-squared = 60, df = 2, p-value = 9.358e-14

As the P-value is very small(9.358e-14), the null hypothesis stating that the two variable(product and age) are independent is rejected. Thus, we can say product and age are not independent but related.

Reference

  • 패스트 캠퍼스 데이터 분석 입문 올인원 패키지 강의

Updated: