[LDA & QDA] Practicing LDA and QDA for diabetes classification with Python


(1) Importing modules and dataset

import pandas as pd
import numpy as np
import os
os.environ['KAGGLE_USERNAME'] = "jisuleeoslo"
os.environ['KAGGLE_KEY'] = ""
!kaggle datasets download -d mathchi/diabetes-data-set
diabetes-data-set.zip: Skipping, found more recently modified local copy (use --force to force download)
import zipfile
with zipfile.ZipFile("diabetes-data-set.zip","r") as zip_ref:
    zip_ref.extractall()
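
As a side note, the Kaggle CLI can also read credentials from ~/.kaggle/kaggle.json instead of notebook environment variables; a minimal sketch (the username and key below are placeholders, not real credentials):

import json, os, stat

# Write placeholder Kaggle API credentials to ~/.kaggle/kaggle.json,
# which the kaggle command-line tool picks up automatically.
kaggle_dir = os.path.join(os.path.expanduser("~"), ".kaggle")
os.makedirs(kaggle_dir, exist_ok=True)
cred_path = os.path.join(kaggle_dir, "kaggle.json")
with open(cred_path, "w") as f:
    json.dump({"username": "your_username", "key": "your_api_key"}, f)
os.chmod(cred_path, stat.S_IRUSR | stat.S_IWUSR)  # keep the key private, as Kaggle recommends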


About the data:

  • The following description is taken from https://www.kaggle.com/mathchi/diabetes-data-set
  • This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to predict based on diagnostic measurements whether a patient has diabetes.
  • Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.
  • Number of Instances: 768
  • Pregnancies: Number of times pregnant
  • Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
  • BloodPressure: Diastolic blood pressure (mm Hg)
  • SkinThickness: Triceps skin fold thickness (mm)
  • Insulin: 2-Hour serum insulin (mu U/ml)
  • BMI: Body mass index (weight in kg/(height in m)^2)
  • DiabetesPedigreeFunction: Diabetes pedigree function
  • Age: Age (years)
  • Outcome: Class variable (0 or 1)


df = pd.read_csv('diabetes.csv')
df
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1
... ... ... ... ... ... ... ... ... ...
763 10 101 76 48 180 32.9 0.171 63 0
764 2 122 70 27 0 36.8 0.340 27 0
765 5 121 72 23 112 26.2 0.245 30 0
766 1 126 60 0 0 30.1 0.349 47 1
767 1 93 70 31 0 30.4 0.315 23 0

768 rows × 9 columns
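
The two Outcome classes are not balanced; a quick check of the class counts (a small addition, assuming the column names shown above):

df['Outcome'].value_counts()

On this dataset this should show roughly 500 non-diabetic (0) and 268 diabetic (1) cases.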

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
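
df.info() reports no null values, but as the preview above suggests, several physiological columns contain zeros that are implausible (e.g. an Insulin or SkinThickness of 0) and effectively act as missing values. A small sketch of how one might count them; this check is an addition, not part of the original workflow:

zero_as_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
# Count zeros per column; these likely encode "missing" rather than true measurements.
print((df[zero_as_missing] == 0).sum())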


(2) Dividing the data into train and test datasets

X = df.drop('Outcome', axis=1)   # feature matrix: all columns except the target
X.head()
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age
0 6 148 72 35 0 33.6 0.627 50
1 1 85 66 29 0 26.6 0.351 31
2 8 183 64 0 0 23.3 0.672 32
3 1 89 66 23 94 28.1 0.167 21
4 0 137 40 35 168 43.1 2.288 33

y = df['Outcome']                # target vector
y.head()
0    1
1    0
2    1
3    0
4    1
Name: Outcome, dtype: int64
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
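
Because the classes are imbalanced, one optional variation (not used above) is to pass stratify=y so both splits keep the same 0/1 ratio:

# Stratified variant of the same split; X_train_s etc. are illustrative names.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)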


(3) Applying LDA

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
clf = LinearDiscriminantAnalysis()
clf.fit(X_train, y_train)
LinearDiscriminantAnalysis()
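
For a binary target, LDA fits a single linear discriminant direction; a short sketch of how one might inspect the fitted model using standard scikit-learn LDA attributes:

# Linear decision function: one coefficient per feature, plus an intercept.
print(dict(zip(X.columns, clf.coef_[0])))
print(clf.intercept_)

# Project the training data onto the single discriminant axis (n_classes - 1 = 1).
X_train_lda = clf.transform(X_train)
print(X_train_lda.shape)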

(4) Evaluating classification performance of LDA

from sklearn.metrics import classification_report
pred=clf.predict(X_test)
print(classification_report(y_test, pred))
              precision    recall  f1-score   support

           0       0.79      0.90      0.84        99
           1       0.76      0.56      0.65        55

    accuracy                           0.78       154
   macro avg       0.77      0.73      0.74       154
weighted avg       0.78      0.78      0.77       154
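
Beyond the classification report, a confusion matrix and ROC AUC (computed from predict_proba) give a fuller picture; a minimal sketch:

from sklearn.metrics import confusion_matrix, roc_auc_score

# Rows are the true classes (0, 1), columns the predicted classes (0, 1).
print(confusion_matrix(y_test, pred))

# ROC AUC uses the predicted probability of the positive class.
print(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))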

(5) Applying QDA

from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
clf2 = QuadraticDiscriminantAnalysis()
clf2.fit(X_train, y_train)
QuadraticDiscriminantAnalysis()
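
QDA estimates a separate covariance matrix for each class, which can be unstable when a class has relatively few samples; scikit-learn's QuadraticDiscriminantAnalysis exposes a reg_param argument that regularizes the per-class covariance estimates. A hedged sketch (the value 0.1 is arbitrary, not tuned):

# Regularized QDA: reg_param shrinks the per-class covariance estimates.
clf2_reg = QuadraticDiscriminantAnalysis(reg_param=0.1)
clf2_reg.fit(X_train, y_train)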

(6) Evaluating classification performance of QDA

pred2 = clf2.predict(X_test)
print(classification_report(y_test, pred2))
              precision    recall  f1-score   support

           0       0.78      0.84      0.81        99
           1       0.66      0.56      0.61        55

    accuracy                           0.74       154
   macro avg       0.72      0.70      0.71       154
weighted avg       0.73      0.74      0.74       154
  • In this case, LDA yields slightly better test-set performance than QDA (accuracy 0.78 vs. 0.74); a cross-validated comparison is sketched below.
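
A single 80/20 split can be noisy, so a cross-validated comparison (a sketch, reusing the same X and y as above) is a more robust way to compare the two models:

from sklearn.model_selection import cross_val_score

# Mean accuracy over 5 stratified folds (the default CV for classifiers).
lda_scores = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=5)
qda_scores = cross_val_score(QuadraticDiscriminantAnalysis(), X, y, cv=5)
print(lda_scores.mean(), qda_scores.mean())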

