[LDA & QDA] Practicing LDA and QDA for diabetes classification with Python
(1) Importing modules and dataset
import pandas as pd
import numpy as np
import os
os.environ['KAGGLE_USERNAME']="jisuleeoslo"
os.environ['KAGGLE_KEy']=""
!kaggle datasets download -d mathchi/diabetes-data-set
diabetes-data-set.zip: Skipping, found more recently modified local copy (use --force to force download)
import zipfile
with zipfile.ZipFile("diabetes-data-set.zip","r") as zip_ref:
    zip_ref.extractall()
About data:
- The following description is from https://www.kaggle.com/mathchi/diabetes-data-set
- This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to predict, based on diagnostic measurements, whether a patient has diabetes.
- Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.
- Number of Instances: 768
- Pregnancies: Number of times pregnant
- Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- BloodPressure: Diastolic blood pressure (mm Hg)
- SkinThickness: Triceps skin fold thickness (mm)
- Insulin: 2-Hour serum insulin (mu U/ml)
- BMI: Body mass index (weight in kg/(height in m)^2)
- DiabetesPedigreeFunction: Diabetes pedigree function
- Age: Age (years)
- Outcome: Class variable (0 or 1)
df = pd.read_csv('diabetes.csv')
df
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 | 
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 | 
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 | 
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 | 
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 | 
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | 
| 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 | 
| 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.340 | 27 | 0 | 
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 | 
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 | 
| 767 | 1 | 93 | 70 | 31 | 0 | 30.4 | 0.315 | 23 | 0 | 
768 rows × 9 columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
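It is also worth checking how the Outcome classes are balanced, since this affects how precision and recall should be read later (a quick check):
# Number of patients in each Outcome class (0 = no diabetes, 1 = diabetes)
df['Outcome'].value_counts()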
(2) Dividing the data into train and test sets
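Separate the feature columns from the Outcome target (one straightforward way, consistent with the previews below), then look at X:
X = df.drop('Outcome', axis=1)   # all diagnostic measurements
y = df['Outcome']                # 0/1 diabetes label
X.head()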
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 
y.head()
0    1
1    0
2    1
3    0
4    1
Name: Outcome, dtype: int64
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
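As a side note, passing stratify=y keeps the Outcome ratio identical in the train and test sets; an illustrative variant of the call above (not the split used for the results below):
# Stratified variant: both splits keep the original 0/1 class ratio
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)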
(3) Applying LDA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
clf = LinearDiscriminantAnalysis()
clf.fit(X_train, y_train)
LinearDiscriminantAnalysis()
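The fitted model exposes the estimated class priors, per-class feature means, and the linear decision coefficients, which are worth a quick look (a sketch using the fitted clf and the X defined above):
# Estimated prior probability of each class (0 = no diabetes, 1 = diabetes)
print(clf.priors_)
# Per-class feature means that the linear discriminant is built from
print(pd.DataFrame(clf.means_, columns=X.columns, index=clf.classes_))
# Coefficients of the linear decision function (one row for a binary problem)
print(clf.coef_)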
(4) Evaluating classification performance of LDA
from sklearn.metrics import classification_report
pred=clf.predict(X_test)
print(classification_report(y_test, pred))
              precision    recall  f1-score   support
           0       0.79      0.90      0.84        99
           1       0.76      0.56      0.65        55
    accuracy                           0.78       154
   macro avg       0.77      0.73      0.74       154
weighted avg       0.78      0.78      0.77       154
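To see where the low recall for class 1 comes from, the report can be complemented with a confusion matrix (using the same y_test and pred):
from sklearn.metrics import confusion_matrix
# Rows are true classes (0, 1), columns are predicted classes (0, 1)
print(confusion_matrix(y_test, pred))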
(5) Applying QDA
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
clf2 = QuadraticDiscriminantAnalysis()
clf2.fit(X_train, y_train)
QuadraticDiscriminantAnalysis()
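QDA estimates a separate covariance matrix for each class, which can be unstable when a class has relatively few samples; scikit-learn's reg_param shrinks each per-class covariance toward the identity matrix. An illustrative (untuned) variant:
# reg_param=0.1 is only an illustrative value; it should be tuned, e.g. by cross-validation
clf2_reg = QuadraticDiscriminantAnalysis(reg_param=0.1)
clf2_reg.fit(X_train, y_train)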
(6) Evaluating classification performance of QDA
pred2 = clf2.predict(X_test)
print(classification_report(y_test, pred2))
              precision    recall  f1-score   support
           0       0.78      0.84      0.81        99
           1       0.66      0.56      0.61        55
    accuracy                           0.74       154
   macro avg       0.72      0.70      0.71       154
weighted avg       0.73      0.74      0.74       154
- In this case, LDA yields slightly better performance than QDA on this particular train/test split (a cross-validated comparison is sketched below)
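Since the comparison above rests on a single 80/20 split, a quick 5-fold cross-validated accuracy check on the full data gives a less split-dependent picture (sketch):
from sklearn.model_selection import cross_val_score
for name, model in [("LDA", LinearDiscriminantAnalysis()),
                    ("QDA", QuadraticDiscriminantAnalysis())]:
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(name, round(scores.mean(), 3))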
Reference
- Kaggle Diabetes Data Set: https://www.kaggle.com/mathchi/diabetes-data-set
