[LDA & QDA] Practicing LDA and QDA for diabetes classification with Python
(1) Importing modules and dataset
import pandas as pd
import numpy as np
import os
os.environ['KAGGLE_USERNAME']="jisuleeoslo"
os.environ['KAGGLE_KEY']=""  # paste your own Kaggle API key here
!kaggle datasets download -d mathchi/diabetes-data-set
diabetes-data-set.zip: Skipping, found more recently modified local copy (use --force to force download)
import zipfile
with zipfile.ZipFile("diabetes-data-set.zip","r") as zip_ref:
    zip_ref.extractall()
About data:
- The following description is from https://www.kaggle.com/mathchi/diabetes-data-set
- This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to predict based on diagnostic measurements whether a patient has diabetes.
- Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.
- Number of Instances: 768
- Pregnancies: Number of times pregnant
- Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- BloodPressure: Diastolic blood pressure (mm Hg)
- SkinThickness: Triceps skin fold thickness (mm)
- Insulin: 2-Hour serum insulin (mu U/ml)
- BMI: Body mass index (weight in kg/(height in m)^2)
- DiabetesPedigreeFunction: Diabetes pedigree function
- Age: Age (years)
- Outcome: Class variable (0 or 1)
df = pd.read_csv('diabetes.csv')
df
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
---|---|---|---|---|---|---|---|---|---|
0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 |
764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.340 | 27 | 0 |
765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |
767 | 1 | 93 | 70 | 31 | 0 | 30.4 | 0.315 | 23 | 0 |
768 rows × 9 columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pregnancies 768 non-null int64
1 Glucose 768 non-null int64
2 BloodPressure 768 non-null int64
3 SkinThickness 768 non-null int64
4 Insulin 768 non-null int64
5 BMI 768 non-null float64
6 DiabetesPedigreeFunction 768 non-null float64
7 Age 768 non-null int64
8 Outcome 768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
(2) Dividing the data into train and test sets
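The cell that defines `X` and `y` is not shown in this post. Judging from the outputs that follow (eight feature columns for `X`, the `Outcome` column for `y`), the split was probably something like the sketch below. The names `sample`, `X_sample` and `y_sample` are illustrative only, and the frame is rebuilt from the first five rows printed above so the snippet runs without `diabetes.csv`.

```python
import pandas as pd

# First five rows of diabetes.csv as printed above (stand-in for the full df).
sample = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0],
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "DiabetesPedigreeFunction": [0.627, 0.351, 0.672, 0.167, 2.288],
    "Age": [50, 31, 32, 21, 33],
    "Outcome": [1, 0, 1, 0, 1],
})

# Assumed split: every column except Outcome is a feature, Outcome is the target.
X_sample = sample.drop(columns="Outcome")
y_sample = sample["Outcome"]

print(X_sample.head())
print(y_sample.head())
```

On the full data the same two lines would read `X = df.drop(columns='Outcome')` and `y = df['Outcome']`.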
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
---|---|---|---|---|---|---|---|---|
0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 |
1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 |
2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 |
3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 |
4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 |
y.head()
0 1
1 0
2 1
3 0
4 1
Name: Outcome, dtype: int64
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
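As a quick check of what `test_size=0.2` does with 768 rows, the sketch below runs the same call on placeholder arrays of the same shape. `X_demo` and `y_demo` are made-up stand-ins, not the diabetes data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the diabetes features and target.
X_demo = np.zeros((768, 8))
y_demo = np.zeros(768, dtype=int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# 20% of 768 is 153.6, which scikit-learn rounds up for the test set.
print(X_tr.shape, X_te.shape)  # (614, 8) (154, 8)
```

The 154 test rows match the total support shown in the classification reports. If the class balance should be the same in both parts, `train_test_split` also accepts `stratify=y`; the post does not use it.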
(3) Applying LDA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
clf = LinearDiscriminantAnalysis()
clf.fit(X_train, y_train)
LinearDiscriminantAnalysis()
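For two classes, a fitted LDA model reduces to a linear rule: a weight vector and an intercept, with the predicted class given by the sign of the score. The sketch below illustrates this on synthetic Gaussian data (an assumption for the demo, not the diabetes data) using the fitted `coef_` and `intercept_` attributes.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic two-class data, used only to illustrate the decision rule.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 2)),  # class 0
    rng.normal(2.0, 1.0, size=(100, 2)),  # class 1
])
y_demo = np.array([0] * 100 + [1] * 100)

lda = LinearDiscriminantAnalysis().fit(X_demo, y_demo)

# Linear score: a positive value means class 1, otherwise class 0.
scores = X_demo @ lda.coef_.ravel() + lda.intercept_[0]
manual_pred = (scores > 0).astype(int)

# Fraction of points where the manual rule agrees with lda.predict.
print((manual_pred == lda.predict(X_demo)).mean())
```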
(4) Evaluating classification performance of LDA
from sklearn.metrics import classification_report
pred=clf.predict(X_test)
print(classification_report(y_test, pred))
precision recall f1-score support
0 0.79 0.90 0.84 99
1 0.76 0.56 0.65 55
accuracy 0.78 154
macro avg 0.77 0.73 0.74 154
weighted avg 0.78 0.78 0.77 154
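To see where the numbers in this report come from, the metrics can be recomputed from a confusion matrix. The counts below are not printed anywhere in the notebook; they were back-solved from the rounded recall, precision and support values above, so treat them as an inference.

```python
import numpy as np

# Rows = true class, columns = predicted class.
# Counts inferred from the rounded LDA report (support 99 and 55).
cm = np.array([[89, 10],
               [24, 31]])

correct = cm.diagonal()
precision = correct / cm.sum(axis=0)  # correct / predicted as that class
recall = correct / cm.sum(axis=1)     # correct / actually in that class
f1 = 2 * precision * recall / (precision + recall)
accuracy = correct.sum() / cm.sum()

print("precision", np.round(precision, 2))
print("recall   ", np.round(recall, 2))
print("f1       ", np.round(f1, 2))
print("accuracy ", round(float(accuracy), 2))
```

Under these counts, 24 of the 55 diabetic patients in the test set are missed, which is what the class 1 recall of 0.56 expresses.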
(5) Applying QDA
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
clf2 = QuadraticDiscriminantAnalysis()
clf2.fit(X_train, y_train)
QuadraticDiscriminantAnalysis()
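The difference between the two models is in the covariance estimate: LDA pools one covariance matrix across classes, while QDA fits one per class. A minimal sketch on synthetic data (not the diabetes data) makes this visible through the `covariance_` attribute, which both estimators expose when `store_covariance=True` is passed.

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

# Synthetic data: class 0 is tightly clustered, class 1 is widely spread.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(0.0, 0.5, size=(200, 2)),
    rng.normal(1.0, 2.0, size=(200, 2)),
])
y_demo = np.array([0] * 200 + [1] * 200)

lda = LinearDiscriminantAnalysis(store_covariance=True).fit(X_demo, y_demo)
qda = QuadraticDiscriminantAnalysis(store_covariance=True).fit(X_demo, y_demo)

print(lda.covariance_.shape)               # one pooled matrix
print([c.shape for c in qda.covariance_])  # one matrix per class
print(qda.covariance_[0][0, 0], qda.covariance_[1][0, 0])
```

Because QDA estimates more parameters, it can fit curved boundaries but needs more data per class to do so reliably.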
(6) Evaluating classification performance of QDA
pred2 = clf2.predict(X_test)
print(classification_report(y_test, pred2))
precision recall f1-score support
0 0.78 0.84 0.81 99
1 0.66 0.56 0.61 55
accuracy 0.74 154
macro avg 0.72 0.70 0.71 154
weighted avg 0.73 0.74 0.74 154
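- On this test split, LDA performs slightly better than QDA: accuracy is 0.78 versus 0.74 and precision for class 1 is 0.76 versus 0.66, while recall for class 1 is 0.56 for both.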
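A gap of 0.04 in accuracy on 154 test rows corresponds to about six predictions, so a single split is a thin basis for ranking the two models. One way to get a steadier comparison is k-fold cross-validation. The sketch below shows the pattern on a synthetic stand-in from `make_classification` (same shape as the diabetes data, but not the real data), so the printed scores say nothing about the diabetes task itself.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: 768 rows, 8 features, 2 classes.
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=8, n_redundant=0, random_state=1
)

models = {
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
}

results = {}
for name, model in models.items():
    # Accuracy on each of 5 folds; the spread hints at how stable the score is.
    results[name] = cross_val_score(model, X_demo, y_demo, cv=5)
    print(name, results[name].mean(), results[name].std())
```

With the real data, `X_demo` and `y_demo` would be replaced by `X` and `y`.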
Reference
- Kaggle Diabetes Data Set: https://www.kaggle.com/mathchi/diabetes-data-set