[LDA & QDA] Practicing LDA and QDA for diabetes classification with Python
(1) Importing modules and dataset
import pandas as pd
import numpy as np
import os
os.environ['KAGGLE_USERNAME']="jisuleeoslo"
os.environ['KAGGLE_KEy']=""
!kaggle datasets download -d mathchi/diabetes-data-set
diabetes-data-set.zip: Skipping, found more recently modified local copy (use --force to force download)
import zipfile
with zipfile.ZipFile("diabetes-data-set.zip","r") as zip_ref:
    zip_ref.extractall()
About data:
- The following description is from https://www.kaggle.com/mathchi/diabetes-data-set
- This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to predict, based on diagnostic measurements, whether a patient has diabetes.
- Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.
- Number of Instances: 768
- Pregnancies: Number of times pregnant
- Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- BloodPressure: Diastolic blood pressure (mm Hg)
- SkinThickness: Triceps skin fold thickness (mm)
- Insulin: 2-Hour serum insulin (mu U/ml)
- BMI: Body mass index (weight in kg/(height in m)^2)
- DiabetesPedigreeFunction: Diabetes pedigree function
- Age: Age (years)
- Outcome: Class variable (0 or 1)
df = pd.read_csv('diabetes.csv')
df
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 | 
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 | 
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 | 
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 | 
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 | 
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | 
| 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 | 
| 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.340 | 27 | 0 | 
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 | 
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 | 
| 767 | 1 | 93 | 70 | 31 | 0 | 30.4 | 0.315 | 23 | 0 | 
768 rows × 9 columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
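It is also worth checking how the Outcome classes are balanced, since this affects how precision and recall should be read later (a quick check):
# Number of patients in each Outcome class (0 = no diabetes, 1 = diabetes)
df['Outcome'].value_counts()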
(2) Dividing the data into train and test sets
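Separate the feature columns from the Outcome target (one straightforward way, consistent with the previews below), then look at X:
X = df.drop('Outcome', axis=1)   # all diagnostic measurements
y = df['Outcome']                # 0/1 diabetes label
X.head()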
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 
y.head()
0    1
1    0
2    1
3    0
4    1
Name: Outcome, dtype: int64
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
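As a side note, passing stratify=y keeps the Outcome ratio identical in the train and test sets; an illustrative variant of the call above (not the split used for the results below):
# Stratified variant: both splits keep the original 0/1 class ratio
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)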
(3) Applying LDA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
clf = LinearDiscriminantAnalysis()
clf.fit(X_train, y_train)
LinearDiscriminantAnalysis()
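The fitted model exposes the estimated class priors, per-class feature means, and the linear decision coefficients, which are worth a quick look (a sketch using the fitted clf and the X defined above):
# Estimated prior probability of each class (0 = no diabetes, 1 = diabetes)
print(clf.priors_)
# Per-class feature means that the linear discriminant is built from
print(pd.DataFrame(clf.means_, columns=X.columns, index=clf.classes_))
# Coefficients of the linear decision function (one row for a binary problem)
print(clf.coef_)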
(4) Evaluating classification performance of LDA
from sklearn.metrics import classification_report
pred=clf.predict(X_test)
print(classification_report(y_test, pred))
              precision    recall  f1-score   support
           0       0.79      0.90      0.84        99
           1       0.76      0.56      0.65        55
    accuracy                           0.78       154
   macro avg       0.77      0.73      0.74       154
weighted avg       0.78      0.78      0.77       154
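To see where the low recall for class 1 comes from, the report can be complemented with a confusion matrix (using the same y_test and pred):
from sklearn.metrics import confusion_matrix
# Rows are true classes (0, 1), columns are predicted classes (0, 1)
print(confusion_matrix(y_test, pred))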
(5) Applying QDA
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
clf2 = QuadraticDiscriminantAnalysis()
clf2.fit(X_train, y_train)
QuadraticDiscriminantAnalysis()
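QDA estimates a separate covariance matrix for each class, which can be unstable when a class has relatively few samples; scikit-learn's reg_param shrinks each per-class covariance toward the identity matrix. An illustrative (untuned) variant:
# reg_param=0.1 is only an illustrative value; it should be tuned, e.g. by cross-validation
clf2_reg = QuadraticDiscriminantAnalysis(reg_param=0.1)
clf2_reg.fit(X_train, y_train)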
(6) Evaluating classification performance of QDA
pred2 = clf2.predict(X_test)
print(classification_report(y_test, pred2))
              precision    recall  f1-score   support
           0       0.78      0.84      0.81        99
           1       0.66      0.56      0.61        55
    accuracy                           0.74       154
   macro avg       0.72      0.70      0.71       154
weighted avg       0.73      0.74      0.74       154
- In this case, LDA yields slightly better performance than QDA on this particular train/test split (a cross-validated comparison is sketched below)
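Since the comparison above rests on a single 80/20 split, a quick 5-fold cross-validated accuracy check on the full data gives a less split-dependent picture (sketch):
from sklearn.model_selection import cross_val_score
for name, model in [("LDA", LinearDiscriminantAnalysis()),
                    ("QDA", QuadraticDiscriminantAnalysis())]:
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(name, round(scores.mean(), 3))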
Reference
- Kaggle Diabetes Data Set: https://www.kaggle.com/mathchi/diabetes-data-set
