Heart Disease Prediction and Analysis

Introduction

The Heart Failure Prediction Dataset contains 918 patient records with 11 clinical features and one binary label: HeartDisease. The goal is to use these features to predict susceptibility to heart failure.

Feature Overview

Features:

Age: Age in years
Sex: M (Male), F (Female)
ChestPainType: TA (typical angina), ATA (atypical angina), NAP (non-anginal pain), ASY (asymptomatic)
RestingBP: Resting blood pressure (mm Hg)
Cholesterol: Serum cholesterol (mg/dl)
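FastingBS: 1 if fasting blood sugar > 120 mg/dl, else 0
RestingECG: Normal, ST (ST-T wave abnormality), LVH (left ventricular hypertrophy)
MaxHR: Maximum heart rate achieved (60–202)
ExerciseAngina: Y (Yes), N (No)
Oldpeak: ST depression induced by exercise relative to rest
ST_Slope: Slope of the peak exercise ST segment (Up, Flat, Down)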

Label:

HeartDisease: 1 = heart disease, 0 = normal

Descriptive Statistics

Categorical and Binary Variables

Variable	Type	Proportion
Sex	Binary	M = 79%, F = 21%
FastingBS	Binary	0 = 77%, 1 = 23%
ExerciseAngina	Binary	Y = 40%, N = 60%
HeartDisease	Binary	0 = 45%, 1 = 55%
ChestPainType	Categorical	ASY = 54%, ATA = 19%, NAP = 22%, TA = 5%
RestingECG	Categorical	LVH = 20%, Normal = 60%, ST = 20%
ST_Slope	Categorical	Down = 7%, Flat = 50%, Up = 43%

Numeric Variables

Variable	Type	Mean	Median	SD	Min	Max	Q1	Q3	IQR
Age	Integer	53.51	54	9.43	28	77	47	60	13
RestingBP	Integer	132.4	130	18.51	0	200	120	140	20
Cholesterol	Integer	198.8	223	109.4	0	603	173.25	267	93.75
MaxHR	Integer	136.81	138	25.46	60	202	120	156	36
Oldpeak	Continuous	0.89	0.6	1.07	-2.6	6.2	0	1.5	1.5
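
The columns of the numeric table can be reproduced with a few lines of NumPy. The sketch below is illustrative only: the function name `summarize` and the toy values are hypothetical, and the use of the sample standard deviation (ddof=1) and linear-interpolation quartiles is an assumption about how the table was computed.

```python
import numpy as np

def summarize(values):
    """Return the summary statistics shown in the numeric table for one column."""
    x = np.asarray(values, dtype=float)
    # np.percentile interpolates linearly between order statistics by default.
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    return {
        "mean": x.mean(),
        "median": median,
        "sd": x.std(ddof=1),  # sample standard deviation (assumed convention)
        "min": x.min(),
        "max": x.max(),
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
    }

# Hypothetical toy values, not taken from the dataset.
toy_summary = summarize([1, 2, 3, 4, 5])
```

A minimum of 0 in such a summary, as seen for Cholesterol, is the kind of value that signals invalid entries worth checking before modeling.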

Data Cleaning

KNN Imputation

Cholesterol had no missing values but contained invalid zeros. We used KNN imputation to replace these, retaining a binary column MissingCholesterolNum to indicate where zeros originally appeared.

Before Imputation
Figure: Cholesterol distribution before KNN imputation

After Imputation
Figure: Cholesterol distribution after KNN imputation
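
The imputation step could look roughly like the sketch below. It is a simplified NumPy version, not the project's actual code: the choice of neighbor features, the value of k, and the function name `knn_impute_zeros` are all assumptions. scikit-learn's `KNNImputer` is a common alternative once the zeros have been marked as missing.

```python
import numpy as np

def knn_impute_zeros(target, features, k=5):
    """Replace zeros in `target` with the mean target of the k nearest valid rows.

    Distances are Euclidean on standardized `features`.
    Returns (imputed_target, was_zero_flag).
    """
    target = np.asarray(target, dtype=float)
    features = np.asarray(features, dtype=float)
    was_zero = target == 0

    # Standardize so that no single feature dominates the distance.
    std = features.std(axis=0)
    std[std == 0] = 1.0
    z = (features - features.mean(axis=0)) / std

    donors = np.flatnonzero(~was_zero)  # rows with a valid target value
    imputed = target.copy()
    for i in np.flatnonzero(was_zero):
        dist = np.linalg.norm(z[donors] - z[i], axis=1)
        nearest = donors[np.argsort(dist)[:k]]
        imputed[i] = target[nearest].mean()

    # The flag plays the role of the MissingCholesterolNum indicator column.
    return imputed, was_zero.astype(int)

# Hypothetical toy data: one feature (age) and a cholesterol column with one zero.
age = [[40], [41], [42], [60], [61]]
chol = [200, 210, 0, 300, 310]
chol_imputed, missing_flag = knn_impute_zeros(chol, age, k=2)
```

In the toy data the zero in the third row is replaced by the average of its two nearest neighbors by age, and the flag records where the replacement happened.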

Categorical Encoding

One-hot encoding for RestingECG and ExerciseAngina
Ordinal encoding for ST_Slope
Removed one dummy per encoded group to prevent multicollinearity
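
A minimal pandas sketch of these encoding steps follows. The toy rows are hypothetical, and the integer order chosen for ST_Slope (Down < Flat < Up) is an assumption, since the exact mapping used is not stated here.

```python
import pandas as pd

# Hypothetical toy rows using the category labels from the feature overview.
df = pd.DataFrame({
    "RestingECG": ["Normal", "ST", "LVH", "Normal"],
    "ExerciseAngina": ["N", "Y", "N", "Y"],
    "ST_Slope": ["Up", "Flat", "Down", "Flat"],
})

# Ordinal encoding for ST_Slope; this particular order is an assumption.
slope_order = {"Down": 0, "Flat": 1, "Up": 2}
df["ST_Slope"] = df["ST_Slope"].map(slope_order)

# One-hot encode the remaining columns and drop the first level of each group,
# so that the dummies of a group are not perfectly collinear with the intercept.
encoded = pd.get_dummies(
    df,
    columns=["RestingECG", "ExerciseAngina"],
    drop_first=True,
    dtype=int,
)
```

With `drop_first=True` the dropped level (here LVH and N) becomes the reference category against which the remaining coefficients are interpreted.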

Hypothesis Testing

ANOVA on Cholesterol

ANOVA across ChestPainType:

F = 1.95
p = 0.119

→ With p > 0.05, we fail to reject the null hypothesis: mean cholesterol does not differ significantly across chest pain types.
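
For reference, the F statistic of a one-way ANOVA can be computed by hand as the ratio of between-group to within-group mean squares. The sketch below uses hypothetical toy groups, not the dataset, and the function name `one_way_anova_f` is made up for illustration. `scipy.stats.f_oneway` returns the same statistic together with a p-value.

```python
import numpy as np

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()

    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean.
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical toy groups standing in for cholesterol by chest pain type.
f_stat, df_b, df_w = one_way_anova_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
```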

Figure: ANOVA boxplot

Pairwise Proportion Tests

Tested binary features vs heart disease prevalence using:

Estimator: p̂₀ - p̂₁
H₀: p₀ = p₁
H₁: p₀ ≠ p₁

Figure: Proportion test results
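
One common way to carry out such a test is the pooled two-proportion z-test, sketched below with the standard library only. Whether this exact test was used is an assumption; the function name and the counts in the example are hypothetical.

```python
import math

def two_proportion_z_test(x0, n0, x1, n1):
    """Two-sided pooled z-test for H0: p0 = p1.

    x0, x1 are the counts of positive cases; n0, n1 are the group sizes.
    Returns (difference in sample proportions, z statistic, p-value).
    """
    p0_hat, p1_hat = x0 / n0, x1 / n1
    # Under H0 both groups share one proportion, estimated by pooling.
    pooled = (x0 + x1) / (n0 + n1)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n0 + 1 / n1))
    z = (p0_hat - p1_hat) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p0_hat - p1_hat, z, p_value

# Hypothetical counts: 30 of 100 cases in group 0 versus 50 of 100 in group 1.
diff, z, p_value = two_proportion_z_test(30, 100, 50, 100)
```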

Model Development

We trained logistic regression using:

$$
P(y=1 \mid \mathbf{x}) = \frac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x} + b)}}
$$
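
As a small illustration of the formula above, the predicted probability is the sigmoid of the linear score w·x + b. The weights and inputs below are hypothetical and are not the fitted coefficients of any model in this project.

```python
import numpy as np

def predict_proba(X, w, b):
    """Return P(y=1 | x) for each row of X under a logistic regression model."""
    z = np.asarray(X, dtype=float) @ np.asarray(w, dtype=float) + b
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and two input rows.
w = [1.0, 0.5]
b = 0.0
X = [[0.0, 0.0],   # linear score 0
     [1.0, 2.0]]   # linear score 2
probs = predict_proba(X, w, b)
```

A linear score of 0 maps to a probability of exactly 0.5, which is why 0 on the logit scale corresponds to the default classification threshold.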

Models Trained

Entire Features Model
Pruned Features Model
Pruned Features + Outliers Removed

Assumptions Checked

✅ Independence of observations
✅ Linearity in logit
✅ No multicollinearity (handled with PCA on correlated features)
✅ No influential outliers (observations with Cook's distance > 4/n were removed)

Figure: VIF chart
Figure: Cook's distance
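
The two diagnostics can be sketched as follows. The VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the others. The Cook's distance helper only applies the 4/n cutoff and assumes the distances have already been computed by the fitting library. Function names and toy values are hypothetical.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of the feature matrix X."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        # Regress column j on an intercept plus all remaining columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - resid.var() / y.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

def flag_influential(cooks_d):
    """Boolean mask of observations whose Cook's distance exceeds 4/n."""
    cooks_d = np.asarray(cooks_d, dtype=float)
    return cooks_d > 4.0 / len(cooks_d)

# Hypothetical toy features: two strongly correlated columns.
vif_values = vif([[1, 1], [2, 2], [3, 3], [4, 4], [5, 6]])

# Hypothetical precomputed Cook's distances for n = 8 observations (cutoff 0.5).
influential = flag_influential([0.01, 0.9, 0.02, 0.03, 0.01, 0.02, 0.04, 0.05])
```

Large VIF values, as in the toy example, indicate features that are nearly linear combinations of the others.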

Model Results

All models performed well
The Pruned Features + Outliers Removed model had the best balance of precision and recall
Most significant features:
- Sex, FastingBS, ExerciseAngina, Oldpeak
- ST_Slope, ChestPainType_ATA, ChestPainType_TA
- MissingCholesterolNum, Constant (intercept)

Figure: Model coefficients
Figure: Model metrics
Figure: ROC curve
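
For readers who want to recompute the comparison, precision and recall follow directly from the confusion counts. The sketch below also reports F1 as one common summary of the precision/recall balance. That choice is an assumption, and the labels in the example are hypothetical rather than model output.

```python
def classification_metrics(y_true, y_pred):
    """Return precision, recall, F1 and accuracy for binary labels (1 = heart disease)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical true and predicted labels for eight patients.
metrics = classification_metrics(
    [1, 1, 1, 0, 0, 0, 1, 0],
    [1, 1, 0, 0, 0, 1, 1, 0],
)
```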

Conclusion

This project demonstrated a strong pipeline for heart disease prediction via logistic regression. Key takeaways:

KNN imputation replaced the invalid zero values in Cholesterol
Statistical testing helped evaluate relationships
Feature pruning and outlier removal gave the best balance of precision and recall
Logistic regression is a strong baseline with interpretable results

Visualizations

Continuous Features

MaxHR, Oldpeak, Age, RestingBP, Cholesterol
(See histograms and boxplots)

Categorical Features

ChestPainType, ExerciseAngina, FastingBS, ST_Slope, RestingECG, Sex

Label

Heart Disease Distribution