Heart Disease Prediction and Analysis

December 5, 2024

Introduction

The Heart Failure Prediction Dataset contains 918 patient records with 11 clinical features and one binary label: HeartDisease. The goal is to use these features to predict susceptibility to heart failure.

Feature Overview

Features:

  • Age: Age in years
  • Sex: M (Male), F (Female)
  • ChestPainType: TA, ATA, NAP, ASY
  • RestingBP: Resting blood pressure (mm Hg)
  • Cholesterol: Serum cholesterol (mg/dl)
  • FastingBS: 1 if fasting blood sugar > 120 mg/dl, else 0
  • RestingECG: Normal, ST, LVH
  • MaxHR: Max heart rate achieved (60–202)
  • ExerciseAngina: Y (Yes), N (No)
  • Oldpeak: ST depression (numeric value)
  • ST_Slope: Up, Flat, Down

Label:

  • HeartDisease: 1 = heart disease, 0 = normal

Descriptive Statistics

Categorical and Binary Variables

Variable        Type         Proportion
Sex             Binary       M = 79%, F = 21%
FastingBS       Binary       0 = 77%, 1 = 23%
ExerciseAngina  Binary       Y = 40%, N = 60%
HeartDisease    Binary       0 = 45%, 1 = 55%
ChestPainType   Categorical  ASY = 54%, ATA = 19%, NAP = 22%, TA = 5%
RestingECG      Categorical  LVH = 20%, Normal = 60%, ST = 20%
ST_Slope        Categorical  Down = 7%, Flat = 50%, Up = 43%

Numeric Variables

Variable     Type        Mean    Median  SD     Min   Max  Q1      Q3   IQR
Age          Integer     53.51   54      9.43   28    77   47      60   13
RestingBP    Integer     132.4   130     18.51  0     200  120     140  20
Cholesterol  Integer     198.8   223     109.4  0     603  173.25  267  93.75
MaxHR        Integer     136.81  138     25.46  60    202  120     156  36
Oldpeak      Continuous  0.89    0.6     1.07   -2.6  6.2  0       1.5  1.5
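
The summary columns above can be reproduced per variable with Python's standard library. The sketch below uses a small made-up sample rather than the real data. It assumes the quartiles were computed with linear interpolation between data points (the "inclusive" method) and that SD is the sample standard deviation; the source states neither.

```python
import statistics

def summarize(values):
    """Return the summary statistics reported in the numeric table."""
    # Assumption: quartiles use linear interpolation ("inclusive" method).
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "median": median,
        "sd": statistics.stdev(values),  # assumption: sample SD (n - 1)
        "min": min(values),
        "max": max(values),
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,  # interquartile range
    }

# Toy ages, not taken from the dataset.
toy_ages = [28, 47, 54, 60, 77]
summary = summarize(toy_ages)
```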

Data Cleaning

KNN Imputation

Cholesterol had no missing values but contained invalid zeros. We used KNN imputation to replace these, retaining a binary column MissingCholesterolNum to indicate where zeros originally appeared.

Before Imputation
Figure: Cholesterol distribution before KNN imputation

After Imputation
Figure: Cholesterol distribution after KNN imputation
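
A minimal sketch of the imputation step described above, using scikit-learn's KNNImputer. The number of neighbors, the choice of numeric columns used to find neighbors, and the toy rows are all assumptions; the source does not state them. Note that KNNImputer measures distance on the raw feature scale, so scaling the columns first may be worth considering.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Assumption: neighbors are found using these numeric columns.
NUMERIC_COLS = ["Age", "RestingBP", "Cholesterol", "MaxHR", "Oldpeak"]

def impute_zero_cholesterol(df, n_neighbors=5):
    """Flag zero Cholesterol values, then replace them via KNN imputation."""
    out = df.copy()
    # Record which rows originally held an invalid zero.
    out["MissingCholesterolNum"] = (out["Cholesterol"] == 0).astype(int)
    # KNNImputer only fills NaN, so mark the zeros as missing first.
    numeric = out[NUMERIC_COLS].astype(float)
    numeric.loc[numeric["Cholesterol"] == 0, "Cholesterol"] = np.nan
    imputer = KNNImputer(n_neighbors=n_neighbors)
    out[NUMERIC_COLS] = pd.DataFrame(
        imputer.fit_transform(numeric), columns=NUMERIC_COLS, index=out.index
    )
    return out

# Toy rows, not taken from the dataset.
toy = pd.DataFrame({
    "Age": [40, 49, 37, 48, 54, 39],
    "RestingBP": [140, 160, 130, 138, 150, 120],
    "Cholesterol": [289, 180, 0, 214, 0, 339],
    "MaxHR": [172, 156, 98, 108, 122, 170],
    "Oldpeak": [0.0, 1.0, 0.0, 1.5, 0.0, 0.0],
})
cleaned = impute_zero_cholesterol(toy, n_neighbors=2)
```

Each imputed value is the average Cholesterol of the nearest rows that have a valid measurement, so it always falls within the range of the observed non-zero values.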

Categorical Encoding

  • One-hot encoding for RestingECG and ExerciseAngina
  • Ordinal encoding for ST_Slope
  • Removed one dummy per encoded group to prevent multicollinearity
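
The encoding steps above can be sketched with pandas as follows. The numeric order assigned to ST_Slope (Down < Flat < Up) is an assumption, since the source does not give the mapping, and the rows are toy data.

```python
import pandas as pd

# Assumed ordering for ST_Slope; the source does not state the mapping used.
ST_SLOPE_ORDER = {"Down": 0, "Flat": 1, "Up": 2}

def encode_categoricals(df):
    """Ordinal-encode ST_Slope and one-hot encode RestingECG and ExerciseAngina."""
    out = df.copy()
    out["ST_Slope"] = out["ST_Slope"].map(ST_SLOPE_ORDER)
    # drop_first=True removes one dummy per group to avoid perfect collinearity.
    return pd.get_dummies(
        out,
        columns=["RestingECG", "ExerciseAngina"],
        drop_first=True,
        dtype=int,
    )

# Toy rows, not taken from the dataset.
toy = pd.DataFrame({
    "RestingECG": ["Normal", "ST", "LVH", "Normal"],
    "ExerciseAngina": ["N", "Y", "N", "Y"],
    "ST_Slope": ["Up", "Flat", "Down", "Flat"],
})
encoded = encode_categoricals(toy)
```

With drop_first=True, pandas drops the first level of each group in sorted order, so LVH and N become the implicit reference categories here.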

Hypothesis Testing

ANOVA on Cholesterol

ANOVA across ChestPainType:

  • F = 1.95
  • p = 0.119

→ With p = 0.119, we fail to reject the null hypothesis of equal mean cholesterol across chest pain types at the conventional 0.05 level.

Figure: Boxplot of Cholesterol by ChestPainType
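
A one-way ANOVA like the one above can be run with SciPy's f_oneway. This is a sketch on made-up cholesterol values, so its F and p will not match the reported figures; the 0.05 significance level is also an assumption.

```python
from scipy.stats import f_oneway

# Toy cholesterol values (mg/dl) per chest pain type, not taken from the dataset.
groups = {
    "ASY": [230, 245, 260, 210, 238],
    "ATA": [225, 250, 240, 215, 233],
    "NAP": [220, 255, 242, 228, 236],
    "TA": [235, 248, 226, 219, 241],
}

# One-way ANOVA: H0 is that all group means are equal.
f_stat, p_value = f_oneway(*groups.values())

alpha = 0.05  # assumed significance level
reject_null = p_value < alpha
```

On the real data, the groups would come from splitting the Cholesterol column by ChestPainType, for example with a pandas groupby.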

Pairwise Proportion Tests

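For each binary feature, we tested whether heart disease prevalence differs between the feature's two groups. Here p₀ and p₁ denote the prevalence in each group, and p̂₀ and p̂₁ are their sample estimates:

Estimator: p̂₀ - p̂₁
H₀: p₀ = p₁
H₁: p₀ ≠ p₁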

Figure: Pairwise proportion test results
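
One common way to test these hypotheses is the pooled two-proportion z-test, sketched below with the standard library only. The source does not say which test statistic was used, so treat this as an illustration; the counts are hypothetical.

```python
import math

def two_proportion_ztest(successes0, n0, successes1, n1):
    """Two-sided pooled z-test for H0: p0 = p1.

    Returns (estimated difference, z statistic, p-value).
    """
    p0_hat = successes0 / n0
    p1_hat = successes1 / n1
    # Under H0 both groups share one proportion, estimated by pooling.
    pooled = (successes0 + successes1) / (n0 + n1)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n0 + 1 / n1))
    z = (p0_hat - p1_hat) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p0_hat - p1_hat, z, p_value

# Hypothetical counts: 30 of 100 cases in group 0, 50 of 100 in group 1.
diff, z, p_value = two_proportion_ztest(30, 100, 50, 100)
```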

Model Development

We trained logistic regression models, which estimate the probability of heart disease given a feature vector x as:

$$P(y=1 \mid \mathbf{x}) = \frac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x} + b)}}$$
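
To make the formula concrete, here is a small sketch that evaluates it for a single feature vector. The weights and intercept are hypothetical, not fitted values from this project.

```python
import math

def predict_proba(w, x, b):
    """P(y = 1 | x) for logistic regression with weights w and intercept b."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    # Branch on the sign of z so exp() never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Hypothetical weights and intercept.
w = [0.8, -0.5]
b = 0.1
p = predict_proba(w, [1.0, 2.0], b)  # linear score is about -0.1
```

A linear score of zero gives a probability of exactly 0.5, which is why 0.5 is a natural default decision threshold.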

Models Trained

  1. Full model (all features)
  2. Pruned model (reduced feature set)
  3. Pruned model with outliers removed

Assumptions Checked

  • ✅ Independence of observations
  • ✅ Linearity in logit
  • ✅ No multicollinearity (correlated features combined with PCA)
  • ✅ No overly influential observations (removed points with Cook's distance > 4/n, where n is the number of observations)

Figure: VIF chart
Figure: Cook's distance plot

Model Results

  • All three models performed well
  • The pruned model with outliers removed had the best balance of precision and recall
  • Most significant terms:
    • Sex, FastingBS, ExerciseAngina, Oldpeak
    • ST_Slope, ChestPainType_ATA, ChestPainType_TA
    • MissingCholesterolNum, Constant (the intercept)

Figure: Model coefficients
Figure: Model metrics
Figure: ROC curve
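
Precision, recall, and ROC AUC for a logistic regression model can be computed along these lines with scikit-learn. The data are a synthetic stand-in for the encoded features, and the 80/20 split and 0.5 threshold are assumptions, so the resulting numbers say nothing about the models above. Note that scikit-learn's LogisticRegression applies L2 regularization by default.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded feature matrix and HeartDisease label.
X, y = make_classification(
    n_samples=918, n_features=11, n_informative=5, random_state=0
)
# Assumed 80/20 split, stratified to preserve the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Predicted probability of the positive class, then an assumed 0.5 threshold.
y_prob = model.predict_proba(X_test)[:, 1]
y_pred = (y_prob >= 0.5).astype(int)

metrics = {
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, y_prob),  # uses probabilities, not labels
}
```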

Conclusion

This project demonstrated a strong pipeline for heart disease prediction via logistic regression. Key takeaways:

  • KNN imputation replaced the invalid zero values in Cholesterol
  • Hypothesis tests (ANOVA and pairwise proportion tests) helped evaluate relationships between variables
  • Feature selection and outlier removal improved model performance
  • Logistic regression is a strong baseline with interpretable results

Visualizations

Continuous Features

MaxHR, Oldpeak, Age, RestingBP, Cholesterol
(See histograms and boxplots)

Categorical Features

ChestPainType, ExerciseAngina, FastingBS, ST_Slope, RestingECG, Sex

Label

Heart Disease Distribution