
$ cat cardiovascular-disease-prediction.md

Cardiovascular Disease Prediction

[github] [api docs] live

# Overview

A machine learning project that predicts cardiovascular disease from patient health data, with a focus on model interpretability and clinical relevance. The model is trained on 70,000 patient records from the Kaggle Cardiovascular Disease Dataset.

# Tech Stack

-> Python

-> XGBoost

-> scikit-learn

-> SHAP

-> FastAPI

-> pandas

# Key Findings

-> Systolic blood pressure is the dominant predictor (0.43 correlation with the target, 55% of model feature importance)

-> Age and cholesterol are the next most impactful features

-> Lifestyle factors (smoking, alcohol, activity) showed weak predictive power, likely because their effects are mediated through blood pressure

-> 73% accuracy aligns with published benchmarks for this dataset

# Model Performance

| Model   | Accuracy | Precision | Recall |
|---------|----------|-----------|--------|
| XGBoost | 73%      | 0.75      | 0.69   |
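For readers unfamiliar with the three columns, here is a minimal sketch of how accuracy, precision, and recall are derived from binary predictions. The labels are toy values chosen for illustration, not the project's test set, and the helper name `classification_metrics` is hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary labels (1 = disease)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    # Precision: of the patients flagged as diseased, how many really are.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the patients who are diseased, how many were flagged.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Toy labels for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
accuracy, precision, recall = classification_metrics(y_true, y_pred)
print(accuracy, precision, recall)
```

A recall below precision, as in the table above, means the model misses more true cases than it raises false alarms.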

# Try It

Send patient data to the live API's /predict endpoint to get a cardiovascular disease risk prediction.

$ curl -X POST /predict -d '{...}'
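The same request can be sketched in Python. Only the `POST /predict` route comes from the command above. The base URL, the helper name `build_predict_request`, and the payload field names (borrowed from the Kaggle dataset's columns) are assumptions, so check the API docs for the real schema.

```python
import json
import urllib.request

# Hypothetical base URL; replace with the real host of the live API.
BASE_URL = "http://localhost:8000"


def build_predict_request(patient, base_url=BASE_URL):
    """Build (but do not send) a JSON POST request to the /predict route."""
    body = json.dumps(patient).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed field names and units; the actual request schema may differ.
patient = {
    "age": 55,
    "gender": 1,
    "height": 170,
    "weight": 80,
    "ap_hi": 150,
    "ap_lo": 95,
    "cholesterol": 2,
    "gluc": 1,
    "smoke": 0,
    "alco": 0,
    "active": 1,
}

request = build_predict_request(patient)

# To actually send it (requires a reachable server):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```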

# Analysis

## Correlation Heatmap

Correlation heatmap showing relationships between patient features

Systolic blood pressure (ap_hi) shows the strongest correlation with cardiovascular disease (0.43), followed by diastolic BP (ap_lo, 0.34) and age (0.24). Lifestyle factors like smoking and alcohol show surprisingly weak direct correlations, likely because their effects are mediated through blood pressure. Weight and bmi are largely redundant: they correlate at 0.86, so BMI already carries most of the information in weight.
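To make the heatmap entries concrete, the sketch below computes a Pearson correlation by hand and applies it to weight versus BMI. The five patients are made-up values, not rows from the dataset, and the units (kg and cm) are an assumption about how the columns are stored.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def bmi(weight_kg, height_cm):
    """Body mass index: weight in kg divided by height in metres squared."""
    return weight_kg / (height_cm / 100) ** 2


# Made-up patients for illustration only.
heights = [160, 165, 170, 175, 180]
weights = [55, 80, 62, 95, 70]
bmis = [bmi(w, h) for w, h in zip(weights, heights)]

r = pearson(weights, bmis)
print(f"weight vs bmi correlation: {r:.2f}")
```

Because BMI is computed directly from weight, a high correlation between the two is expected, which is why keeping both adds little new information for the model.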

## Features vs Target

Feature distributions compared against cardiovascular disease target

Patients with cardiovascular disease show higher systolic blood pressure (median ~130 vs ~115), are slightly older (median ~55 vs ~52), and have marginally higher BMI. However, the two groups overlap substantially on every feature, so no single feature cleanly separates healthy from at-risk patients. This overlap helps explain why accuracy plateaus around 73%.
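The effect of overlap on accuracy can be illustrated with synthetic data. The sketch below draws two groups of systolic readings centred on the medians quoted above and finds the best single-threshold rule. The Gaussian shape, the standard deviation of 17, and the sample size are assumptions made for illustration, not properties measured on the dataset.

```python
import random


def best_threshold_accuracy(healthy, diseased):
    """Best accuracy of the rule 'predict disease if value >= threshold'."""
    candidates = sorted(set(healthy) | set(diseased))
    total = len(healthy) + len(diseased)
    best = 0.0
    for threshold in candidates:
        correct = sum(v < threshold for v in healthy)
        correct += sum(v >= threshold for v in diseased)
        best = max(best, correct / total)
    return best


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Synthetic readings: centres from the text, spread is an assumption.
healthy = [round(rng.gauss(115, 17)) for _ in range(1000)]
diseased = [round(rng.gauss(130, 17)) for _ in range(1000)]

accuracy = best_threshold_accuracy(healthy, diseased)
print(f"best single-feature accuracy: {accuracy:.2f}")
```

Even the best cut-off misclassifies a sizeable share of patients when the distributions overlap this much, which is why the model has to combine several features to do better.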

## SHAP Feature Importance

SHAP summary plot showing feature importance and impact direction

SHAP analysis reveals that high systolic blood pressure is the strongest driver of cardiovascular risk predictions, with values above 140 mmHg strongly pushing the model toward a positive prediction. Age and cholesterol follow as the next most impactful features. Notably, physical activity, smoking, and alcohol have minimal direct impact on predictions. This is consistent with the hypothesis that their effects are already captured by blood pressure and cholesterol levels.
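SHAP attributions are based on Shapley values, and the idea can be shown exactly on a tiny example. The sketch below is not the `shap` library call used in the project. It computes exact Shapley values against a single reference patient for a hypothetical scoring rule (`toy_risk`), with assumed feature names and thresholds.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values of each feature, relative to one baseline row."""
    names = list(instance)
    n = len(names)

    def value(subset):
        # Features in `subset` take the instance's value, others the baseline's.
        mixed = {k: (instance[k] if k in subset else baseline[k]) for k in names}
        return predict(mixed)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):  # coalition sizes 0 .. n-1
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                total += weight * (value(coalition | {name}) - value(coalition))
        phi[name] = total
    return phi


def toy_risk(p):
    """Hypothetical scoring rule, NOT the trained XGBoost model."""
    score = 0.1
    if p["ap_hi"] > 140:
        score += 0.4
    if p["age_years"] > 55:
        score += 0.1
    if p["ap_hi"] > 140 and p["cholesterol"] > 1:
        score += 0.2  # interaction: cholesterol only matters with high BP here
    return score


patient = {"ap_hi": 160, "age_years": 60, "cholesterol": 3}
reference = {"ap_hi": 120, "age_years": 45, "cholesterol": 1}

phi = shapley_values(toy_risk, patient, reference)
for feature, contribution in phi.items():
    print(f"{feature}: {contribution:+.2f}")
```

The contributions sum to the gap between the patient's score and the reference score, and the interaction term is split evenly between the two features involved. Exact enumeration grows exponentially with the number of features, which is why tree-specific algorithms are used in practice.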
