$ cat cardiovascular-disease-prediction.md
Cardiovascular Disease Prediction
# Overview
A machine learning project that predicts cardiovascular disease using patient health data, with a focus on model interpretability and clinical relevance. Trained on 70,000 patient records from the Kaggle Cardiovascular Disease Dataset.
# Tech Stack
-> Python
-> XGBoost
-> scikit-learn
-> SHAP
-> FastAPI
-> pandas
# Key Findings
-> Systolic blood pressure is the dominant predictor (0.43 correlation, 55% feature importance)
-> Age and cholesterol are the next most impactful features
-> Lifestyle factors (smoking, alcohol, activity) show weak predictive power; their effects are likely already mediated through blood pressure
-> 73% accuracy aligns with published benchmarks for this dataset
# Model Performance
# Try It
Send patient data to the live API to get a cardiovascular disease risk prediction.
$ curl -X POST /predict -d '{...}'
200 OK — response:
risk_level:
risk_score:
prediction:
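A Python equivalent of the curl call might look like the sketch below. The base URL and every payload field are assumptions: the field names are borrowed from the Kaggle dataset's columns, and the API's actual request schema is not shown here, so check it before relying on this.

```python
import json
import urllib.request


def build_predict_request(base_url, patient):
    """Build (but do not send) a JSON POST request to the /predict endpoint."""
    body = json.dumps(patient).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical payload: names follow the Kaggle dataset columns and may
# differ from what the API actually expects.
example_patient = {
    "age": 55,         # years (assumed unit)
    "ap_hi": 140,      # systolic blood pressure, mmHg
    "ap_lo": 90,       # diastolic blood pressure, mmHg
    "cholesterol": 2,  # dataset category code
}

# Placeholder host; replace with the real API address.
request = build_predict_request("http://localhost:8000", example_patient)

# To actually send it:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)  # keys shown above: risk_level, risk_score, prediction
```

Building the request separately from sending it keeps the payload easy to inspect before it goes to a live service.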
# Analysis
## Correlation Heatmap
Systolic blood pressure (ap_hi) shows the strongest correlation with cardiovascular disease (0.43), followed by diastolic BP (ap_lo, 0.34) and age (0.24). Lifestyle factors like smoking and alcohol show surprisingly weak direct correlations; their effects are likely mediated through blood pressure. The features weight and bmi are largely redundant: they correlate at 0.86, so BMI already carries most of the weight information.
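The weight/bmi redundancy follows from how BMI is defined. The sketch below derives BMI from height and weight and computes a Pearson correlation by hand; the sample values are made up for illustration and are not taken from the dataset. With pandas, the full matrix behind a heatmap like this would normally come from `DataFrame.corr()`.

```python
import math


def bmi(height_cm, weight_kg):
    """Body mass index: weight in kg divided by height in metres squared."""
    height_m = height_cm / 100
    return weight_kg / height_m ** 2


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Illustrative values only (not from the dataset).
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [55, 62, 70, 85, 90]
bmis = [bmi(h, w) for h, w in zip(heights_cm, weights_kg)]

weight_bmi_r = pearson(weights_kg, bmis)
```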
## Features vs Target
Patients with cardiovascular disease show higher systolic blood pressure (median ~130 vs ~115), are slightly older (median ~55 vs ~52), and have marginally higher BMI. However, the two groups overlap substantially on every feature, which helps explain why accuracy plateaus around 73%. No single feature cleanly separates healthy from at-risk patients.
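To see why overlap caps accuracy, consider the best possible single-threshold rule on one feature. The sketch below uses a dozen made-up systolic readings (not dataset values) and searches for the cutoff that classifies the most patients correctly; because the groups share values, even the best cutoff misclassifies some of them.

```python
def best_threshold(healthy, diseased):
    """Return (threshold, accuracy) for the best rule
    'predict disease if value >= threshold'."""
    total = len(healthy) + len(diseased)
    best = (None, 0.0)
    for t in sorted(set(healthy) | set(diseased)):
        correct = sum(v < t for v in healthy) + sum(v >= t for v in diseased)
        accuracy = correct / total
        if accuracy > best[1]:
            best = (t, accuracy)
    return best


# Toy systolic readings in mmHg, chosen so the two groups overlap.
healthy = [100, 110, 110, 120, 120, 130]
diseased = [120, 130, 130, 140, 150, 160]

threshold, accuracy = best_threshold(healthy, diseased)
# Best rule is ">= 130": 10 of 12 patients classified correctly.
```

A model that combines several overlapping features can do better than any one of them alone, but it inherits the same kind of limit.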
## SHAP Feature Importance
SHAP analysis shows that high systolic blood pressure is the strongest driver of cardiovascular risk predictions, with values above 140 mmHg pushing strongly toward a positive prediction. Age and cholesterol follow as the next most impactful features. Notably, physical activity, smoking, and alcohol have minimal direct impact on predictions. This is consistent with the hypothesis that their effects are already captured by blood pressure and cholesterol levels.
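SHAP values are additive, which is what makes it meaningful to say a feature "pushes" a prediction. As a sketch, assuming contributions are reported in log-odds space (the usual default for tree explainers on XGBoost classifiers; the project's exact configuration is not stated), the prediction for a patient $x$ decomposes as follows. The symbols are notation introduced here, not taken from the project.

```latex
f(x) = \phi_0 + \sum_{i=1}^{M} \phi_i(x),
\qquad
\hat{p}(x) = \frac{1}{1 + e^{-f(x)}}
```

Here $\phi_0$ is the base value (the average model output over the background data), $\phi_i(x)$ is the contribution of feature $i$ for this patient, and $M$ is the number of features. A large positive $\phi_i$ for ap_hi raises $f(x)$ and therefore the predicted probability $\hat{p}(x)$, while near-zero values for smoking, alcohol, and activity leave it almost unchanged.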