Data Analytics. Business Analysis. Project Management. Finance & Support Services.
The primary objective of this project is to predict hospital readmission from hospital admission data. The secondary objective is to evaluate whether an initial diagnosis of diabetes is a reliable predictor of hospital readmission.
The data contains patient information: age, initial length of stay, tests, diagnoses (primary, secondary, and tertiary), and readmission status. Because the target variable is categorical (readmitted: yes/no), two classification models were used: Logistic Regression and Neural Networks.
For the Logistic Regression model, the high number of predictors called for feature selection to improve model performance. Backward Elimination, Forward Selection, and Stepwise Selection were applied, while Exhaustive Search was excluded because of its high computational cost. For the neural network model, a Grid Search was used to identify the most effective number of hidden layer nodes.
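As an illustration of the backward elimination idea mentioned above, here is a minimal sketch. It is not the project's actual code: it assumes scikit-learn, scores each candidate predictor set by validation accuracy (the project may have used a different criterion such as AIC), and runs on synthetic data. The helper name `backward_elimination` and the generated values are hypothetical; only the column names are borrowed from the dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def backward_elimination(X_train, y_train, X_valid, y_valid, names):
    """Drop one predictor at a time while validation accuracy does not fall."""

    def score(cols):
        model = LogisticRegression(max_iter=1000)
        model.fit(X_train[:, cols], y_train)
        return model.score(X_valid[:, cols], y_valid)

    kept = list(range(len(names)))
    best = score(kept)
    while len(kept) > 1:
        # Score every predictor set obtained by removing a single column.
        trials = [(score([c for c in kept if c != drop]), drop) for drop in kept]
        trial_score, drop = max(trials)
        if trial_score < best:
            break  # every removal lowers accuracy, so stop here
        best = trial_score
        kept.remove(drop)
    return [names[c] for c in kept], best


# Synthetic stand-in data (assumption): only the first two columns carry signal.
rng = np.random.default_rng(0)
names = ["time_in_hospital", "n_lab_procedures", "n_procedures",
         "n_medications", "n_inpatient"]
X = rng.normal(size=(400, len(names)))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.4, random_state=1)
selected, best = backward_elimination(X_train, y_train, X_valid, y_valid, names)
print(selected, round(best, 4))
```

Forward selection works the same way in reverse: start from no predictors and add the column that improves the score most. Stepwise selection allows both moves at each step.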
For the secondary objective, assessing whether diabetes can be used to predict hospital readmission, a chart was used to compare readmission results for diabetes against other diagnoses, and the model coefficients were used to reach the final conclusion. The model is intended to support better hospital operations: readmissions are costly and often indicate gaps in care delivery. High readmission rates are associated with increased healthcare expenses and poor patient outcomes. Predicting which patients are at high risk of readmission can enable healthcare providers to deliver preventative interventions and reduce overall readmission rates.
Hospital readmission rates are widely recognized as a critical indicator of healthcare quality, efficiency, and patient outcomes. Elevated readmission rates often point to underlying systemic issues, such as insufficient discharge planning, inadequate patient education, or a lack of proper follow-up care. In the context of a healthcare industry moving steadily toward value-based care models, reducing avoidable readmissions has become a key priority. Addressing this issue not only improves the quality of care delivered to patients but also helps healthcare providers avoid penalties, reduce operational costs, and enhance long-term patient satisfaction and health outcomes.
Creating a comprehensive data analysis can significantly aid in identifying the factors contributing to readmissions, allowing healthcare providers to implement targeted interventions. This proactive approach not only benefits patients but also aligns with industry standards aimed at fostering accountability and excellence in healthcare delivery.
By leveraging the Kaggle hospital readmissions dataset, we were able to simulate real-world hospital scenarios and develop predictive models that offer meaningful insights into readmission patterns.
A. Develop the Understanding
B. Obtain Data for Analysis
To conduct our analysis on hospital readmissions, we first identified and sourced a publicly available dataset that provides relevant and comprehensive healthcare data. After exploring various repositories, we selected a dataset from Kaggle titled "hospital Readmissions".
This dataset includes detailed patient-level information from over 100,000 hospital encounters for diabetic patients. It contains variables related to demographics, admission and discharge details, medical diagnoses, length of stay, medications, and whether the patient was readmitted. Specifically, it allows us to investigate factors influencing hospital readmission, making it suitable for predictive modeling and exploratory analysis in our case.
C. Explore, Clean and Preprocess Data / Reduce the Data Dimension
1. Number of Records: 25,000
2. Number of Columns: 17
3. Columns
Index(['age', 'time_in_hospital', 'n_lab_procedures', 'n_procedures','n_medications', 'n_outpatient', 'n_inpatient', 'n_emergency','medical_specialty', 'diag_1', 'diag_2', 'diag_3', 'glucose_test','A1Ctest', 'change', 'diabetes_med', 'readmitted'],dtype='object')
4. Column Data Types
age object
time_in_hospital int64
n_lab_procedures int64
n_procedures int64
n_medications int64
n_outpatient int64
n_inpatient int64
n_emergency int64
medical_specialty object
diag_1 object
diag_2 object
diag_3 object
glucose_test object
A1Ctest object
change object
diabetes_med object
readmitted object
dtype: object
5. Converted Column Data Types
The fields that are non-numeric are converted to dummy variables to be used in the analysis. The result is 46 columns.
time_in_hospital int32
n_lab_procedures int32
n_procedures int32
n_medications int32
n_outpatient int32
n_inpatient int32
n_emergency int32
age_[50-60) int32
age_[60-70) int32
age_[70-80) int32
age_[80-90) int32
age_[90-100) int32
medical_specialty_Emergency/Trauma int32
medical_specialty_Family/GeneralPractice int32
medical_specialty_InternalMedicine int32
medical_specialty_Missing int32
medical_specialty_Other int32
medical_specialty_Surgery int32
diag_1_Diabetes int32
diag_1_Digestive int32
diag_1_Injury int32
diag_1_Missing int32
diag_1_Musculoskeletal int32
diag_1_Other int32
diag_1_Respiratory int32
diag_2_Diabetes int32
diag_2_Digestive int32
diag_2_Injury int32
diag_2_Missing int32
diag_2_Musculoskeletal int32
diag_2_Other int32
diag_2_Respiratory int32
diag_3_Diabetes int32
diag_3_Digestive int32
diag_3_Injury int32
diag_3_Missing int32
diag_3_Musculoskeletal int32
diag_3_Other int32
diag_3_Respiratory int32
glucose_test_no int32
glucose_test_normal int32
A1Ctest_no int32
A1Ctest_normal int32
change_yes int32
diabetes_med_yes int32
readmitted_yes int32
dtype: object
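The conversion to dummy variables can be sketched as follows. This is an assumption about how the step was done, not the project's confirmed code: it uses pandas `get_dummies` with `drop_first=True` (which would explain why each categorical field is missing one level in the listing above) and casts to `int32`. The three-row frame is made-up sample data.

```python
import pandas as pd

# Tiny made-up sample with a few of the dataset's column names.
df = pd.DataFrame({
    "time_in_hospital": [8, 3, 5],
    "age": ["[70-80)", "[50-60)", "[40-50)"],
    "change": ["no", "yes", "no"],
    "readmitted": ["no", "yes", "yes"],
})

# Non-numeric columns become 0/1 indicator columns; the first level of each
# field is dropped and serves as the reference category.
encoded = pd.get_dummies(df, drop_first=True).astype("int32")
print(encoded.dtypes)
```

Dropping the reference level avoids perfectly collinear dummy columns, which matters for interpreting logistic regression coefficients.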
D. Determine the Data Mining Task
Index(['time_in_hospital', 'n_lab_procedures', 'n_procedures', 'n_medications', 'n_outpatient', 'n_inpatient', 'n_emergency', 'age_[50-60)', 'age_[60-70)', 'age_[70-80)', 'age_[80-90)', 'age_[90-100)', 'medical_specialty_Emergency/Trauma', 'medical_specialty_Family/GeneralPractice', 'medical_specialty_InternalMedicine', 'medical_specialty_Missing', 'medical_specialty_Other', 'medical_specialty_Surgery', 'diag_1_Diabetes', 'diag_1_Digestive', 'diag_1_Injury', 'diag_1_Missing', 'diag_1_Musculoskeletal', 'diag_1_Other', 'diag_1_Respiratory', 'diag_2_Diabetes', 'diag_2_Digestive', 'diag_2_Injury', 'diag_2_Missing', 'diag_2_Musculoskeletal', 'diag_2_Other', 'diag_2_Respiratory', 'diag_3_Diabetes', 'diag_3_Digestive', 'diag_3_Injury', 'diag_3_Missing', 'diag_3_Musculoskeletal', 'diag_3_Other', 'diag_3_Respiratory', 'glucose_test_no', 'glucose_test_normal', 'A1Ctest_no', 'A1Ctest_normal', 'change_yes', 'diabetes_med_yes'], dtype='object')
E. Partition the Data
To guard against overfitting, the data was initially partitioned into 60% training data and 40% validation data using the train_test_split function. Other models used adjusted splits of 80%/20% and 70%/30%.
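A minimal sketch of the 60%/40% partition with scikit-learn's `train_test_split` is shown below. The arrays are placeholders and the `random_state` value is an assumption added for reproducibility. Changing `test_size` to 0.2 or 0.3 gives the other splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder predictors and target (100 rows) standing in for the real data.
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# test_size=0.4 holds out 40% of the rows for validation.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.4, random_state=1)
print(len(X_train), len(X_valid))
```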
F. Techniques
G. Charts, Algorithm and Measures
LOGISTIC REGRESSION MODEL
Bar Chart for Data Visualization of Diagnosis Readmission
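The figures behind a chart of this kind can be sketched as a readmission rate per primary diagnosis. The rows below are made-up examples, not values from the dataset, and the use of pandas `groupby` is an assumption about the approach.

```python
import pandas as pd

# Made-up rows using the raw column names diag_1 and readmitted.
df = pd.DataFrame({
    "diag_1": ["Diabetes", "Diabetes", "Circulatory", "Circulatory",
               "Respiratory", "Respiratory", "Diabetes", "Circulatory"],
    "readmitted": ["yes", "no", "no", "yes", "no", "no", "yes", "no"],
})

# Share of encounters with readmitted == "yes" within each primary diagnosis.
rates = (df["readmitted"].eq("yes")
         .groupby(df["diag_1"])
         .mean()
         .sort_values(ascending=False))
print(rates)
```

With matplotlib installed, `rates.plot.bar()` would draw the corresponding bar chart.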
The initial Logistic Regression model (Model #1), fitted on the full set of predictors, reached a validation accuracy (from the confusion matrix) of around 61.57%, an acceptable result.
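For reference, accuracy is read from a confusion matrix as the correctly classified cases divided by all cases. The sketch below uses ten made-up labels rather than the project's validation data, and assumes scikit-learn's `confusion_matrix`.

```python
from sklearn.metrics import confusion_matrix

# Made-up actual and predicted readmission labels (1 = readmitted).
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

# Accuracy = correctly classified cases / all cases.
accuracy = (tn + tp) / cm.sum()
print(cm)
print(accuracy)
```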
The revised model (Model #2) and the models built with backward elimination, forward selection, and stepwise selection all reached similar validation accuracy of around 60%, still below the initial model, so feature selection provided no advantage.
The Neural Network model (Model #6) reached a validation accuracy of 61.92%, slightly higher than the best Logistic Regression model (a difference of 0.35 percentage points). Although GridSearch was used to tune the network, it did not improve accuracy, so the original Neural Network (Model #6) performed better than the other six models.
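A grid search over the number of hidden layer nodes can be sketched as follows. This assumes scikit-learn's `GridSearchCV` and `MLPClassifier`. The candidate sizes, the cross-validation setting, and the synthetic data are illustrative assumptions, not the project's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic binary classification data standing in for the readmission data.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)

# Candidate single-hidden-layer sizes (assumed values).
param_grid = {"hidden_layer_sizes": [(2,), (4,), (8,)]}

search = GridSearchCV(
    MLPClassifier(max_iter=500, random_state=1),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```

The cross-validated score of the best grid candidate is not guaranteed to beat an untuned network on a separate validation set, which is consistent with the result reported above.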
Based on the coefficients of the predictors, an initial (primary) diagnosis of diabetes has a positive effect on readmission, whereas a secondary or tertiary diagnosis of diabetes has a negative effect.
Among the other predictors, A1Ctest_no, change_yes, and diabetes_med_yes also have a positive effect on readmission.
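To make the sign of a logistic regression coefficient concrete: exponentiating a coefficient gives an odds ratio, so a positive coefficient multiplies the odds of readmission by more than 1 and a negative one by less than 1. The coefficient values below are hypothetical placeholders chosen only to show the signs. They are not the fitted values from this project.

```python
import math

# Hypothetical coefficients (placeholders, NOT the project's fitted values).
coefs = {
    "diag_1_Diabetes": 0.20,
    "diag_2_Diabetes": -0.10,
    "diag_3_Diabetes": -0.05,
}

# exp(coefficient) is the multiplicative change in the odds of readmission
# when the dummy variable goes from 0 to 1, other predictors held fixed.
odds_ratios = {name: math.exp(b) for name, b in coefs.items()}
for name, ratio in odds_ratios.items():
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name}: odds ratio {ratio:.3f} ({direction} the odds)")
```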
Overall, the project produced a model that predicts readmission for currently admitted patients with 61.92% accuracy, with a focus on patients whose initial diagnosis is diabetes (those flagged A1Ctest_no, with a change in diabetes medication, and with diabetes medication prescribed). With this model, the hospital can improve its healthcare services by predicting which patients are at high risk of readmission, enabling healthcare providers to deliver preventative interventions and reduce overall readmission rates.