Predicting student graduation in higher education using data mining models: a comparison

University of Alabama Libraries

Predictive modeling with data mining methods can identify students at risk early and thereby help improve student graduation rates. Data-driven decision planning based on data mining techniques is an innovative methodology that universities can adopt. The goal of this research study was to compare data mining techniques in assessing student graduation rates at The University of Alabama. Data analyses were performed using two different datasets. The first dataset included pre-college variables, and the second dataset included pre-college variables along with college (end of first semester) variables. For both the pre-college and college datasets, 10-fold cross-validation indicated no difference in misclassification rates among the logistic regression, decision tree, neural network, and random forest models. The misclassification rate indicates the error in predicting the actual number who graduated. The model misclassification rates for the college dataset were around 7% lower than those for the pre-college dataset. The decision tree model was chosen as the best data mining model because of its advantages over the other models in ease of interpretation and handling of missing data. Although pre-college variables provide good information about student graduation, adding first semester information to the pre-college variables provided better prediction of student graduation. The decision tree model for the college dataset indicated first semester GPA, status, earned hours, and high school GPA as the most important variables. Of the 22,099 students who were full-time, first-time entering freshmen from 1995 to 2005, 7,293 (33%) did not graduate. Of those 7,293, 2,845 students (39%) had a first semester GPA < 2.25 with fewer than 12 earned hours. 
This study found that institutions can use historical pre-college (high school) information and end of first semester data to build decision tree models that identify significant variables predicting student graduation. Students at risk can be identified at the end of the first semester instead of waiting until the end of the first year of school. The results from data mining analyses can be used to develop intervention programs to help students succeed in college and graduate.
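
The figures above can be checked with a short sketch. It recomputes the two reported percentages from the stated counts and shows how a misclassification rate is calculated for a simple flagging rule based on the reported subgroup (first semester GPA < 2.25 with fewer than 12 earned hours). The names `Student`, `is_at_risk`, and `misclassification_rate`, along with the four sample records, are hypothetical and chosen for illustration; the rule is not the fitted decision tree from the study.

```python
from dataclasses import dataclass


@dataclass
class Student:
    first_semester_gpa: float
    earned_hours: int
    graduated: bool


def is_at_risk(student: Student) -> bool:
    # Illustrative rule taken from the subgroup described in the abstract;
    # it is an assumption, not the study's actual decision tree.
    return student.first_semester_gpa < 2.25 and student.earned_hours < 12


def misclassification_rate(actual, predicted) -> float:
    # Share of cases where the predicted outcome differs from the actual one.
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)


# Counts reported in the abstract.
cohort = 22_099
non_graduates = 7_293
low_gpa_low_hours = 2_845

share_not_graduating = non_graduates / cohort          # about 0.33
share_in_subgroup = low_gpa_low_hours / non_graduates  # about 0.39

# Hypothetical records, invented purely to exercise the functions.
students = [
    Student(1.8, 9, False),
    Student(3.4, 15, True),
    Student(2.1, 12, True),
    Student(2.0, 6, True),
]
actual = [s.graduated for s in students]
predicted = [not is_at_risk(s) for s in students]  # predict graduation unless flagged
rate = misclassification_rate(actual, predicted)

print(f"{share_not_graduating:.0%} did not graduate")
print(f"{share_in_subgroup:.0%} of non-graduates were in the low GPA, low hours subgroup")
print(f"misclassification rate on sample records: {rate:.2f}")
```

In this toy sample the rule wrongly flags one student who went on to graduate, so one of four predictions is wrong.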

Electronic Thesis or Dissertation
Statistics, Education