Machine Learning to Predict Student Grades with CatBoost and Optuna

Predicting student grades with quantitative and qualitative data at various points in the school year, using CatBoost regression and classification with Optuna for hyperparameter tuning.

Recently I wanted to give some of my older projects a refresh with some new tools, tricks, and methods for Machine Learning. Since I’ve created these projects I’ve learned so much and wanted to take the opportunity to correct my approaches and mistakes. Doing this allowed me to use my previous model performance as a benchmark, as well as solidify and document some of my newer learnings. Rather than replace this older project I will keep it appended below the newer work I’ve done in order to highlight my learning journey. For my newer work I’ve focused more on the model building rather than recreating graphs and did not replicate the EDA I did in the first iteration. 


For this article I will walk through a recent project in which I predicted and visualized student grades from a well-known dataset. I have heard a lot about the value of domain knowledge, and decided to use mine from the seven years I spent teaching high school Mathematics. It was hard to find education datasets that involved actual student grades, but UC Irvine has a wonderful dataset available from two schools in Portugal which includes student grades for two classes (Math and Portuguese) and a small student survey to help with the analysis. I chose to work with only the Math dataset since that is my background.

The Data Set – UCI Machine Learning Repository 

Original Research Paper – Using Data Mining to Predict Secondary School Student Success

CatBoost Tutorial

The dataset is small, consisting of only 33 feature columns and 395 rows. Many of the features come from a small student survey about their personal lives, and many of the survey questions are categorical yes/no answers or qualitative ratings from 1-5. There are questions about their health, family situations, and so on. On the academic side there are only three quantitative grades, one for each trimester, with the last one being their final grade. The data is also fairly imbalanced: a large share of students did not pass that year.


Data Engineering

Since CatBoost can handle the categorical data without any engineering or encoding I had to do very little before jumping straight into building models. Since only the final grades (G3) were given with no indication of pass/fail, I researched what was considered a pass/fail in the country the schools were located and created a separate column for that. Similarly I researched what the corresponding letter grades would be if this were an American school. 
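A minimal sketch of how these two columns could be derived with pandas. The passing mark of 10 (on the 0 to 20 scale), the letter-grade cut points, and the column names pass_fail and letter_grade are assumptions for illustration, not necessarily the exact values used in this project.

```python
import pandas as pd

# Toy stand-in for the final-grade column (G3 is on a 0-20 scale).
df = pd.DataFrame({"G3": [0, 8, 9, 10, 11, 13, 15, 16, 20]})

# Assumed passing mark: 10 out of 20.
PASS_MARK = 10
df["pass_fail"] = (df["G3"] >= PASS_MARK).map({True: "pass", False: "fail"})

# Assumed cut points: 0-9 F, 10-11 D, 12-13 C, 14-15 B, 16-20 A.
bins = [-1, 9, 11, 13, 15, 20]
labels = ["F", "D", "C", "B", "A"]
df["letter_grade"] = pd.cut(df["G3"], bins=bins, labels=labels).astype(str)

print(df)
```

Storing the letter grade as a plain string keeps it usable both as a categorical target and as a stratification column later on.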

Next there appeared to be a strange pattern in the column showing the number of absences a student accumulated throughout the year. More often than not, students were absent an even number of days rather than an odd number. This can happen when schools have alternating schedules where each class meets every other day, so by missing two days of school you only miss each class once.

For my models to better learn how absences affect grades, I binned the odd days into even as seen below.
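One way this binning could look in pandas is sketched here. Rounding each odd count up to the next even number is an assumption (rounding down would be just as plausible), and the absences_even column name is hypothetical.

```python
import pandas as pd

# Toy absence counts; the real column holds one value per student.
df = pd.DataFrame({"absences": [0, 1, 2, 3, 7, 10, 75]})

# Assumption: odd counts are rounded UP to the next even number.
# Even counts have remainder 0 and are left unchanged.
df["absences_even"] = df["absences"] + df["absences"] % 2

print(df)
```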

Other than that I did almost nothing in terms of feature engineering, which is remarkable and one of the coolest features of CatBoost. The only thing CatBoost needs is a list of the column indices of all the categorical features.
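A small sketch of building that index list is below. Treating every non-numeric column as categorical is an assumption, and the four columns are just a toy subset of the survey.

```python
import pandas as pd

# Toy subset of the feature columns.
X = pd.DataFrame({
    "school": ["GP", "MS", "GP"],
    "age": [15, 17, 16],
    "internet": ["yes", "no", "yes"],
    "G1": [12, 8, 15],
})

# Assumption: every non-numeric column is categorical. Survey ratings
# stored as integers (1-5) would have to be added to the list by hand.
cat_features = [
    i for i, col in enumerate(X.columns)
    if not pd.api.types.is_numeric_dtype(X[col])
]
print(cat_features)

# The list can then be handed to CatBoost, for example
# CatBoostRegressor(cat_features=cat_features).
```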

CatBoost Regression

Model 1: Predicting Final Student Number Grades After the 2nd Trimester without Feature Selection 

Splitting the Data

Since I’ll be doing three different types of predictions on this data (a regression, a binary classification, and a multi-class classification), I’ll start each one with a different split of the data and different labels. Here I’m doing an 80/20 split since the data is small, and using the stratify parameter so that the train and test sets have approximately equal representation of the values in the chosen column. Normally I would stratify on the target column (the numerical final grade G3 for regression), but its values are too sparse, so I thought it best to use the letter grade column instead, which is G3 binned and stored as a categorical string. By too sparse I mean that there may be only one student with a grade of 9, for example, which is troublesome when the splitting function tries to represent 9’s in both the train and test sets.

I’ll use the test set to test my final model, and further split the train set for validation inside the objective function next. 
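The split described above can be sketched as follows. The data is synthetic, and the letter-grade bins and the random_state value are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the student data (not the real dataset).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "G2": rng.integers(0, 21, size=n),
    "G3": rng.integers(0, 21, size=n),
})
# Assumed letter-grade bins, used only as the stratification column.
df["letter_grade"] = pd.cut(
    df["G3"], bins=[-1, 9, 11, 13, 15, 20], labels=["F", "D", "C", "B", "A"]
).astype(str)

X = df.drop(columns=["G3", "letter_grade"])
y = df["G3"]  # regression target

# 80/20 split, stratified on the binned grade rather than on sparse G3.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=df["letter_grade"], random_state=42
)

# The share of each letter grade should be close in both sets.
print(df.loc[X_train.index, "letter_grade"].value_counts(normalize=True).round(2))
print(df.loc[X_test.index, "letter_grade"].value_counts(normalize=True).round(2))
```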

Building the Objective Function

Optuna uses Bayesian optimization to tune the hyperparameters, so it requires an objective function that returns a metric. Inside the objective function lie the parameters to be tuned as well as the CatBoost model. You can use the trial.suggest methods to specify a range and step for each hyperparameter. You can also specify the GPU here, but I opted for CPU since my data was small and I’d be training on Google Cloud (thanks, Google!).

Stratified K-Fold Cross Validation

Since the data was small, and became even smaller after splitting off 80% for training, I opted to use K-Fold Cross Validation inside my optimization function to measure the model’s performance. By using the StratifiedKFold() function I am able to preserve the imbalances in the data in both the train and validation sets. You may notice that the random_state parameter is set, which means that the same splits are used in every iteration of the Optuna optimization. It turns out that using the same splits every iteration resulted in better test results than using different splits every time (I tried this with many different test sets). At first this was counterintuitive to me, as I was expecting the randomization of the splits to help the model learn better, but now I think the Optuna Bayesian process learns better when it is presented with the same train and validation splits each iteration. For more discussion on this you can read here:

Perhaps when using completely random splits the number of training iterations would have to be increased in order to perform well. 

After validation the scores are appended to a list and the average score is returned from the objective function. With this approach Optuna will choose the model that performs best on all 5 splits of the training data. I chose to evaluate on the Mean Absolute Error, since it’s the most interpretable metric for evaluating school grades. 

Bayesian Optimization

After creating the objective function, Optuna is fairly straightforward to set up. Since I’m using MAE, I want Optuna to minimize this error. I opted to use the TPESampler (Tree-structured Parzen Estimator), a Bayesian optimization method, meaning that it treats hyperparameter values as probability distributions and updates its understanding of these distributions with every iteration. It continues to sample hyperparameters as it aims to improve the objective’s score little by little with each iteration.

You can easily plot the optimization history of Optuna to visualize how quickly it is learning and on which iterations the best scores were returned. And since this is charted in plotly, it’s also interactive. 

You can also easily see the importance of each hyperparameter to the objective function.

You can even see how each hyperparameter scored for each trial.

SHAP Values

Here you can see that for the first three predictions G2 and G1 are the main drivers pushing the predictions higher.

Here we can see that high values of G2 and G1 have a strong positive impact on the predicted final grade, while low values have a negative impact.

Here you can explore the impact of each feature on the first N predictions.


Most importantly we can see the hyperparameter values of the best model. This could also be a good time to see which values performed well and make any adjustments to your parameter ranges in the optimization function.

And finally the moment you’ve been waiting for… Optuna makes it easy to select the best hyperparameters, train a new model with them, and evaluate the predictions against the test set.

Here I received an MAE of 0.86, which is quite good as the error is less than one whole grade point.

Model 2: Predicting Final Student Number Grades After the 2nd Trimester with Feature Selection 


I have learned that when boosting algorithms encounter correlated variables, the model tends to focus on only one of them and not the other. This data set has many correlated variables, so even though CatBoost is set up to handle this, I decided to see for myself whether feature selection would improve the model’s score. Looking back, rather than perform my own feature selection, I think I would opt to use CatBoost’s built-in select_features method next time.

Correlation in XGBoost

Feature Selection

CatBoost already includes regularization; however, when I selected only the best features I did see an improvement in score and, more notably, in speed. To make sure the score improvement wasn’t due to chance or to the randomness in the train and test splits, I ran it a handful of times, and every run performed better than without feature selection. And as I said, even with just 9 fewer features it was much faster. I didn’t do anything fancy to select my features: I used the get_feature_importance method, applied a threshold of 0.08, and sliced the dataset accordingly.

Setting up the new feature and test columns.


You can see here that the best model with feature selection scored 1.04, not a huge improvement over the model without feature selection, whose best model had a score of 1.0.

However, on the test set it scored much better, with an MAE of only 0.78. With the boost in speed as well, I think I will quickly eliminate any unhelpful columns in my next models.

Model 3: Predicting Final Student Number Grades After the 1st Trimester with Feature Selection

After the first trimester may be a little too early in the year to try to predict student grades, but we can certainly try. Using the same process, I was only able to obtain an MAE of 1.85.


Not a great performance, and rather than try to predict a student’s actual grade after the first trimester it might make more sense to try to predict whether they will pass or fail, which is what the next section is all about. 

CatBoost Binary Classification

Model 1: Predicting Pass/Fail After the 2nd Trimester

Data Split & Features

For this prediction I will again use the stratify parameter in the train_test_split function but rather than split on the target variable of pass/fail, I will opt to split on the more granular but related variable showing the number of previous failures. I have found this to be more impactful on the classification model’s performance. 

Because of the speed performance I also used the same features from the previous model. Ideally I would have performed another baseline classification model to identify the features that are impactful toward a classification model.

The Objective Function, Stratified K-Fold CV, and the AUC ROC Score

Similar to the regression model, to work with Optuna you still need to build an objective function that returns a metric. Below you can see the hyperparameters I’ve chosen to tune, which are slightly different from those of the regression models.

And again inside the objective function I’ve chosen to perform the same Stratified K-Fold cross validation. This time rather than returning the MAE, since this is a classification model, I’ve chosen to focus on and return the AUC ROC score.

And just a quick note that I had to adjust the Optuna direction parameter to “maximize” the objective function, since I’m now measuring AUC ROC.


After another 300 trials I can see the best hyperparameters and the best AUC ROC score, which is the average of the cross validated AUC ROC scores.

The only plot I chose to focus on here is the Optimization History Plot, and you can see how quickly this model learns as well as how small the variance is aside from those few outliers.

After retraining the model with the best parameters and testing on the test set, I was able to obtain an AUC ROC of 92.5. And that’s an A in my book.

I also like to visualize the confusion matrix as it’s more interpretable and I’m actually a little more concerned with false positives rather than false negatives. It’s better to think a student will fail and then be surprised by a pass, rather than thinking the student will pass but seeing them fail. This model performed very well in that regard as only 1 student in the test set was a false positive.
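To make the two error types concrete, here is a tiny sketch with scikit-learn. The labels and scores are made up, and encoding a pass as 1 with a 0.5 decision threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# 1 = pass, 0 = fail (assumed encoding); scores are made-up P(pass) values.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# AUC ROC is computed from the scores, not from thresholded labels.
auc = roc_auc_score(y_true, y_score)

# The confusion matrix needs hard predictions, here at a 0.5 threshold.
y_pred = (y_score >= 0.5).astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(f"AUC ROC: {auc:.2f}")
print(f"false positives (predicted pass, actually failed): {fp}")
print(f"false negatives (predicted fail, actually passed): {fn}")
```

With this encoding, a false positive is the costly case described above: a student predicted to pass who ends up failing.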

Model 2: Predicting Pass/Fail After the 1st Trimester

I ran the same model and metrics but this time from after the 1st trimester. 


This model also performed very well, with an AUC ROC of 87.7 against the test set.

This model performs almost as well as the model predicting after the 2nd trimester, just a few percentage points off. However, there is an even split between the false positives and negatives, which is still better than it favoring false positives.

All in all, these binary classification models were in my opinion the most useful to an educator looking to allocate resources towards students at risk of failing. Next up I’ll take a shot at predicting actual A-F letter grades.

CatBoost Multi-Class Classification

Here I will predict students’ final letter grades, which have been binned from their integer representation in the data set according to my research into the schools’ grade structure.

Model 1: Predicting Letter Grade A-F After the 2nd Trimester

Data Split & Features

Similar to the Regression and Binary Classification models, I will again stratify the train and test sets, but this time by the target variable, the final letter grades, so that the train and test sets have a similar share of each of the final letter grades A-F.

And just as before I will use only the features I have found to be most important via the Regression performed previously. I’m doing this to save on training time, but in the future it would also be beneficial to test the feature importances again for classification.

The Objective Function, Stratified K-Fold CV, Imbalanced Class Weights, and the F1 Score

The objective function tunes the same hyperparameters and uses the same Stratified K-Fold process as the binary classification model. The only differences are that I weight the classes to account for their imbalance and return the F1 score as its metric.

Both in the original data and in the stratified train and test splits there is a mild class imbalance. In the test set, for instance, there are only 8 A’s, 12 B’s, 12 C’s, 21 D’s, and unfortunately 26 F’s. To account for this and to help the model learn, I use CatBoost’s class_weights parameter and pass in weights calculated from the class counts within each split of the stratified cross validation, so that I’m not pretending to know the whole train set’s unique counts at the time of validation. To get the calculated counts I use NumPy’s unique function, as well as compute_class_weight from Scikit-learn.
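The weight calculation can be sketched like this. The label vector is rebuilt from the test-set counts quoted above purely for illustration, and the "balanced" weighting mode is an assumption.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative labels rebuilt from the class counts quoted above.
counts = {"A": 8, "B": 12, "C": 12, "D": 21, "F": 26}
y = np.repeat(list(counts.keys()), list(counts.values()))

classes = np.unique(y)
# "balanced" gives n_samples / (n_classes * class_count) per class.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weights = dict(zip(classes, weights))
print(class_weights)

# The dict can then be passed on, for example
# CatBoostClassifier(class_weights=class_weights).
```

Rare classes such as A receive weights above 1 and the common F class a weight below 1, so mistakes on rare grades cost more during training.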

F1 seemed like a natural choice to measure the model’s performance since the data is slightly imbalanced. I could have binarized the labels in order to return the AUC ROC score and better compare to the binary classification model, but later I will use a visualization package that outputs these scores for me automatically. I’m using the macro average rather than the weighted average due to the class imbalances: I want the model to be penalized for misclassifications evenly across all of the classes. Since the majority class happens to be the F letter grade, I could have used the weighted average to penalize these mistakes more, which I think I would explore next time. The cost of getting an F prediction wrong is higher than for many of the other letter grades.
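The difference between the two averages is easy to see on a small made-up example with scikit-learn’s f1_score. The labels below are invented for illustration only.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up true and predicted letter grades.
y_true = ["A", "A", "B", "B", "C", "C", "D", "D", "D", "F", "F", "F", "F", "F"]
y_pred = ["A", "B", "B", "B", "C", "D", "D", "D", "F", "F", "F", "F", "F", "D"]

labels = sorted(set(y_true))
per_class = f1_score(y_true, y_pred, average=None, labels=labels)

# Macro: plain mean of the per-class scores, every class counts equally.
macro = f1_score(y_true, y_pred, average="macro")
# Weighted: mean weighted by how often each class occurs in y_true.
weighted = f1_score(y_true, y_pred, average="weighted")

print(dict(zip(labels, per_class.round(3))))
print(f"macro F1: {macro:.3f}, weighted F1: {weighted:.3f}")
```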


Before training my final model, I calculated the class weights again, this time on the entire train set, to use in the model. You can see the performance was an F1 score of 78.4.

And there appears to be greater variance in the optimization history’s objective value when compared to the pass/fail models, which makes sense.

And here, looking at the confusion matrix, no prediction was more than one letter grade away from the true label. In fact, for the misclassifications involving F’s, the model is more likely to predict an F for a student who actually earned a D than to predict a D for a student who actually earned an F, so I’m happy with this performance.

Using the yellowbrick package makes it very easy not only to graph ROC curves for a multi-class classification model, but also to output each class’s AUC ROC score. Since the package was designed to work with Scikit-learn models, you can wrap the CatBoost model with its wrap() function. Here you can see the AUC ROC scores of the A and F classes are slightly higher than the other classes, which would work well for this potential use case.

Model 2: Predicting Letter Grade A-F After the 1st Trimester

Here I followed the same procedure as the first multi-classifier model. 


With an F1 score of almost 53, I’m not too happy with this model’s performance.

And you can see that predicting this early in the year leads to predictions being more than one letter grade off, which is not ideal.

And looking at each class’s respective AUC ROC score, it seems that A’s and F’s are easier to predict after the first trimester.


Like many others, I too am impressed by the ease and performance of the various CatBoost models. They performed as well as XGBoost, but without the hassle of data preprocessing and encoding. Optuna was quick to find the best parameters, and the visualizations gave me added confidence in its choices. As an added bonus, both of these packages have great documentation and resources behind them. There are plenty more features, functions, and methods to explore, so I will definitely keep these tools at the top of my toolbox.
