Project Leads: Iris Derry, Renee Li
Team Members: Davis Bartels, Davis Malmer, Dhanya Narayanan, Ian Buchanan, Ian Fischer, James Kim, Jiamin Yang, Linkai Li, Marisol Garrouste, Matthew Bellick, Melodie Jin, Nathan Martin, Radi Akbar, Xin Hao Ling, Xinqian Zhai, Zach Fewkes, Jayachandra Korrapati
About this project
This project was based on Kaggle's 2021 Riiid Answer Correctness Prediction competition. Our goal was to develop a machine learning model (or several, since we divided into four groups) that can accurately predict a student's future performance.
We used data provided by Kaggle, which was pulled from Santa TOEIC, a Korean app made by Riiid that aims to prepare Korean students and professionals for the TOEIC exam. The TOEIC is a test of English preparedness for common workplace environments. The app uses AI to tailor students' learning: it analyzes which skills a student seems to be struggling with and gives them relevant lectures and questions so they can improve.
The dataset provided by Kaggle includes information such as the amount of time a user/student has spent on the app, which questions they answered and whether they answered correctly, and which lectures they watched. We wanted to find a way of identifying whether students will get a future question correct given not only their past performance, but other factors like time on the app, time between questions, etc.
To begin, many teams removed the "Lecture" entries from the dataset. These entries did not provide insight into user performance on questions, only whether users had watched instructional material. A deeper model might be able to utilize these entries, but for our purposes they were not needed. The rest of the data was very complete, i.e. there were few N/A values that needed to be removed.
The next step was to join our multiple data frames. We had our largest set of data in “train.csv” and two smaller sets called “questions.csv” and “lectures.csv”, which contained metadata about the questions and lectures, respectively. We were interested in some of the data in the “questions.csv” file for feature creation, like what part of the exam a question was from. So we employed Pandas join methods to take the features we wanted from the “questions.csv” file and add them to our train file.
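A minimal sketch of that join with pandas, using toy stand-ins for the two files (the column names follow the Kaggle schema, where the train file's `content_id` refers to `question_id` in the questions file):

```python
import pandas as pd

# Toy stand-ins for train.csv and questions.csv; in the Kaggle schema,
# train's content_id refers to question_id in the questions file.
train = pd.DataFrame({
    "user_id":            [1, 1, 2],
    "content_id":         [10, 11, 10],
    "answered_correctly": [1, 0, 1],
})
questions = pd.DataFrame({
    "question_id": [10, 11],
    "part":        [5, 2],  # which section of the TOEIC the question is from
})

# Left-join the question metadata onto every train row.
train = train.merge(
    questions.rename(columns={"question_id": "content_id"}),
    on="content_id",
    how="left",
)
print(train["part"].tolist())  # [5, 2, 5]
```

A left join keeps every train row even if a question is missing from the metadata file, which is usually the safer default.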
Lastly, depending on the model a team worked with, data standardization and normalization were performed. Some teams chose a decision-tree-based model, which did not require any such manipulation, but teams with distance-based classifiers benefited from standardization.
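Standardization here means rescaling a feature to zero mean and unit variance. A pure-Python sketch of the z-score transform (the same thing a library scaler such as scikit-learn's StandardScaler performs):

```python
# Rescale a list of feature values to zero mean and unit variance (z-score).
def standardize(values):
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = variance ** 0.5
    return [(v - mean) / std for v in values]

scaled = standardize([2.0, 4.0, 6.0])
print(scaled)  # symmetric around 0, unit variance
```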
Figure 1. Distribution of % questions correctly answered, grouped by how long users have been using the app.
We did some exploratory data analysis to get to know our data, particularly since we had some background intuition about educational apps. For example, we expected users new to the app to perform worse than users who have been on the app for a while, which we confirmed using the visualizations in Figure 1 above. We used similar methods to explore other features, which helped inform feature selection and engineering in later steps.
The first features we were interested in were user performance features. Our training data was structured like an activity log: it had the timestamp and result of each question a user answered on the app, but no cumulative statistics. So we aggregated various statistics per user ID, like total questions answered, total time on the app, and the percentage of answered questions they got right.
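A small sketch of that aggregation with pandas, on a toy activity log (column names mirror the Kaggle train file):

```python
import pandas as pd

# Toy activity log: one row per answered question, as in train.csv.
log = pd.DataFrame({
    "user_id":            [1, 1, 1, 2],
    "answered_correctly": [1, 0, 1, 1],
})

# Per-user statistics: total questions answered and fraction correct.
user_stats = log.groupby("user_id")["answered_correctly"].agg(
    questions_answered="count",
    accuracy="mean",
)
print(user_stats)
```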
Next, we were interested in content statistics. As mentioned before, each question corresponds to a part of the TOEIC exam, and we found that users as a whole performed better on some parts than others. So we created features capturing the average performance on each part of the test, which lets us make better guesses about users with few or no questions answered.
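This kind of per-part average can be broadcast back onto every row with a groupby-transform, sketched here on toy data:

```python
import pandas as pd

# Toy rows: each answered question tagged with its TOEIC part.
df = pd.DataFrame({
    "part":               [1, 1, 2, 2],
    "answered_correctly": [1, 0, 1, 1],
})

# Mean accuracy per exam part, copied back onto every row, so even a
# brand-new user gets a sensible prior for the question in front of them.
df["part_mean_accuracy"] = (
    df.groupby("part")["answered_correctly"].transform("mean")
)
print(df["part_mean_accuracy"].tolist())  # [0.5, 0.5, 1.0, 1.0]
```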
Lastly, we had to consider future information, given the time-series nature of this project. It was important not to allow future data our model hasn't seen yet to pollute our predictions. We may have access to all 50 questions a user answered in training, but if we haven't yet seen past question 30 we shouldn't use the later answers in our user performance mean. So we had to loop through the data in order, updating our user performance statistics as we went.
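One leakage-free way to express this update in pandas is a running per-user mean shifted by one row, so a question's own answer never feeds its own feature (a sketch on toy data, with rows assumed to be in time order):

```python
import pandas as pd

# One user's rows in time order; answered_correctly for questions 1..4.
df = pd.DataFrame({
    "user_id":            [1, 1, 1, 1],
    "answered_correctly": [1, 0, 1, 1],
})

g = df.groupby("user_id")["answered_correctly"]
# Count and sum of *previous* rows only: subtract the current answer so
# it never leaks into its own feature.
prior_answered = g.cumcount()                        # 0, 1, 2, 3
prior_correct  = g.cumsum() - df["answered_correctly"]
df["user_running_accuracy"] = (prior_correct / prior_answered).fillna(0.0)
print(df["user_running_accuracy"].tolist())  # [0.0, 1.0, 0.5, 0.666...]
```

The first row falls back to 0.0 for lack of history; in practice a global or per-part mean is a better cold-start default.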
Each team was free to choose their classification model and work on it as best they could.
Logistic Regression: Logistic regression measures the relationship between a categorical dependent variable and one or more independent variables by estimating probabilities with a logistic function.
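Concretely, the model passes a weighted sum of the features through the logistic (sigmoid) function to get a probability. A tiny sketch with made-up weights for illustration:

```python
import math

# The logistic (sigmoid) function maps a linear combination of features
# to a probability in (0, 1). Weights and features here are hypothetical.
def predict_proba(features, weights, bias):
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# e.g. user accuracy 0.8, part-level accuracy 0.6
p = predict_proba([0.8, 0.6], [2.0, 1.5], -1.0)
print(round(p, 3))  # probability the next answer is correct
```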
Random Forest Classification: A random forest constructs a multitude of decision trees at training time and outputs the class that is the mode of the individual trees' predictions. It is an ensemble algorithm, one that combines multiple algorithms of the same or different kinds, the idea being that many weak trees, when combined, behave like a single stronger, more accurate classifier.
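The "mode of the trees' predictions" step is just a majority vote, sketched here over a hypothetical list of per-tree class predictions:

```python
from collections import Counter

# Ensemble prediction as a majority vote over the individual trees'
# outputs; tree_votes is a hypothetical list of per-tree predictions.
def forest_predict(tree_votes):
    return Counter(tree_votes).most_common(1)[0][0]

print(forest_predict([1, 0, 1, 1, 0]))  # 1: three of five trees vote "correct"
```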
LGBM (Light Gradient Boosting Machine): LightGBM is also a tree-based learning algorithm. It uses histogram-based algorithms, which split continuous feature values into discrete bins; this speeds up training and reduces memory usage.
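The histogram idea can be sketched in a few lines: bucket each continuous value into one of a fixed number of equal-width bins, so split search runs over bin edges instead of every distinct value (LightGBM's real binning is more sophisticated, this is only the core idea):

```python
# Equal-width histogram binning: map continuous values to bin indices.
def to_bins(values, n_bins):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature
    # Clamp so the maximum lands in the last bin, not one past it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

print(to_bins([0.1, 0.4, 0.45, 0.9], 4))  # [0, 1, 1, 3]
```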
Why did you do this project?
For wolverines who are still learning data science, Kaggle's 2021 Riiid Answer Correctness Prediction competition is a friendly and challenging semester project. It gave us a great opportunity to practice basic data processing skills such as data cleaning, data visualization, and data analysis, as well as more complex skills such as big data processing, feature engineering, and model selection and evaluation. In addition, learning how to transform business understanding into data understanding is an important and indispensable skill, which we also practiced in this project.
Through collaboration within teams and discussions between teams, each of our teams completed the project and submitted a final result to the Riiid competition on Kaggle. Along the way, we consolidated our existing data processing skills.
After each team submitted its model, the competition ran a simulation to test it. The resulting accuracy rates were:
LGBM: 76%
1st Logistic Regression: 73%
Random Forest Classification: 72%
2nd Logistic Regression: 61%
Although the models differ in many aspects, most of them share the same features: mean accuracy per TOEIC part, mean accuracy per question, mean accuracy of the user, and whether the user saw the explanation of the prior question. Thus, the main difference between the models is their type. From the results, LGBM has the highest accuracy rate of all the models, which leads us to conclude that decision-tree models may be better suited to this problem than weighted-probability models like logistic regression. Furthermore, some of the features, like user mean accuracy, depend on accumulated history, which is a problem when predicting for a new user with no prior activity; a model that constantly updates itself with new information about the user would be better at predicting their answers.
Although our best model has a 76% accuracy rate, there is always room for improvement. First, data like the timestamp and prior question elapsed time were ignored entirely. Looking at other models online, there were creative uses of timestamps, like the 'good week' or 'mood' of a user at a given time. Furthermore, sophisticated features like a 'learning rate', which measures a user's ability to get the next attempt correct given that their previous attempt failed, would have been useful in predicting the user's answer.
Second, there are more sophisticated machine learning models that can learn more complicated patterns. Models like transformers, which also happen to be the most used models online, would be very useful for weighing the different inputs in the data and recognizing patterns that are hard to see in EDA. This implies that predicting a user's answer is much harder with our current models without tapping into future data, and that more variables contribute to a user's ability to answer correctly than simple statistical metrics like their accuracy rate or the difficulty of the question. This lines up nicely with the reality of cognitive science, since the learning mechanism of a human being is still largely a mystery. Understanding these mechanisms could help design machine learning algorithms that better predict users' learning patterns.