Project Leads: Iris Derry, Renee Li
Team Members: Davis Bartels, Davis Malmer, Dhanya Narayanan, Ian Buchanan, Ian Fischer, James Kim, Jiamin Yang, Linkai Li, Marisol Garrouste, Matthew Bellick, Melodie Jin, Nathan Martin, Radi Akbar, Xin Hao Ling, Xinqian Zhai, Zach Fewkes, Jayachandra Korrapati
About this project
To begin, many teams removed the “Lecture” entries from the dataset. These entries did not provide insight into user performance on questions, only whether a user had watched instructional material. A deeper model might be able to utilize these entries, but for our purposes they were not needed. The rest of the data was largely complete, i.e. there were few N/A values that needed to be removed.
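A minimal sketch of that filtering step in pandas. The column names are assumptions about the competition schema (e.g. `content_type_id == 1` marking a lecture row), and the miniature data frame is made up:

```python
import pandas as pd

# Miniature stand-in for train.csv; content_type_id == 1 marks a
# lecture row (an assumption about the schema), 0 marks a question.
train = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "content_id": [10, 99, 10, 11],
    "content_type_id": [0, 1, 0, 0],
    "answered_correctly": [1, -1, 0, 1],
})

# Drop lecture entries: they say nothing about question performance.
questions_only = train[train["content_type_id"] == 0].reset_index(drop=True)

# Drop the few remaining rows with missing values, if any.
questions_only = questions_only.dropna()
```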
The next step was to join our multiple data frames. We had our largest set of data in “train.csv” and two smaller sets called “questions.csv” and “lectures.csv”, which contained metadata about the questions and lectures, respectively. We were interested in some of the data in the “questions.csv” file for feature creation, like what part of the exam a question was from. So we employed Pandas join methods to take the features we wanted from the “questions.csv” file and add them to our train file.
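A sketch of the join, with hypothetical column names (a `content_id` in the training data matching a `question_id` in the question metadata, and `part` giving the TOEIC exam part):

```python
import pandas as pd

# Hypothetical slices of train.csv and questions.csv.
train = pd.DataFrame({"content_id": [10, 11, 10],
                      "answered_correctly": [1, 0, 1]})
questions = pd.DataFrame({"question_id": [10, 11],
                          "part": [1, 5]})  # TOEIC exam part

# Left-join the question metadata onto the training rows.
train = train.merge(questions, left_on="content_id",
                    right_on="question_id", how="left")
```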
Lastly, depending on the model a team worked with, the data was standardized or normalized. Some teams chose decision-tree-based models, which did not require any such manipulation, but teams using distance-based classifiers benefited from standardization.
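As a sketch, standardization (z-scoring) can be done directly in pandas; the feature names and values here are hypothetical:

```python
import pandas as pd

# Hypothetical numeric features; tree-based models can consume them
# as-is, but distance-based classifiers benefit from z-scoring.
features = pd.DataFrame({"elapsed_time": [1000.0, 2000.0, 3000.0],
                         "attempts": [1.0, 2.0, 6.0]})

# Standardize: zero mean, unit (sample) standard deviation per column.
standardized = (features - features.mean()) / features.std()
```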
Figure 1. Distribution of % questions correctly answered, grouped by how long users have been using the app.
We did some exploratory data analysis to get to know our data, particularly since we had some background intuition about educational apps. For example, we expected users new to the app to perform worse than users who had been on the app for a while, which we confirmed using the visualizations in Figure 1 above. We used similar methods to explore other features, which helped inform feature selection and engineering in later steps.
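A sketch of the kind of grouping behind Figure 1, on made-up data with a hypothetical `days_on_app` tenure column:

```python
import pandas as pd

# Made-up data; days_on_app is a hypothetical tenure feature.
df = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3],
    "days_on_app": [1, 1, 30, 30, 90, 90],
    "answered_correctly": [0, 1, 1, 1, 1, 1],
})

# Bucket users by how long they have used the app, then compare
# mean accuracy across the buckets.
tenure = pd.cut(df["days_on_app"], bins=[0, 7, 60, 365],
                labels=["new", "regular", "veteran"])
accuracy_by_tenure = df.groupby(tenure, observed=True)["answered_correctly"].mean()
```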
Each team was free to choose their classification model and work on it as best they could.
Logistic Regression: Logistic regression measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a logistic function.
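As a sketch of the idea, the logistic function turns a weighted sum of features into a probability; the coefficients below are invented for illustration, not fitted values from our models:

```python
import math

def logistic(z):
    # The logistic (sigmoid) function maps any real score into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients (not fitted values): an intercept plus a
# weight on the user's historical accuracy.
b0, b1 = -2.0, 4.0
user_mean_accuracy = 0.75

# Estimated probability that the user answers the next question correctly.
p_correct = logistic(b0 + b1 * user_mean_accuracy)
```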
Random Forest Classification: Random decision forests construct a multitude of decision trees at training time and output the class that is the mode of the individual trees’ predictions. Random forests are an ensemble method: ensemble algorithms combine more than one algorithm of the same or different kind for classifying objects, the idea being that many weak trees, when merged, create a stronger, more accurate classifier.
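A minimal illustration with scikit-learn; the toy features and labels below are invented, not our actual training data:

```python
from sklearn.ensemble import RandomForestClassifier

# Invented toy features: [user mean accuracy, question mean accuracy].
X = [[0.90, 0.80], [0.85, 0.90], [0.20, 0.30], [0.10, 0.20]]
y = [1, 1, 0, 0]  # 1 = answered correctly

# Each of the 50 trees votes; the forest predicts the mode of the votes.
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X, y)
pred = forest.predict([[0.88, 0.85], [0.15, 0.25]])
```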
LGBM (Light Gradient Boosting Machine) is also a tree-based learning algorithm. LightGBM uses histogram-based algorithms, which split continuous feature values into discrete bins; this speeds up training and reduces memory usage.
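The histogram trick can be illustrated with NumPy: continuous values are bucketed into a small fixed number of bins, so the search for split points runs over bin edges rather than every distinct raw value. The data here is made up:

```python
import numpy as np

# Made-up continuous feature values.
values = np.array([0.1, 0.4, 0.35, 0.8, 0.95, 0.5])

# Histogram-based splitting: discretize values into a small number of
# bins, so candidate split points are bin edges rather than every
# distinct raw value.
n_bins = 4
edges = np.linspace(values.min(), values.max(), n_bins + 1)
bin_ids = np.clip(np.digitize(values, edges) - 1, 0, n_bins - 1)
```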
Results
After each team submitted its model, the competition ran a simulation to test it. The resulting accuracy rates were:
LGBM: 76%
1st Logistic Regression: 73%
Random Forest Classification: 72%
2nd Logistic Regression: 61%
Although the models differ in many aspects, most of them share the same features: mean accuracy per TOEIC part, mean accuracy per question, mean accuracy of the user, and whether the user saw the explanation for the prior question. Thus, the main difference between the models is their type. From the results, LGBM has the highest accuracy rate of all the models, which leads us to conclude that decision-tree models may be better suited to this problem than weighted-probability models like logistic regression. Furthermore, some of the features, like user mean accuracy, use future data, which is a problem when predicting for a new user with no prior history; a model that constantly updates itself with new information about the user would also be better at predicting their answers.
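As a sketch, the shared accuracy features can be computed with pandas `groupby(...).transform`, which broadcasts each group's mean back onto its rows; the miniature data frame is hypothetical:

```python
import pandas as pd

# Hypothetical miniature of the training data.
df = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "content_id": [10, 11, 10, 11],
    "answered_correctly": [1, 0, 1, 1],
})

# Broadcast per-question and per-user mean accuracy back onto each row.
grp_q = df.groupby("content_id")["answered_correctly"]
grp_u = df.groupby("user_id")["answered_correctly"]
df["question_mean_acc"] = grp_q.transform("mean")
df["user_mean_acc"] = grp_u.transform("mean")
```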
Although our best model has a 76% accuracy rate, there is always room for improvement. First, data like the timestamp and prior question elapsed time were ignored entirely. Other models posted online made creative use of timestamps, deriving features like the ‘good week’ or ‘mood’ of the user at that time. Furthermore, sophisticated features like ‘learning rates’, which measure a user’s ability to get the next attempt correct given that their previous attempt failed, would have been useful in predicting the user’s answer.
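A hedged sketch of such a ‘learning rate’ feature for a single user, on made-up attempt data:

```python
import pandas as pd

# Made-up attempt history for one user, in chronological order.
history = pd.DataFrame({"answered_correctly": [0, 1, 0, 0, 1]})

# 'Learning rate': P(correct on this attempt | previous attempt failed).
prev = history["answered_correctly"].shift(1)
after_miss = history.loc[prev == 0, "answered_correctly"]
learning_rate = after_miss.mean()
```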
Second, there are more sophisticated machine learning models that can learn more complicated patterns. Models like transformers, which also happen to be the most commonly used models online, would be very useful for weighing the different inputs of the data and recognizing patterns that are hard to visualize in EDA. This implies that predicting a user’s answer is much more difficult with our current models without tapping into future data, and that more variables contribute to a user’s ability to get an answer correct than simple statistical metrics like their accuracy rate or the difficulty of the question. This lines up nicely with the reality of cognitive science, since the learning mechanism of a human being is still largely a mystery. Understanding these mechanisms could potentially help design machine learning algorithms that better predict users’ learning patterns.