Project Leaders: Justin Paul, Sachchit Kunichetty
Team Members: (in alphabetical order by first name) Charlie Tran, Danny Colleran, Griffin Olson-Allen, Hanh Nguyen, Josh Zhang, Justin Xiao, Kaixin Zhao, Kyle Hoffmeyer, Rohan Gupta, Steve Fan, Ting-Ting Kao
Project Repository: https://github.com/MichiganDataScienceTeam/movie-recommendations
Visualization of the movie latent vectors during training of Asymmetric SVD model. Each dot corresponds to one movie. Notice the two clusters that arise, representing some non-trivial relationship of movies being learned.
Let's set the scene: you're an avid fan of superhero movies and drama films. You've already watched every Avengers movie and also Inception, those being some of your favorite movies. The next day, your favorite streaming service recommends you watch The Dark Knight - and you love it! It's the perfect combination of superhero action and drama that you craved.
Let's take a step back though - how did the streaming service know you would like The Dark Knight? Additionally, how can we create software to automatically perform such recommendations?
In its fundamental components, recommending movies involves
Understanding user preferences for all movies - for example, a 1-5 star rating for all movies
Choosing the movies that the user most prefers - for example, recommending movies that the user have given 5 stars
This will always produce the best recommendations - if the user would give a movie 5 stars, they will certainly enjoy watching that movie.
However, streaming services don't have access to user preferences for all movies - users have only watched a small subset of all movies, and so it becomes impossible to choose the movies that the user would like the most.
This is where data science comes into play - by studying movie rating data and analyzing trends in how other users watch movies, we could guess the ratings that a user will give movies, and then recommend movies which the user is predicted to give the highest rating to. This method of recommendation is called collaborative filtering.
To make these guesses, we studied two fundamental types of collaborative filtering approaches: a Nearest Neighbor approach and a Latent Factor approach.
To train and test the algorithms we analyzed on this project, we utilized the MovieLens 100k dataset, consisting of partial ratings for 9,000 movies by 600 users. Each rating is on the range 0.5 stars - 5 stars. On average, about 1.6% of movies were rated per user. All other ratings remain for us to predict using the different algorithmic approaches.
To prepare the data for application to collaborative filtering tasks, the ratings were arranged into a utility matrix where each row corresponds to a single user in the dataset, and each column corresponds to a movie in the dataset. Each entry represents the rating that the user for that row gave to the movie for that column. All missing ratings were given the default value 0.
The transposed utility matrix assembled from the MovieLens dataset. The image above is transposed, but by convention each row corresponds to a user and each column corresponds to a movie. The utility matrix is very sparse, with all missing ratings given a default value of 0. Our goal is predict all the missing ratings, allowing us to determine which movies the user would like to watch.
To gain additional data alongside MovieLens, we scraped IMDB to obtain the IMDB average rating, Suitable Audience, and Runtime of the movie. This was initially planned to be used to augment the algorithms below to provide more contextual information in our predictions. However due to timing constraints, this extension was not feasible and so this data was not used heavily.
Key: Users that have similar ratings patterns for the movies they have watched are likely to rate all other movies similarly. Use this relationship to guess the rating that a user would give for some movie.
For example, let us consider the case of Alice, Bob, and Carl,
Alice rated The Lord of the Rings as 5 stars and Carl rated The Lord of the Rings as 2 stars. It seems that Carl and Alice generally rate opposingly to each other. If Alice rates The Godfather as 1 star, we might predict that Carl rates The Godfather as 4 stars.
Bob rated Jurassic Park as 1 star and Carl rated Jurassic Park as 2 stars. It seems Bob and Carl tend to generally rate movies similarly. If Bob rates The Godfather as 5 stars, then we might predict that Carl rates The Godfather as 4 stars.
The relationship between movies and the associated ratings from each user can be represented in 610-D space (for each user). Then recommended movies are taken by comparing which movie vectors are closest to the ones that the user enjoyed. This can be done either by euclidean distance (which was used) or by angular distance (the cosine between the vectors). The movies with the smallest distance to the targeted movie is the one most similar from using nearest neighbors.
A major limitation of this method is that the dataset is sparse. There are multiple movies that either have no ratings or the exact same ratings (over 610 users) as other movies. If every movie was rated this would be a good thing and would make recommendations fairly obvious and easier to justify. However, as movies that users did not rate were simply given a rating of 0.0, this introduces unpredictability in recommending unpopular movies.
Another limitation was the inability to use content-based features (such as the genre, cast, or director) for movie recommendations. This was tested for predicting movie ratings and only achieved an accuracy of around 20%. There was not enough time to continue a content-based model nor enough time to fully incorporate some features into filtering for the final application.
For future use with this dataset, we believe that using word2vec in order to utilize the comments people gave for movies could be very helpful. Additionally, making use of a larger, less dense dataset would be useful to avoid general issues stemming from the long-tail problem.
Key: To predict ratings, learn some underlying latent factors that describe both users and movies, and match users to movies by comparing these factors. The strength of this match forms the rating for that user.
Latent factors are intuitively underlying properties that can describe both movies and users in the data. In this case, if some user/movie falls has some latent factor, the latent factor has a greater magnitude.
Asymmetric Singular Value Decomposition
With the notion of latent factors, the task of predicting movie ratings can be rewritten as the following: learn latent factors for movies and users so that the dot product of the user latent vector and the movie latent vector gives the rating that the user would give that movie.
Mathematically, this can be described as learning matrices Z, V that multiply with the utility matrix R to produce the predicted rating matrix as below:
This is the Asymmetric SVD formula for producing the utility matrix with all filled ratings with n users and m movies. Some important notes:
? values in the R matrix are the missing ratings that we would like to fill in
The vectors z in the second matrix are vectors describing how the user latent factors relate to the rating matrix. To get the user latent vectors, compute Rz
The vectors v in the third matrix are the latent vectors for each movie
This formulation, called asymmetric singular value decomposition differs from the standard latent factor approach to make the recommendation system model-based (independent of the number of users).
To learn Z and V , we train on reconstructing the utility matrix from before using Stochastic Gradient Descent (in practice we trained using the Adam optimizer). Batches are chosen randomly by selecting a random subset of users in the utility matrix for which ratings are reconstructed. The first 500 users form the training set, and the last 110 users form the testing set.
With the training settings
Latent Factors - 100
Loss - Mean Squared Error
Optimizer - AdamW w/ Weight Decay = 0.001
Learning Rate - 0.0001
Epochs = 10000
Batch Size = 128
a loss of 0.2 was reached on the test set, indicating that on average the predicted rating (in the range 0.5-5 stars) varied from the actual rating by 0.45 stars.
Loss curves on the training and testing set. The final testing set loss was about 0.2, indicating on average the predicted rating varied from the true rating by 0.44 stars. This indicates that our model has generally strong performance in predicted the correct rating.
The Asymmetric SVD latent factor algorithm makes realistic predictions, but struggles with certain genres.
Most highly rated movies for a randomly chosen user in dataset
Movies with highest predicted ratings for the randomly chosen user
Consider the example above. This user rates crime and drama movies such as Goodfellas (1990) and The Godfather (1972) highly. As a result, it should be expected that this user wants to watch more crime and drama movies. Indeed, the latent factor approach predicts predominantly crime and drama movies to have the highest ratings such as Fight Club (1999) and Reservoir Dogs (1992).
This model fails in that it tends to give more popular movies a greater rating such as Forrest Gump (1994), even though in terms of genre it is only loosely related to crime and genre. In other instances when a user highly rates movies from less popular genres like romance, this problem is exacerbated and the model tends to predict popular movies that do not correspond at all to the genre, as seen below.
Most highly rated movies for another randomly chosen user in dataset. Most of the movies here are romance and comedy movies.
Movies with highest predicted ratings for this randomly chosen user. Notice that very few of the movies recommended are romance and comedy movies. This is a fault of our model.
A subset of the team spent time designing a user interface to showcase both these algorithms and make recommendations. The user interface was developed using the Tkinter framework and interfaces both algorithms. In addition to making predictions, this UI allows users to filter out certain types of movies and target movies with certain restrictions (such as the year produced and the movie runtime). To view the code for this user interface, visit our GitHub repository: https://github.com/MichiganDataScienceTeam/movie-recommendations/tree/main/interface