Recommender System Using MovieLens

Overview

Have you ever wondered how Hulu or Netflix recommends new movies based on the watch history?

These suggestions or recommendations are made by a recommendation system. This model makes learns patterns from your watch history and uses those patterns to make new suggestions.

What are We Building?

We are building a movie recommendation system that takes in the user and movie ids and generates probable movie preferences for each user. We are using two types of filtering techniques for this purpose: collaboration and content. We will discuss them at a later stage.

Pre-requisites

These are the algorithms that one must know before building the project.

SVD(): Singular Value Decomposition (SVD) is a method from linear algebra that has been generally used as a dimensionality reduction technique in machine learning. For example, in recommendation systems, SVD() is used for matrix factorization. It finds factors of matrices from the factorization of a high-level (user-item-rating) matrix.
TF-IDF: TF-IDF, or term frequency-inverse document frequency, is a figure that expresses the statistical importance of any given word to the document collection as a whole. TF-IDF is calculated by multiplying term frequency and inverse document frequency.

How Are We Building This?

After playing around with the dataset and getting the gist of it, we are going to apply collaborative and content-filtering techniques to it.
While collaborative filtering, we are going to apply SVD() algorithm to the dataset to get our recommendations.
For getting our recommendations using content filtering, we are going to use TF-IDF.

Final Output

The final output will be a table that will contain the id of the user, the id of the movie and the corresponding movie name recommended by the model for the user.

Requirements

Python Libraries :-

Python.
Pandas.
scikit-learn.
Surprise.

Buliding the Classifier

Load the dataset

We choose the awesome Movielens dataset for the purpose of movie recommendation.

The Dataset at A Glance

movies.csv: It is a tabulated form of the description of the movies: title, tagline, and description. The table contains 21 distinct features and is meant to be used according to the model's needs.
ratings_small.csv: A table that records all the users’ rating behaviors and the ratings they provided. It also includes a time stamp when they posted the rates.
links.csv: A table that records each movie’s unique ID on two respective movie databases: IMDB and TMDB.

Code:-

Output:-

movielens dataset example

Analyze and Visualize the Dataset

Let's play around with the dataset and do some exploratory data analysis.

The total number of movies in the dataset is:-

Output:-

Let's classify the movies based on their genres:-

Output:-

dataset movies based on genres

Now, let's move on to see in which year how many movies were released:-

Output:-

dataset movies based on years

Finally, we are going to see what are the most popular movies in the dataset:-

Output:-

Now let's play around with the ratings_small dataset

index	userID	movieId	rating	timestamp
0	1	31	2.5	1260759144
1	1	1029	3.0	1260759179
2	1	1061	3.0	1260759182
3	1	1129	2.0	1260759185
4	1	1172	4.0	1260759205

As you can see, the dataset contains the userIDs, movieIDs, ratings and timestamps.

The rating distribution looks like this:-

Output:-

dataset movies based on ratings

The most active users are:-

Output:-

The most rated movies are:-

Output:-

Collaborative Filtering

This method helps the model to learn the connections/similarity between users so that it can generate the best recommendation options based on user's choices, preferences, or tastes.

Collaborative filtering signifies how much a user has preferred a movie in the past.

To implement this technique, we use the Python Library Surprise. It provides a set of built-in algorithms commonly used in recommendation system development.

Code Implementation:-

We are going to compare 5 different algorithms: SVD(), KNNBasic(), NMF(), CoClustering(), SlopeOne().

Let's plot the performance of each algorithm:-

Output:-

plot for svd knbasic algorithm

As you can see, SVD() completely outperforms the other four. Hence we will go with it.

Next, we will build a function that will get the top n predictions from a set of predictions.

Let's get the predictions now:-

Ouput:-

UserIDs with corresponding recommended movieids is obtained.

Content Filtering

Content filtering needs the profile of both the users and the items so that the system can determine the recommendation according to the users’ and items’ common properties.

The collaborative filtering technique is very easy to implement and works in a very intuitive way. However, it is susceptible to the "cold start" problem. This means whenever a new item or user is added to the model and the model has no way to either promote the item to the consumers or suggest to the user any available options. The model has no idea of a new user's preference unless the user starts rating.

If a user has horror and thriller movies as his genre, then collaborative filtering may suggest the user some comedy movies based on the preferences of the other users with preferences for crime and horror. Content filtering would overcome this drawback.

To implement a content-filtering recommendation system, we utilize the TFIDF to reflect the importance of each genre in any movie.
The sum product of the important weights and users’ preferences towards different genres in the user's profile is calculated. Based on this, we could simply sort movies and suggest the users the top N candidates as the recommendations.

Since we are only considering the feature genre here, we have to make our datasets accordingly.

Output:-

output dataset genre

Now, we remove the unnecessary columns:-

removing unnecessary column from dataset

Create the Ratings Dataset

Output:-

Now we calculate the sum product of the important weights and users’ preferences towards different genres in the user's profile.

Output:-

sum-product-output

Creating the model (TF/IDF)

Output:-

creating-model-output

Build the recommender:-

index	id	title
36332	200654	Raffles
24783	70912	The Advocate
23657	64318	Hide-Out
23797	64928	It All Came True
1333	2075	Prizzi's Honor
18469	44589	The Clan - Tale of the Frogs
42053	330399	Rajathandhiram
2129	4912	Confessions of a Dangerous Mind
39384	267428	Buttwhistle
21973	56651	Kicked in the Head

We got the userID, the movieID and the corresponding recommendation.

Conclusion

There we go! We got our movie recommendation machine. The key takeaways from this article are:

There are two types of filtering methods: collaborative and content.
Collaborative building is very easy to implement but does not take into account the relationship between the user and the item. Hence, it is susceptible to "cold start" errors.
Content filtering overcomes the drawbacks of collaborative filtering.