Creating a Content-Based Movie Recommender with Python

Introduction
Prerequisites
Setup
Step 1: Loading the Data
Step 2: Preprocessing the Data
Step 3: Creating the Movie Profile
Step 4: Computing Similarities
Step 5: Generating Recommendations
Conclusion

Introduction

In this tutorial, we will create a content-based movie recommender using Python. Content-based recommendation systems make personalized recommendations based on the user’s past preferences and the characteristics of the items. In our case, we will analyze the genre labels attached to each movie and recommend movies similar to one the user selects.

By the end of this tutorial, you will be able to build a content-based movie recommender that suggests similar movies based on the user’s input.

Prerequisites

To follow along with this tutorial, you should have a basic understanding of Python programming and have the following libraries installed:

NumPy
pandas
scikit-learn

You can install these libraries with pip by running `pip install numpy pandas scikit-learn`. The code in Step 3 calls `get_feature_names_out`, which requires scikit-learn 1.0 or later.

Setup

We will be using the MovieLens dataset, a popular movie rating dataset, to build our content-based movie recommender. Download the dataset from this link.

Once you have downloaded the dataset, extract the contents and place `movies.csv` and `ratings.csv` in the same directory as your Python script.

Step 1: Loading the Data

Let’s start by loading the MovieLens dataset into our Python script. We will use the pandas library to work with the data. Open your Python script and import the necessary libraries:

```python
import pandas as pd

# Load the movies and ratings datasets
movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')
```

Running this code loads the movies dataset and the ratings dataset into two separate DataFrames, `movies` and `ratings`.
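Before moving on, it can help to confirm that the loaded files contain the columns the later steps rely on. The sketch below uses two small in-memory CSV strings as stand-ins for `movies.csv` and `ratings.csv`. The sample rows and the `check_columns` helper are made up for illustration; the column names are the ones this tutorial's code refers to.

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for movies.csv and ratings.csv
MOVIES_CSV = """movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
"""
RATINGS_CSV = """userId,movieId,rating,timestamp
1,1,4.0,964982703
2,1,5.0,964982224
1,2,3.0,964981247
"""


def check_columns(df, required, name):
    """Raise a clear error if any required column is missing (illustrative helper)."""
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise ValueError(f"{name} is missing columns: {missing}")


sample_movies = pd.read_csv(io.StringIO(MOVIES_CSV))
sample_ratings = pd.read_csv(io.StringIO(RATINGS_CSV))

check_columns(sample_movies, ["movieId", "title", "genres"], "movies")
check_columns(sample_ratings, ["movieId", "rating"], "ratings")

print(sample_movies.shape, sample_ratings.shape)
```

The same helper can be pointed at the real `movies` and `ratings` DataFrames once they are loaded.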

Step 2: Preprocessing the Data

Before we can build the recommender, we need to preprocess the data and extract the relevant information. The recommender only needs the movie title and genres. We also compute each movie’s average rating here, although Steps 3 to 5 rely on genres alone.

```python
# Keep only the columns we need
movies = movies[['movieId', 'title', 'genres']]

# Merge with the ratings dataset
movie_ratings = pd.merge(movies, ratings, on='movieId')

# Calculate the average rating for each movie
average_ratings = movie_ratings.groupby(['movieId', 'title'])['rating'].mean().reset_index()
```

In the code above, we keep only the relevant columns of the movies dataset and merge it with the ratings dataset. Then we calculate the average rating for each movie using the `groupby` and `mean` functions.
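To see what the merge and the grouped mean produce, here is a minimal sketch on made-up data. The toy titles and ratings are hypothetical; only the pandas calls mirror the step above.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the tutorial
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Movie A", "Movie B", "Movie C"],
    "genres": ["Comedy", "Drama", "Action"],
})
toy_ratings = pd.DataFrame({
    "userId": [10, 11, 10],
    "movieId": [1, 1, 2],
    "rating": [4.0, 5.0, 3.0],
})

# pd.merge defaults to an inner join: Movie C has no ratings, so it drops out
merged = pd.merge(toy_movies, toy_ratings, on="movieId")

# One row per movie with its mean rating
toy_average = merged.groupby(["movieId", "title"])["rating"].mean().reset_index()
print(toy_average)
```

Note that unrated movies disappear from the merged result. Passing `how="left"` to `pd.merge` would keep them, with `NaN` in the rating column.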

Step 3: Creating the Movie Profile

To generate recommendations, we need to create a profile for each movie based on its genres. We will use the `TfidfVectorizer` class from the scikit-learn library to convert text documents into a matrix of TF-IDF features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a TF-IDF matrix of movie genres
vectorizer = TfidfVectorizer(stop_words='english')
genre_matrix = vectorizer.fit_transform(movies['genres'].fillna(''))

# Convert the TF-IDF matrix to a DataFrame
genre_df = pd.DataFrame(genre_matrix.toarray(), columns=vectorizer.get_feature_names_out())
```

In the code above, we import the `TfidfVectorizer` class from `scikit-learn` and create a TF-IDF matrix of movie genres. We also convert the matrix into a DataFrame for easier manipulation.

Step 4: Computing Similarities

To recommend similar movies, we need to compute the similarity between movies based on their genre profiles. We can use the cosine similarity measure to compare the genre profiles.

```python
from sklearn.metrics.pairwise import cosine_similarity

# Compute the cosine similarity matrix
similarity_matrix = cosine_similarity(genre_df, genre_df)
```

In the code above, we import the `cosine_similarity` function from `scikit-learn` and compute the cosine similarity matrix from the genre DataFrame. Entry `[i, j]` of the matrix holds the similarity between movie `i` and movie `j`.
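As a worked example of the measure itself, the sketch below computes the cosine similarity of two small vectors by hand and compares it with the scikit-learn result. The `profiles` array and the `cosine_by_hand` helper are hypothetical and exist only for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical genre profiles: rows are movies, columns are genre weights
profiles = np.array([
    [1.0, 1.0, 0.0],  # movie 0
    [1.0, 0.0, 0.0],  # movie 1
    [0.0, 0.0, 1.0],  # movie 2
])


def cosine_by_hand(a, b):
    """cos(a, b) = (a . b) / (||a|| * ||b||)"""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


by_hand = cosine_by_hand(profiles[0], profiles[1])
matrix = cosine_similarity(profiles)

print(round(by_hand, 4))  # 1 / sqrt(2), about 0.7071
print(matrix.shape)       # one row and one column per movie
```

Movies 0 and 2 share no genres, so their similarity is 0. Every movie has similarity 1 with itself, which is why the diagonal of the matrix is all ones.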

Step 5: Generating Recommendations

Now that we have the similarity matrix, we can generate movie recommendations for a given movie title. Let’s create a function that takes a movie title as input and returns the top N similar movies.

```python
def get_recommendations(movie_title, top_n=5):
    # Get the row index of the movie (raises IndexError if the title is not found)
    movie_index = movies[movies['title'] == movie_title].index[0]

    # Get the similarity scores for the movie index
    similarity_scores = similarity_matrix[movie_index]

    # Rank all movies by descending similarity, then drop the query movie itself
    ranked_indices = similarity_scores.argsort()[::-1]
    similar_indices = ranked_indices[ranked_indices != movie_index][:top_n]

    # Get the titles and similarity scores of similar movies
    similar_movies = movies.iloc[similar_indices][['title', 'genres']]
    similar_scores = similarity_scores[similar_indices]

    return similar_movies, similar_scores
```

The function takes a movie title as input and uses the similarity matrix to find the top N similar movies. It returns a DataFrame containing the titles and genres of the similar movies, as well as their similarity scores. Because movies with identical genre lists tie at the maximum score, the function removes the query movie by its index rather than assuming it sorts last.
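The exact-match lookup requires the full title, including the release year. A small search helper can make it easier to find the right string first. The sketch below is one possible approach; `find_titles` and the sample rows are hypothetical and not part of the tutorial's code.

```python
import pandas as pd


def find_titles(movies_df, query, limit=10):
    """Return up to `limit` titles containing `query`, ignoring case."""
    mask = movies_df["title"].str.contains(query, case=False, regex=False, na=False)
    return movies_df.loc[mask, "title"].head(limit).tolist()


# Hypothetical sample rows for demonstration
sample = pd.DataFrame({
    "title": ["Toy Story (1995)", "Toy Story 2 (1999)", "Heat (1995)"],
})

matches = find_titles(sample, "toy story")
print(matches)
```

Setting `regex=False` treats the query as plain text, so characters such as parentheses in a title are matched literally.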

To test our recommender, we can call the function with a movie title:

```python
movie_title = "Toy Story (1995)"
recommendations, scores = get_recommendations(movie_title)

print("Recommendations for", movie_title)
print(recommendations)
print("Similarity Scores")
print(scores)
```

Congratulations! You have successfully created a content-based movie recommender using Python. By following the steps in this tutorial, you can now generate recommendations based on movie titles.
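The full similarity matrix from Step 4 has one row and one column per movie, so its memory use grows with the square of the catalogue size. One possible alternative, sketched below on made-up data, keeps the TF-IDF matrix sparse and compares only the query row against all others at request time. The `recommend_on_demand` name and the sample catalogue are hypothetical.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical sample catalogue
catalogue = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "genres": ["Action|Adventure", "Action|Adventure", "Comedy|Romance", "Action|Comedy"],
})

tfidf = TfidfVectorizer()
genre_sparse = tfidf.fit_transform(catalogue["genres"])  # stays a SciPy sparse matrix


def recommend_on_demand(title, top_n=2):
    """Score one movie against the catalogue without building the full matrix."""
    matches = catalogue.index[catalogue["title"] == title]
    if len(matches) == 0:
        raise KeyError(f"Title not found: {title!r}")
    idx = matches[0]
    # Slice (not index) so the query stays two-dimensional: shape (1, n_features)
    scores = cosine_similarity(genre_sparse[idx:idx + 1], genre_sparse).ravel()
    # Sort by descending score and drop the query movie itself
    order = scores.argsort()[::-1]
    order = order[order != idx][:top_n]
    return catalogue.iloc[order].assign(score=scores[order])


result = recommend_on_demand("Movie A")
print(result)
```

This trades a little computation per request for a much smaller memory footprint, which may matter on the larger MovieLens releases.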

Conclusion

In this tutorial, you learned how to build a content-based movie recommender using Python. We used the MovieLens dataset, preprocessed the data, created movie profiles based on genres, computed similarities between movies, and generated recommendations based on user input.

Remember, content-based recommendation systems are just one approach to making personalized recommendations. There are other techniques like collaborative filtering and hybrid approaches that combine multiple methods for better results.

Keep exploring and experimenting to improve the recommender system or apply similar concepts to different domains. Happy coding!

Published: 24 August 2020