Introduction to Recommender Systems
A recommender system is a subclass of information filtering system that seeks to predict the “rating” or “preference” that a user would give to an item. The two most common types of recommender systems are Content-Based and Collaborative Filtering (CF). Collaborative filtering produces recommendations based on knowledge of users’ attitudes towards items; in other words, it uses the behaviour of the crowd to recommend items. Content-based systems focus on the attributes of the items and give recommendations based on the similarity between them. In general, collaborative filtering is more commonly used than content-based filtering because it usually gives better results and is relatively easy to understand. Collaborative filtering can also perform feature learning on its own, meaning it can learn for itself which features to use.
Collaborative filtering can be divided into Memory-Based and Model-Based.
Case Study
In this case study, we will create a content-based recommender system for a dataset of movies, using Python and pandas. We will focus on a basic recommender that suggests the items most similar to a particular item, in this case movies. Note that similarity here is computed from user ratings rather than from item attributes such as genre, so the approach is arguably closer to item-based collaborative filtering than to a strictly content-based method.
1. Preparation
import numpy as np
import pandas as pd
column_names = ['user_id', 'item_id', 'rating', 'timestamp']
# u.data is tab-separated and has no header row, so the column names are supplied explicitly
df = pd.read_csv('u.data', sep='\t', names=column_names)
df.head()
movie_titles = pd.read_csv('Movie_Id_Titles')
movie_titles.head()
df = pd.merge(df, movie_titles,on='item_id')
df.head()
2. Visualisation
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style('white')
df.groupby('title')['rating'].mean().sort_values(ascending=False).head()
# movies with the highest average rating; because we grouped by title, a top average may come from only a few ratings
df.groupby('title')['rating'].count().sort_values(ascending=False).head()
# average rating per movie
ratings = pd.DataFrame(df.groupby('title')['rating'].mean())
# number of ratings per movie, so we can check that a high average is backed by many raters
ratings['num of ratings'] = df.groupby('title')['rating'].count()
ratings.head()
ratings['num of ratings'].hist(bins=70)
ratings['rating'].hist(bins=70)
# graphing the relationship between the average rating and the number of ratings
sns.jointplot(x = 'rating',y='num of ratings',data=ratings,alpha=0.5)
# the plot suggests that movies with more ratings tend to have a higher average rating
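As a side note, the per-title mean and count built earlier in this section can also be computed in one pass with groupby(...).agg. The sketch below uses a tiny made-up table (the titles and values are illustrative assumptions, not taken from the MovieLens files) and shows why the count matters: the top average belongs to a title with a single rating.

```python
import pandas as pd

# Made-up toy ratings, for illustration only
toy = pd.DataFrame({
    'title':  ['A', 'A', 'A', 'B', 'C', 'C'],
    'rating': [5, 4, 3, 5, 2, 4],
})

# Mean and count per title in one call, renamed to match the column names used above
toy_ratings = (
    toy.groupby('title')['rating']
       .agg(['mean', 'count'])
       .rename(columns={'mean': 'rating', 'count': 'num of ratings'})
)

# Sorted by average rating, 'B' comes first although only one person rated it
print(toy_ratings.sort_values('rating', ascending=False))
```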
3. Content-based Movie Recommender System
First we create a matrix with the user ids on one axis and the movie titles on the other. Each cell holds the rating that the user gave to that movie. Note that there will be a lot of NaN values, because most people have not rated most of the movies.
moviemat = df.pivot_table(index='user_id',columns = 'title', values = 'rating')
moviemat.head()
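To make the shape of this matrix concrete, here is a minimal sketch on a made-up table of four users and three titles (toy values, not MovieLens data). User/movie pairs without a rating come out as NaN; `sparsity` is just a hypothetical name for the share of empty cells.

```python
import pandas as pd

# Made-up toy ratings: users 3 and 4 never rated title 'C'
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 2, 3, 3, 4, 4],
    'title':   ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'A', 'B'],
    'rating':  [5, 4, 1, 3, 3, 2, 4, 4, 1, 2],
})

# One row per user, one column per title, ratings as values
toy_mat = toy.pivot_table(index='user_id', columns='title', values='rating')
print(toy_mat)

# Share of cells with no rating
sparsity = toy_mat.isna().to_numpy().mean()
print(f'share of missing cells: {sparsity:.2f}')
```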
ratings.sort_values('num of ratings',ascending=False).head(10)
starwars_user_ratings = moviemat['Star Wars (1977)']
liarliar_user_ratings = moviemat['Liar Liar (1997)']
similar_to_starwars = moviemat.corrwith(starwars_user_ratings) # correlation of other movies with starwars
similar_to_liarliar = moviemat.corrwith(liarliar_user_ratings) # correlation of other movies with liarliar
corr_starwars = pd.DataFrame(similar_to_starwars,columns=['Correlation'])
corr_starwars.dropna(inplace=True)
corr_starwars.head() # how strongly each movie's ratings correlate with those of Star Wars
corr_starwars.sort_values('Correlation',ascending=False).head(10)
# not all of these results are meaningful: some movies look highly correlated only because one or two users rated both
# we can fix this by setting a threshold on the number of ratings and removing films that fall below it
# let's keep only movies with more than 100 ratings
corr_starwars = corr_starwars.join(ratings['num of ratings'])
corr_starwars.head()
corr_starwars[corr_starwars['num of ratings']>100].sort_values('Correlation',ascending=False).head()
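The reason for the threshold can be reproduced on a small scale. In the sketch below (made-up data; `overlap` is a hypothetical helper column, not something the steps above compute), the correlation for each pair of movies is computed only over users who rated both, so a title shared by just two users can produce a perfect-looking score. Counting the overlap is a slightly different diagnostic from filtering on the total number of ratings, but it exposes the same problem.

```python
import pandas as pd

# Made-up toy ratings: only users 1 and 2 rated title 'C'
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 2, 3, 3, 4, 4],
    'title':   ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'A', 'B'],
    'rating':  [5, 4, 1, 3, 3, 2, 4, 4, 1, 2],
})
toy_mat = toy.pivot_table(index='user_id', columns='title', values='rating')

# Correlation of every title with 'A'; rows with a NaN in either column are skipped pairwise
toy_corr = toy_mat.corrwith(toy_mat['A'])

# Number of users who rated both 'A' and each title
rated = toy_mat.notna().astype(int)
overlap = rated.mul(rated['A'], axis=0).sum()

# 'C' shows an extreme correlation that rests on only two shared raters
print(pd.DataFrame({'Correlation': toy_corr, 'overlap': overlap}))
```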
# now repeat the above steps for liarliar
corr_liarliar = pd.DataFrame(similar_to_liarliar,columns=['Correlation'])
corr_liarliar.dropna(inplace=True)
corr_liarliar = corr_liarliar.join(ratings['num of ratings'])
corr_liarliar[corr_liarliar['num of ratings']>100].sort_values('Correlation',ascending=False).head()
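Since the same steps were repeated for both movies, they could be wrapped in a small helper. The following is only a sketch: `similar_movies`, `min_ratings` and `top_n` are hypothetical names, and unlike the steps above it also drops the query movie from its own results. It is demonstrated on made-up data so that it runs on its own.

```python
import pandas as pd

def similar_movies(moviemat, ratings, title, min_ratings=100, top_n=5):
    """Rank movies by rating correlation with `title`, keeping only well-rated ones."""
    # Correlate every movie's ratings with the chosen movie's ratings
    corr = moviemat.corrwith(moviemat[title]).dropna().to_frame('Correlation')
    # Attach the number of ratings and apply the threshold
    corr = corr.join(ratings['num of ratings'])
    corr = corr[corr['num of ratings'] > min_ratings]
    # Assumption: exclude the query movie itself from the recommendations
    corr = corr.drop(index=title, errors='ignore')
    return corr.sort_values('Correlation', ascending=False).head(top_n)

# Made-up toy data standing in for df, moviemat and ratings
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 2, 3, 3, 4, 4],
    'title':   ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'A', 'B'],
    'rating':  [5, 4, 1, 3, 3, 2, 4, 4, 1, 2],
})
toy_mat = toy.pivot_table(index='user_id', columns='title', values='rating')
toy_ratings = (
    toy.groupby('title')['rating']
       .agg(['mean', 'count'])
       .rename(columns={'mean': 'rating', 'count': 'num of ratings'})
)

# With a threshold of 2, title 'C' (only two ratings) is filtered out
print(similar_movies(toy_mat, toy_ratings, 'A', min_ratings=2))
```

With the objects built in this case study, a call such as similar_movies(moviemat, ratings, 'Star Wars (1977)') should give the same ranking as the filtered table above, minus the Star Wars row itself.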