Recommender Systems

Introduction to Recommender Systems

A recommender system is a subclass of information filtering system that seeks to predict the “rating” or “preference” that a user would give to an item. The two most common types of recommender systems are Content-Based and Collaborative Filtering (CF). Collaborative filtering produces recommendations from knowledge of users’ attitudes towards items; that is, it uses the behaviour of the crowd to recommend items. Content-based systems focus on the attributes of the items and give you recommendations based on the similarity between them. In general, collaborative filtering is more commonly used than content-based filtering because it usually gives better results and is relatively easy to understand. CF can also do feature learning on its own, which means that it can learn for itself which features to use.

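Collaborative filtering can be divided into Memory-Based and Model-Based approaches. Memory-based methods compute similarities between users or items directly from the ratings, while model-based methods fit a predictive model (for example, matrix factorisation) to the ratings.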
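
To make the distinction concrete, here is a minimal sketch on made-up data. The movie names, genre flags and ratings are invented for illustration, and the helper cosine_similarity is a hypothetical name. Content-based similarity compares item attributes, whereas memory-based collaborative filtering compares how the same users rated two items.

```python
import numpy as np
import pandas as pd

# Hypothetical item attributes (content): one row per movie, one column per genre flag.
genres = pd.DataFrame(
    {"action": [1, 1, 0], "comedy": [0, 0, 1], "scifi": [1, 1, 0]},
    index=["Movie A", "Movie B", "Movie C"],
)

# Hypothetical user ratings (behaviour): rows are users, columns are movies.
user_ratings = pd.DataFrame(
    {"Movie A": [5, 4, 1, 2], "Movie B": [4, 5, 2, 1], "Movie C": [1, 2, 5, 4]},
    index=["u1", "u2", "u3", "u4"],
)


def cosine_similarity(a, b):
    """Cosine of the angle between two numeric vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Content-based: compare the attribute vectors of two movies.
content_sim = cosine_similarity(genres.loc["Movie A"], genres.loc["Movie B"])

# Collaborative (memory-based, item-item): compare how the same users rated two movies.
collab_sim_ab = user_ratings["Movie A"].corr(user_ratings["Movie B"])
collab_sim_ac = user_ratings["Movie A"].corr(user_ratings["Movie C"])

print(content_sim, collab_sim_ab, collab_sim_ac)
```

In this toy data, Movie A and Movie B share the same genres and are rated alike by the same users, so both views agree; on real data the two kinds of similarity can differ considerably.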

Case Study

In this case study, we will create a recommender system for a dataset of movies using Python and pandas. We will focus on a basic recommendation system that suggests the items, in this case movies, that are most similar to a particular item.

1. Preparation

In [1]:
import numpy as np
import pandas as pd
In [34]:
column_names = ['user_id', 'item_id', 'rating', 'timestamp']
In [35]:
df = pd.read_csv('u.data',sep='\t',names = column_names)
In [36]:
df.head()
Out[36]:
user_id item_id rating timestamp
0 0 50 5 881250949
1 0 172 5 881250949
2 0 133 1 881250949
3 196 242 3 881250949
4 186 302 3 891717742
In [37]:
movie_titles = pd.read_csv('Movie_Id_Titles')
In [38]:
movie_titles.head()
Out[38]:
item_id title
0 1 Toy Story (1995)
1 2 GoldenEye (1995)
2 3 Four Rooms (1995)
3 4 Get Shorty (1995)
4 5 Copycat (1995)
In [39]:
df = pd.merge(df, movie_titles,on='item_id')
In [40]:
df.head()
Out[40]:
user_id item_id rating timestamp title
0 0 50 5 881250949 Star Wars (1977)
1 290 50 5 880473582 Star Wars (1977)
2 79 50 4 891271545 Star Wars (1977)
3 2 50 5 888552084 Star Wars (1977)
4 8 50 5 879362124 Star Wars (1977)
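
By default, pd.merge performs an inner join, so any rating whose item_id has no matching title is dropped without warning. The sketch below, on invented toy tables, shows one way to check the join; the names toy_ratings, toy_titles and unmatched are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for the ratings and title tables (values are invented).
toy_ratings = pd.DataFrame(
    {"user_id": [1, 2, 3], "item_id": [50, 50, 999], "rating": [5, 4, 3]}
)
toy_titles = pd.DataFrame(
    {"item_id": [50, 172], "title": ["Star Wars (1977)", "Empire Strikes Back, The (1980)"]}
)

# how="left" keeps every rating even if its title is missing;
# validate="many_to_one" raises if an item_id appears more than once in toy_titles;
# indicator=True adds a "_merge" column recording where each row came from.
merged = pd.merge(
    toy_ratings,
    toy_titles,
    on="item_id",
    how="left",
    validate="many_to_one",
    indicator=True,
)

# Ratings whose item_id has no matching title.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched)
```

If unmatched is empty on the real data, the plain inner join used above loses nothing.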

2. Visualisation

In [41]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [42]:
sns.set_style('white')
In [45]:
df.groupby('title')['rating'].mean().sort_values(ascending=False).head()
# movies with the highest average rating; note that an average may come from only a few ratings
Out[45]:
title
Marlene Dietrich: Shadow and Light (1996)     5.0
Prefontaine (1997)                            5.0
Santa with Muscles (1996)                     5.0
Star Kid (1997)                               5.0
Someone Else's America (1995)                 5.0
Name: rating, dtype: float64
In [46]:
df.groupby('title')['rating'].count().sort_values(ascending=False).head()
Out[46]:
title
Star Wars (1977)             584
Contact (1997)               509
Fargo (1996)                 508
Return of the Jedi (1983)    507
Liar Liar (1997)             485
Name: rating, dtype: int64
In [47]:
ratings = pd.DataFrame(df.groupby('title')['rating'].mean())
In [49]:
ratings['num of ratings'] = df.groupby('title')['rating'].count()
In [51]:
ratings.head()  # the rating count lets us check whether a high average rating is backed by many viewers
Out[51]:
rating num of ratings
title
‘Til There Was You (1997) 2.333333 9
1-900 (1994) 2.600000 5
101 Dalmatians (1996) 2.908257 109
12 Angry Men (1957) 4.344000 125
187 (1997) 3.024390 41
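
The same summary table can be built in one step, and the rating count can then be used to screen out averages that rest on very few ratings. This is a sketch on invented data; the threshold min_ratings is an arbitrary illustrative value, not one taken from the notebook.

```python
import pandas as pd

# Invented ratings: one film with a single 5-star rating, one with several ratings.
toy = pd.DataFrame(
    {
        "title": ["Obscure Film", "Popular Film", "Popular Film", "Popular Film", "Popular Film"],
        "rating": [5, 4, 5, 3, 4],
    }
)

# Mean rating and number of ratings per title, computed in a single groupby.
summary = (
    toy.groupby("title")["rating"]
    .agg(["mean", "count"])
    .rename(columns={"mean": "rating", "count": "num of ratings"})
)

# Keep only titles with enough ratings for the average to be meaningful.
min_ratings = 3  # hypothetical threshold
reliable = summary[summary["num of ratings"] >= min_ratings]
top = reliable.sort_values("rating", ascending=False)
print(top)
```

Here the film with a single 5-star rating has the highest average but is excluded, which is the effect the rating count is meant to expose.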
In [52]:
ratings['num of ratings'].hist(bins=70)
Out[52]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a18c067b8>
In [53]:
ratings['rating'].hist(bins=70)
Out[53]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a18bf2e10>
In [55]:
# plot the relationship between the average rating and the number of ratings
sns.jointplot(x = 'rating',y='num of ratings',data=ratings,alpha=0.5)
# the plot suggests that movies with more ratings tend to have higher average ratings
Out[55]:
<seaborn.axisgrid.JointGrid at 0x1a21165390>

3. Content-based Movie Recommender System

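First, we create a matrix with the user IDs on one axis and the movie titles on the other. Each cell holds the rating that the user gave to that movie. There will be a lot of NaN values, because most people have not seen most of the movies. Note that the similarities computed below come from these user ratings rather than from attributes of the movies, so the method is, strictly speaking, a form of item-based collaborative filtering.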

In [56]:
moviemat = df.pivot_table(index='user_id',columns = 'title', values = 'rating')
In [57]:
moviemat.head()
Out[57]:
title ‘Til There Was You (1997) 1-900 (1994) 101 Dalmatians (1996) 12 Angry Men (1957) 187 (1997) 2 Days in the Valley (1996) 20,000 Leagues Under the Sea (1954) 2001: A Space Odyssey (1968) 3 Ninjas: High Noon At Mega Mountain (1998) 39 Steps, The (1935) Yankee Zulu (1994) Year of the Horse (1997) You So Crazy (1994) Young Frankenstein (1974) Young Guns (1988) Young Guns II (1990) Young Poisoner’s Handbook, The (1995) Zeus and Roxanne (1997) unknown Á köldum klaka (Cold Fever) (1994)
user_id
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN 2.0 5.0 NaN NaN 3.0 4.0 NaN NaN NaN NaN NaN 5.0 3.0 NaN NaN NaN 4.0 NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN 2.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 1664 columns
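
A small sketch on invented data shows what pivot_table produces and how sparse the result is. The names toy, toy_mat and sparsity are illustrative. Note that pivot_table aggregates with the mean by default, so a user who rated the same title twice would get the average of the two ratings.

```python
import pandas as pd

# Invented long-format ratings: one row per (user, movie) pair.
toy = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 3],
        "title": ["Film A", "Film B", "Film A", "Film C"],
        "rating": [5, 3, 4, 2],
    }
)

# Wide user-by-movie matrix; pairs with no rating become NaN.
toy_mat = toy.pivot_table(index="user_id", columns="title", values="rating")

# Fraction of cells that are empty.
sparsity = toy_mat.isna().to_numpy().mean()

print(toy_mat)
print(f"sparsity: {sparsity:.2f}")
```

Running the same sparsity calculation on moviemat gives a quick sense of how few user-movie pairs actually carry a rating.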

In [58]:
ratings.sort_values('num of ratings',ascending=False).head(10)
Out[58]:
rating num of ratings
title
Star Wars (1977) 4.359589 584
Contact (1997) 3.803536 509
Fargo (1996) 4.155512 508
Return of the Jedi (1983) 4.007890 507
Liar Liar (1997) 3.156701 485
English Patient, The (1996) 3.656965 481
Scream (1996) 3.441423 478
Toy Story (1995) 3.878319 452
Air Force One (1997) 3.631090 431
Independence Day (ID4) (1996) 3.438228 429
In [59]:
starwars_user_ratings = moviemat['Star Wars (1977)']
liarliar_user_ratings = moviemat['Liar Liar (1997)']
In [62]:
similar_to_starwars = moviemat.corrwith(starwars_user_ratings) # correlation of other movies with starwars
similar_to_liarliar = moviemat.corrwith(liarliar_user_ratings) # correlation of other movies with liarliar
/Users/ryanong/anaconda3/lib/python3.6/site-packages/numpy/lib/function_base.py:3154: RuntimeWarning: Degrees of freedom <= 0 for slice
  c = cov(x, y, rowvar)
/Users/ryanong/anaconda3/lib/python3.6/site-packages/numpy/lib/function_base.py:3088: RuntimeWarning: divide by zero encountered in double_scalars
  c *= 1. / np.float64(fact)
In [63]:
corr_starwars = pd.DataFrame(similar_to_starwars,columns=['Correlation'])
corr_starwars.dropna(inplace=True)
In [65]:
corr_starwars.head()  # how strongly each movie's ratings correlate with those of Star Wars
Out[65]:
Correlation
title
‘Til There Was You (1997) 0.872872
1-900 (1994) -0.645497
101 Dalmatians (1996) 0.211132
12 Angry Men (1957) 0.184289
187 (1997) 0.027398
In [68]:
corr_starwars.sort_values('Correlation',ascending=False).head(10)
# not all of these results make sense: the movies below show perfect correlations, but each may rest on only one or two shared ratings
# we can fix this by setting a minimum number of ratings and removing films that fall below this threshold
Out[68]:
Correlation
title
Hollow Reed (1996) 1.0
Stripes (1981) 1.0
Beans of Egypt, Maine, The (1994) 1.0
Safe Passage (1994) 1.0
Old Lady Who Walked in the Sea, The (Vieille qui marchait dans la mer, La) (1991) 1.0
Outlaw, The (1943) 1.0
Line King: Al Hirschfeld, The (1996) 1.0
Hurricane Streets (1998) 1.0
Good Man in Africa, A (1994) 1.0
Scarlet Letter, The (1926) 1.0
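
Perfect correlations like these typically arise when only a couple of users rated both movies, since any two points with differing values lie on a straight line. The sketch below reproduces the effect on invented data and shows an alternative guard that is not used in this notebook: the min_periods argument of Series.corr, which requires a minimum number of users who rated both movies. The names toy_mat, naive, robust and min_overlap are illustrative.

```python
import numpy as np
import pandas as pd

nan = np.nan

# Invented user-by-movie matrix: "Tiny Overlap" shares only two raters with "Target".
toy_mat = pd.DataFrame(
    {
        "Target": [5, 4, 2, 1, 3],
        "Well Rated Overlap": [4, 5, 1, 2, 3],
        "Tiny Overlap": [nan, nan, 1, nan, 2],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="user_id"),
)
target = toy_mat["Target"]

# Default behaviour: two shared raters are enough to yield a "perfect" correlation.
naive = toy_mat.corrwith(target)

# Require at least min_overlap shared raters; otherwise the correlation is NaN.
min_overlap = 3  # hypothetical threshold
robust = toy_mat.apply(lambda col: col.corr(target, min_periods=min_overlap))

print(naive)
print(robust)
```

This guard counts shared raters for each pair of movies, whereas the threshold applied in the following cells counts all ratings of a movie; the two are related but not equivalent.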
In [69]:
# let's keep only movies with more than 100 ratings; first, attach the rating counts
corr_starwars = corr_starwars.join(ratings['num of ratings'])
In [70]:
corr_starwars.head()
Out[70]:
Correlation num of ratings
title
‘Til There Was You (1997) 0.872872 9
1-900 (1994) -0.645497 5
101 Dalmatians (1996) 0.211132 109
12 Angry Men (1957) 0.184289 125
187 (1997) 0.027398 41
In [73]:
corr_starwars[corr_starwars['num of ratings']>100].sort_values('Correlation',ascending=False).head()
Out[73]:
Correlation num of ratings
title
Star Wars (1977) 1.000000 584
Empire Strikes Back, The (1980) 0.748353 368
Return of the Jedi (1983) 0.672556 507
Raiders of the Lost Ark (1981) 0.536117 420
Austin Powers: International Man of Mystery (1997) 0.377433 130
In [74]:
# now repeat the above steps for liarliar

corr_liarliar = pd.DataFrame(similar_to_liarliar,columns=['Correlation'])
corr_liarliar.dropna(inplace=True)
corr_liarliar = corr_liarliar.join(ratings['num of ratings'])
corr_liarliar[corr_liarliar['num of ratings']>100].sort_values('Correlation',ascending=False).head()
Out[74]:
Correlation num of ratings
title
Liar Liar (1997) 1.000000 485
Batman Forever (1995) 0.516968 114
Mask, The (1994) 0.484650 129
Down Periscope (1996) 0.472681 101
Con Air (1997) 0.469828 137
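
Since the same steps were repeated for both movies, they can be wrapped in a helper. The following is a sketch rather than part of the original notebook: the function name similar_movies and its parameters min_ratings and top_n are hypothetical, and unlike the cells above it drops the query movie from its own results. It is demonstrated on invented toy data; on the real data it would be called with moviemat and ratings.

```python
import pandas as pd


def similar_movies(moviemat, ratings, title, min_ratings=100, top_n=5):
    """Rank movies by rating correlation with `title`, keeping well-rated ones."""
    # Correlate every movie's column of user ratings with the query movie's column.
    corr = moviemat.corrwith(moviemat[title]).rename("Correlation").to_frame()
    # Drop movies with no computable correlation and attach the rating counts.
    corr = corr.dropna().join(ratings["num of ratings"])
    # Keep only movies with more than `min_ratings` ratings.
    popular = corr[corr["num of ratings"] > min_ratings]
    # Remove the query movie itself, which trivially correlates at 1.0.
    popular = popular.drop(index=title, errors="ignore")
    return popular.sort_values("Correlation", ascending=False).head(top_n)


# Invented user-by-movie matrix and matching rating counts.
toy_mat = pd.DataFrame(
    {"A": [5, 4, 2, 1, 3], "B": [4, 5, 1, 2, 3], "C": [1, 2, 4, 5, 3]},
    index=pd.Index([1, 2, 3, 4, 5], name="user_id"),
)
toy_counts = pd.DataFrame({"num of ratings": toy_mat.count()})

result = similar_movies(toy_mat, toy_counts, "A", min_ratings=3, top_n=2)
print(result)
```

With the notebook's data, a call such as similar_movies(moviemat, ratings, 'Star Wars (1977)') would be expected to return the same ranking as the cell above, minus the query movie itself.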
