The previous post introduced the concept of Collaborative Filtering. Today we implement it in code.
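GroupLens provides movie-rating datasets of various sizes for testing. To keep the run time short, we pick the smallest one, ml-100k.zip. After downloading and unzipping it, the README file describes the purpose and fields of every data file in detail. This post uses two of them: u.item (pipe-delimited; the first two fields are the movie id and the movie title) and u.data (tab-delimited; the first three fields are the user id, the movie id, and the rating).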
The processing steps are as follows:
import pandas as pd

# Paths to the MovieLens 100k files (after unzipping ml-100k.zip)
input_data_file_movie = "./ml-100k/u.item"
input_data_file_rating = "./ml-100k/u.data"

# u.item is pipe-delimited; keep only the first two columns (movie id and title)
movie = pd.read_csv(input_data_file_movie, sep='|', encoding='ISO-8859-1', names=['movie_id', 'movie_title'], usecols=[0, 1])
# u.data is tab-delimited; keep the user id, movie id and rating columns
rating = pd.read_csv(input_data_file_rating, sep='\t', encoding='ISO-8859-1', names=["user_id", "movie_id", "rating"], usecols=[0, 1, 2])
print(movie.head())
print(rating.head())

# Merge movies and ratings on their shared movie_id column
data = pd.merge(movie, rating, on="movie_id")
data.head()
# Item-based: pivot so that rows are users, columns are movies, and values are ratings
pivot_table = data.pivot_table(index=["user_id"], columns=["movie_title"], values="rating")
pivot_table.head(10)
# Ratings given to the movie the user has just watched
movie_watched = pivot_table["Bad Boys (1995)"]
# Correlation between "Bad Boys (1995)" and every other movie
similarity_with_other_movies = pivot_table.corrwith(movie_watched)
similarity_with_other_movies = similarity_with_other_movies.sort_values(ascending=False)
similarity_with_other_movies.head()
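To make the `corrwith` step easier to follow, here is a minimal sketch on a hand-made table. The movie names and ratings are hypothetical and are not taken from ml-100k; the sketch only shows that each movie column is correlated (Pearson) with the chosen movie over the users who rated both.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings: rows are users, columns are movies, NaN = not rated.
toy = pd.DataFrame(
    {
        "Movie A": [5, 4, 1, 2, np.nan],
        "Movie B": [4, 5, 2, 1, 3],
        "Movie C": [1, 2, 5, 4, np.nan],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="user_id"),
)

# Pearson correlation of every movie column with "Movie A",
# computed only over the users who rated both movies.
toy_similarity = toy.corrwith(toy["Movie A"]).sort_values(ascending=False)
print(toy_similarity)
```

The chosen movie always correlates perfectly with itself, so it ranks first. On the real data it can be removed from the result with something like `.drop("Bad Boys (1995)")` before reading off the top recommendations.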
# User-based: pivot so that rows are movies, columns are users, and values are ratings
pivot_table = data.pivot_table(index=["movie_title"], columns=["user_id"], values="rating")
print(pivot_table.head(10))
# Ratings given by the target user (user_id 10)
target_user = pivot_table[10]
# Correlation between user 10 and every other user
similarity_with_other_users = pivot_table.corrwith(target_user)
similarity_with_other_users = similarity_with_other_users.sort_values(ascending=False)
similarity_with_other_users.head()
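The user-based code stops at a ranked list of similar users. One possible next step, shown here only as a sketch on hypothetical toy data, is to take the most similar user and suggest the movies that user rated but the target user has not. Picking a single nearest neighbour is an assumption made for brevity, not something the original post prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings: rows are movies, columns are user ids, NaN = not rated.
toy = pd.DataFrame(
    {
        10: [5, 4, 1, np.nan, np.nan],
        11: [5, 5, 1, 4, 2],
        12: [1, 2, 5, 5, 4],
    },
    index=pd.Index(
        ["Movie A", "Movie B", "Movie C", "Movie D", "Movie E"], name="movie_title"
    ),
)

target_id = 10
# Correlate every user column with the target user, then drop the target itself.
user_similarity = (
    toy.corrwith(toy[target_id]).drop(target_id).sort_values(ascending=False)
)
nearest_user = user_similarity.index[0]

# Movies the target has not rated, ranked by the nearest user's rating.
unseen = toy.index[toy[target_id].isna()]
recommendations = toy.loc[unseen, nearest_user].dropna().sort_values(ascending=False)
print(nearest_user)
print(recommendations)
```

In practice one would probably average over several neighbours, weighted by similarity, rather than trust a single user.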
The pros and cons of collaborative filtering are as follows:
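The code above produces a real-time recommendation for a single user who is browsing one item. Generating recommendation lists for all users may require spending some money on better hardware and using parallel computing, which works well here because every similarity computation can run independently.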
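Because the per-user computations do not depend on each other, they can be handed to a worker pool. The sketch below uses hypothetical toy data and Python's standard `concurrent.futures` module. The helper name `most_similar_user` is invented for illustration, and a thread pool is used only to keep the example short. For a CPU-heavy job on the full dataset, a process pool or a cluster would be the more likely choice.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd

# Hypothetical toy ratings: rows are movies, columns are user ids, NaN = not rated.
toy = pd.DataFrame(
    {
        1: [5, 4, 1, 2],
        2: [4, 5, 2, 1],
        3: [1, 2, 5, 4],
        4: [2, 1, 4, np.nan],
    },
    index=["Movie A", "Movie B", "Movie C", "Movie D"],
)


def most_similar_user(user_id):
    """Return (user_id, id of the other user whose ratings correlate best)."""
    similarity = toy.corrwith(toy[user_id]).drop(user_id)
    return int(user_id), int(similarity.idxmax())


# Each call only reads the shared table, so the calls can be dispatched independently.
with ThreadPoolExecutor(max_workers=4) as pool:
    nearest = dict(pool.map(most_similar_user, toy.columns))
print(nearest)
```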
In the next post we continue with Model-Based Collaborative Filtering.
The related code is available here, in the Day08 Collaborative Filtering directory.