Dataset download:
http://files.grouplens.org/datasets/movielens/ml-100k.zip
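What the data means:
u.data holds the 100k rating records (tab-separated); the fields in each row are:
user id | item id | rating | timestamp
u.user holds information about the users; the fields in each row are:
user id | age | gender | occupation | zip code
u.item holds information about the movies; the fields in each row are: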
movie/item id | movie title | release date | video release date | IMDb URL | unknown | Action | Adventure | Animation | Children’s | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western
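The u.item layout above can be turned into a list of column names for pandas. The following is a minimal sketch, not part of the original script: the two sample rows are made up for illustration (they are not real dataset rows), and the variable names `item_names`, `sample`, and `genre_cols` are assumptions. It treats the last 19 fields as 0/1 genre flags.

```python
import io
import pandas as pd

# Column names follow the u.item field list above.
item_names = [
    'movie id', 'movie title', 'release date', 'video release date', 'IMDb URL',
    'unknown', 'Action', 'Adventure', 'Animation', "Children's", 'Comedy',
    'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror',
    'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western',
]

# Two made-up rows in the same pipe-separated layout (NOT real dataset rows).
sample = (
    "1|Example Movie A (1995)|01-Jan-1995||http://example.com/a|"
    "0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0\n"
    "2|Example Movie B (1996)|15-Mar-1996||http://example.com/b|"
    "0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0\n"
)

items = pd.read_csv(io.StringIO(sample), sep='|', names=item_names)

# The 19 genre flags start after the first five descriptive columns.
genre_cols = item_names[5:]
genre_flags = items[genre_cols]

# Number of genres flagged for each movie.
items['n_genres'] = genre_flags.sum(axis=1)

# Genre names flagged for the first sample row.
first_genres = genre_flags.columns[genre_flags.iloc[0] == 1].tolist()
print(items[['movie id', 'movie title', 'n_genres']])
print(first_genres)
```

For the real file, the same call would take the path 'ml-100k/u.item' in place of the in-memory sample. Some titles contain non-ASCII characters, so passing encoding='latin-1' is likely needed; treat that as an assumption to check against your copy of the file.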
For the API documentation, see http://pandas.pydata.org/pandas-docs/stable/
# -*- coding: utf-8 -*-
import pandas as pd

# Load user info (pipe-separated) and rating records (tab-separated).
users_names = ['user id', 'age', 'gender', 'occupation', 'zip code']
users = pd.read_csv('ml-100k/u.user', sep='|', names=users_names)
data_names = ['user id', 'item id', 'rating', 'timestamp']
data = pd.read_csv('ml-100k/u.data', sep='\t', names=data_names)

# Keep only the needed columns and join each rating to its user's gender.
users_df = users[['user id', 'gender']]
data_df = data[['user id', 'rating']]
rating_df = pd.merge(users_df, data_df, on='user id')

# Mean rating per user, then the standard deviation of those means by gender.
rating_df_mean = rating_df.groupby(['gender', 'user id']).mean()
print(rating_df_mean.groupby(['gender']).std())
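To see what the two groupby steps compute, here is a small worked sketch on made-up data. The four users and their ratings are hypothetical, not taken from MovieLens, and the names `per_user`, `per_gender_std`, and `raw_std` are illustrative. The script measures how much users' average ratings vary within each gender, which is not the same as the spread of the individual ratings.

```python
import pandas as pd

# Made-up ratings for four hypothetical users (NOT MovieLens data).
users = pd.DataFrame({'user id': [1, 2, 3, 4],
                      'gender': ['F', 'F', 'M', 'M']})
data = pd.DataFrame({'user id': [1, 1, 2, 2, 3, 3, 4, 4],
                     'rating':  [3, 3, 5, 5, 1, 3, 5, 5]})

rating_df = pd.merge(users, data, on='user id')

# Step 1: mean rating per user.
# F users: 3.0 and 5.0; M users: 2.0 and 5.0
per_user = rating_df.groupby(['gender', 'user id'])['rating'].mean()

# Step 2: sample standard deviation (pandas default, ddof=1) of those
# per-user means within each gender.
# F: sqrt(2) ~ 1.414, M: sqrt(4.5) ~ 2.121
per_gender_std = per_user.groupby(level='gender').std()
print(per_gender_std)

# For contrast: standard deviation of the raw ratings by gender,
# which weights every rating equally instead of every user.
raw_std = rating_df.groupby('gender')['rating'].std()
print(raw_std)
```

Averaging per user first gives every user equal weight, so a few very active raters do not dominate the result.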