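This post looks at the Kaggle "Predict Future Sales" data and uses a simple walkthrough to show what the data looks like. Thanks to the Kaggle author of https://www.kaggle.com/eeckhaut/coursera-predict-future-sales for the original notebook; please keep supporting their work.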
import os
import pickle

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from tqdm import tqdm_notebook
from IPython.display import FileLink

import lightgbm as lgb
from xgboost import XGBRegressor, plot_importance
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import LabelEncoder

pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 100)
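The code above imports a collection of plotting and data-processing libraries. There are plenty of modules to work with here, although this post does not explain each of them in depth, so let's keep going.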
items = pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/items.csv')
shops = pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/shops.csv')
cats = pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/item_categories.csv')
train = pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/sales_train.csv', parse_dates=['date'], dayfirst=True)
test = pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/test.csv')
fig, ax = plt.subplots(ncols=2, figsize = (15,5))
train.item_price.clip(-100,10000).hist(bins=50, ax=ax[0]);
train.item_cnt_day.clip(-100,50).hist(bins=50, ax=ax[1]);
# Reference price: median of the positive prices for the same shop, item and month
sale = train[(train.shop_id == 32) & (train.item_id == 2973) & (train.date_block_num == 4) & (train.item_price > 0)]
median = sale.item_price.median()
# Replace negative prices with that median
train.loc[train.item_price < 0, 'item_price'] = median
# Cap extreme prices at 100,000 and keep only rows with a non-negative price
train['item_price'] = train['item_price'].clip(-1, 10**5)
train = train.loc[train.item_price >= 0]
# Drop exact duplicate rows, keeping the first occurrence
train = train.drop_duplicates(keep='first')
# The city is taken as the first word of the shop name, then label-encoded
shops['city'] = shops['shop_name'].str.split(' ').map(lambda x: x[0])
shops['city_code'] = LabelEncoder().fit_transform(shops['city'])
shops = shops[['shop_id','city_code']]
# Category names are split on '-' into a type and a subtype;
# names without a '-' reuse the type as the subtype
cats['split'] = cats['item_category_name'].str.split('-')
cats['type'] = cats['split'].map(lambda x: x[0].strip())
cats['type_code'] = LabelEncoder().fit_transform(cats['type'])
cats['subtype'] = cats['split'].map(lambda x: x[1].strip() if len(x) > 1 else x[0].strip())
cats['subtype_code'] = LabelEncoder().fit_transform(cats['subtype'])
cats = cats[['item_category_id','type_code', 'subtype_code']]
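Reading the code above from the top: as usual, pandas reads the CSV files, and then histograms of item_price and item_cnt_day give a quick check that the data looks reasonable. After that, the tables are cleaned up (negative prices replaced, extreme prices capped, duplicate rows dropped) and the shop and category names are label-encoded, so that the data can be fed to a machine-learning model.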
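To make the cleaning and encoding steps easier to follow, here is a minimal sketch on toy tables. The table contents (`toy_train`, `toy_shops`, the shop names and prices) are made up for illustration and are not part of the Kaggle data; only the pattern mirrors the code above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy sales table (made-up values) with one negative price and one duplicate row
toy_train = pd.DataFrame({
    "shop_id":    [0, 0, 0, 1, 1],
    "item_id":    [10, 10, 10, 11, 11],
    "item_price": [100.0, 120.0, -1.0, 300.0, 300.0],
})

# Replace negative prices with the median of the positive prices
# recorded for the same shop and item
valid = toy_train[(toy_train.shop_id == 0)
                  & (toy_train.item_id == 10)
                  & (toy_train.item_price > 0)]
toy_train.loc[toy_train.item_price < 0, "item_price"] = valid.item_price.median()

# Drop exact duplicate rows, keeping the first occurrence
toy_train = toy_train.drop_duplicates(keep="first")

# Toy shop table (made-up names): the first word of the name is treated as the city
toy_shops = pd.DataFrame({
    "shop_id": [0, 1, 2],
    "shop_name": ["Moscow Mall A", "Kazan Center", "Moscow Mall B"],
})
toy_shops["city"] = toy_shops["shop_name"].str.split(" ").map(lambda x: x[0])
toy_shops["city_code"] = LabelEncoder().fit_transform(toy_shops["city"])

print(toy_train)
print(toy_shops[["shop_id", "city_code"]])
```

LabelEncoder assigns integer codes in sorted order of the labels, so the two "Moscow" shops share one code and "Kazan" gets another.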
def evaluate_model(model, X_train, Y_train, X_valid, Y_valid, X_test):
    # Predictions are clipped to [0, 20] before the RMSE is computed
    y_hat = model.predict(X_train).clip(0, 20)
    print('Train error:', np.sqrt(mean_squared_error(Y_train, y_hat)))
    y_val_hat = model.predict(X_valid).clip(0, 20)
    print('Valid error:', np.sqrt(mean_squared_error(Y_valid, y_val_hat)))
    # The test set has no labels, so we only predict on its features
    y_test = model.predict(X_test).clip(0, 20)
    return y_hat, y_val_hat, y_test
def create_lgbm_model(X_train, Y_train, X_valid, Y_valid, params, cat_features, early_stopping_rounds=50):
    n_estimators = 100  # maximum number of boosting rounds
    d_train = lgb.Dataset(X_train, Y_train)
    d_valid = lgb.Dataset(X_valid, Y_valid)
    watchlist = [d_train, d_valid]
    evals_result = {}
    # Early stopping, per-round logging and metric recording are passed as callbacks;
    # the equivalent keyword arguments of lgb.train() were removed in LightGBM 4.0.
    model = lgb.train(params,
                      d_train,
                      n_estimators,
                      valid_sets=watchlist,
                      categorical_feature=cat_features,
                      callbacks=[
                          lgb.early_stopping(early_stopping_rounds),
                          lgb.log_evaluation(period=1),
                          lgb.record_evaluation(evals_result),
                      ])
    lgb.plot_metric(evals_result)
    return model
params = {
    'metric': 'rmse',
    'objective': 'mse',
    'verbose': 0,
    'learning_rate': 0.1,
    'num_leaves': 31,
    'min_data_in_leaf': 20,
    'max_depth': -1,
    'save_binary': True,
    'bagging_fraction': 0.8,
    'bagging_freq': 1,
    'bagging_seed': 2**7,
    'feature_fraction': 0.8,
}
# X_train, Y_train, X_valid, Y_valid and categorical_features are not defined in this post;
# they have to be prepared beforehand (see the referenced notebook)
lgbm_model = create_lgbm_model(X_train, Y_train, X_valid, Y_valid, params, cat_features=categorical_features)
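Finally, the machine-learning steps are wrapped in reusable functions (evaluate_model and create_lgbm_model), so once we have a model we can train it and evaluate it.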
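To show the evaluation pattern without needing the full feature table, here is a minimal sketch that scores a model with clipped RMSE on synthetic data. The helper `clipped_rmse`, the synthetic features and the choice of Ridge are assumptions made for illustration; the post itself trains a LightGBM model on features that are not shown here.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

def clipped_rmse(model, X, y, low=0, high=20):
    """RMSE of the model's predictions after clipping them to [low, high]."""
    y_hat = model.predict(X).clip(low, high)
    return float(np.sqrt(mean_squared_error(y, y_hat)))

# Synthetic data (made up for illustration): a noisy linear target
# built from three features, clipped to the same [0, 20] range
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (5 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)).clip(0, 20)

# Hold out the last 50 rows for validation
X_train, X_valid = X[:150], X[150:]
y_train, y_valid = y[:150], y[150:]

model = Ridge(alpha=1.0).fit(X_train, y_train)
train_rmse = clipped_rmse(model, X_train, y_train)
valid_rmse = clipped_rmse(model, X_valid, y_valid)
print("Train error:", train_rmse)
print("Valid error:", valid_rmse)
```

Comparing the two numbers is the point of printing both: a validation error much larger than the training error suggests the model is overfitting.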
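After this long walkthrough, I admit I have not explained things in much detail. I hope you will read and support the original author's notebook, since I am honestly not very good at explaining. Still, do try typing the code out yourself: both the data files and the code are available, and you can learn a lot by working through them. I hope everyone improves a lot and keeps moving up.
Thanks everyone, that was my non-expert introduction to AI. See you in the next post~
Reference: https://www.kaggle.com/eeckhaut/coursera-predict-future-sales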