iT邦幫忙

11th iThome Ironman Contest

DAY 28
AI & Data

AI & Machine Learning series, part 28

(An Unprofessional Introduction to AI) Predict Future Sales, Day 28


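This post introduces a Kaggle dataset for predicting future sales and walks through a simple example to show what the data looks like. Thanks to the Kaggle author of https://www.kaggle.com/eeckhaut/coursera-predict-future-sales for the original notebook; please keep supporting their work.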

import os
import pickle

import numpy as np
import pandas as pd

# Plotting (%matplotlib inline is a Jupyter magic and only works in a notebook)
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Progress bars and notebook helpers
from tqdm import tqdm_notebook
from IPython.display import FileLink

# Preprocessing, metrics, and models
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor, plot_importance
import lightgbm as lgb

# Show more rows and columns when displaying DataFrames
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 100)

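The code above imports a collection of plotting and data-processing modules. There are many modules to choose from, but no single one covers every need, so several are combined here. Let's continue.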

items = pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/items.csv')
shops = pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/shops.csv')
cats = pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/item_categories.csv')
train = pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/sales_train.csv', parse_dates=['date'], dayfirst=True)
test  = pd.read_csv('/kaggle/input/competitive-data-science-predict-future-sales/test.csv')

fig, ax = plt.subplots(ncols=2, figsize = (15,5))
train.item_price.clip(-100,10000).hist(bins=50, ax=ax[0]);
train.item_cnt_day.clip(-100,50).hist(bins=50, ax=ax[1]);

# Replace negative prices with the median price of a comparable sale
# (shop 32, item 2973, month 4).
sale = train[(train.shop_id == 32) & (train.item_id == 2973) & (train.date_block_num == 4) & (train.item_price > 0)]
median = sale.item_price.median()
train.loc[train.item_price < 0, 'item_price'] = median

# Cap extreme prices at 100,000.
train['item_price'] = train['item_price'].clip(upper=10**5)

# Safety net: keep only rows with a non-negative price.
train = train.loc[train.item_price >= 0]

# Drop exact duplicate rows, keeping the first occurrence.
train = train.drop_duplicates(keep='first')

# Treat the first word of each shop name as the city and encode it as an integer.
shops['city'] = shops['shop_name'].str.split(' ').map(lambda x: x[0])
shops['city_code'] = LabelEncoder().fit_transform(shops['city'])
shops = shops[['shop_id','city_code']]

# Split each category name on the hyphen into a type and a subtype.
cats['split'] = cats['item_category_name'].str.split('-')
cats['type'] = cats['split'].map(lambda x: x[0].strip())
cats['type_code'] = LabelEncoder().fit_transform(cats['type'])

# When a category has no subtype, fall back to the type.
cats['subtype'] = cats['split'].map(lambda x: x[1].strip() if len(x) > 1 else x[0].strip())
cats['subtype_code'] = LabelEncoder().fit_transform(cats['subtype'])
cats = cats[['item_category_id','type_code', 'subtype_code']]

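Reading the code above from the top: as usual, we load the data with pandas, then plot histograms of the price and daily sales columns to check whether the distributions look reasonable. Once the data has been cleaned (negative prices replaced, extreme prices capped, duplicates dropped), the text columns for shops and categories are converted to integer codes so that the tables can be fed to machine learning models.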
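To make the cleaning and encoding steps above concrete, here is a minimal sketch on a tiny hand-made table. The shop names, the prices, and the helper names `fix_negative_prices` and `encode_city` are invented for illustration; they are not part of the Kaggle data or the referenced notebook. Unlike the original code, which takes the median from one specific shop, item, and month, this sketch simply uses the median of all positive prices in the toy table.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def fix_negative_prices(sales: pd.DataFrame) -> pd.DataFrame:
    """Replace negative prices with the median of the positive prices.

    Hypothetical helper for this example only.
    """
    fixed = sales.copy()
    median = fixed.loc[fixed.item_price > 0, 'item_price'].median()
    fixed.loc[fixed.item_price < 0, 'item_price'] = median
    return fixed


def encode_city(shops: pd.DataFrame) -> pd.DataFrame:
    """Take the first word of shop_name as the city and label-encode it."""
    encoded = shops.copy()
    encoded['city'] = encoded['shop_name'].str.split(' ').map(lambda x: x[0])
    encoded['city_code'] = LabelEncoder().fit_transform(encoded['city'])
    return encoded[['shop_id', 'city_code']]


# Toy data, invented for this example.
toy_sales = pd.DataFrame({'item_price': [100.0, 300.0, -1.0, 200.0]})
toy_shops = pd.DataFrame({
    'shop_id': [0, 1, 2],
    'shop_name': ['Moscow Mall A', 'Kazan Center', 'Moscow Mall B'],
})

clean_sales = fix_negative_prices(toy_sales)
shop_codes = encode_city(toy_shops)
print(clean_sales)
print(shop_codes)
```

In this toy table the negative price is replaced by 200.0, and the two Moscow shops end up with the same `city_code`, which is the point of encoding the city rather than the full shop name.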

def evaluate_model(model, X_train, Y_train, X_valid, Y_valid, X_test):
    # Predictions are clipped to the range [0, 20] before computing the RMSE.
    y_hat = model.predict(X_train).clip(0, 20)
    print('Train error:', np.sqrt(mean_squared_error(Y_train, y_hat)))
    y_val_hat = model.predict(X_valid).clip(0, 20)
    print('Valid error:', np.sqrt(mean_squared_error(Y_valid, y_val_hat)))

    # Predict on the test features; there are no labels to score against.
    y_test = model.predict(X_test).clip(0, 20)

    return y_hat, y_val_hat, y_test


def create_lgbm_model(X_train, Y_train, X_valid, Y_valid, params, cat_features, early_stopping_rounds=50):
    n_estimators = 100  # maximum number of boosting rounds
    d_train = lgb.Dataset(X_train, Y_train)
    d_valid = lgb.Dataset(X_valid, Y_valid)
    watchlist = [d_train, d_valid]
    evals_result = {}
    # Note: evals_result, early_stopping_rounds, and verbose_eval are accepted by
    # lgb.train in LightGBM versions before 4.0; later versions use callbacks instead.
    model = lgb.train(params,
                      d_train,
                      n_estimators,
                      valid_sets=watchlist,
                      evals_result=evals_result,
                      early_stopping_rounds=early_stopping_rounds,
                      verbose_eval=1,
                      categorical_feature=cat_features,
                      )
    lgb.plot_metric(evals_result)
    return model


params = {
    'metric': 'rmse',
    'objective': 'mse',
    'verbose': 0,
    'learning_rate': 0.1,
    'num_leaves': 31,
    'min_data_in_leaf': 20,
    'max_depth': -1,
    'save_binary': True,
    'bagging_fraction': 0.8,
    'bagging_freq': 1,
    'bagging_seed': 2**7,
    'feature_fraction': 0.8,
}

# X_train, Y_train, X_valid, Y_valid, and categorical_features are not defined in
# this excerpt; they are assumed to come from the feature-engineering and
# train/validation split steps of the referenced notebook.
lgbm_model = create_lgbm_model(X_train, Y_train, X_valid, Y_valid, params, cat_features=categorical_features)

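Finally, the machine learning steps are wrapped in reusable functions: create_lgbm_model trains a LightGBM model, and evaluate_model reports the training and validation error. Once we have a model, we can train it and test it through these functions.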
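As a rough illustration of the evaluation step, the sketch below computes the same clipped RMSE on synthetic data. To keep it light, it swaps in scikit-learn's `LinearRegression` for LightGBM. The data, the 150/50 split, and the helper name `clipped_rmse` are assumptions made for this example; only the clipping range of 0 to 20 comes from the code above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


def clipped_rmse(model, X, y, low=0, high=20):
    """RMSE after clipping predictions to [low, high]."""
    y_hat = model.predict(X).clip(low, high)
    return np.sqrt(mean_squared_error(y, y_hat))


# Synthetic data, invented for this example: y = 2 * x, so targets lie in [0, 20).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2 * X[:, 0]

# Simple split: first 150 rows for training, last 50 for validation.
X_train, X_valid = X[:150], X[150:]
Y_train, Y_valid = y[:150], y[150:]

model = LinearRegression().fit(X_train, Y_train)
train_rmse = clipped_rmse(model, X_train, Y_train)
valid_rmse = clipped_rmse(model, X_valid, Y_valid)
print('Train error:', train_rmse)
print('Valid error:', valid_rmse)
```

Because the synthetic targets are an exact linear function of the input, both errors come out very close to zero. On the real sales data the gap between the training and validation error is what indicates overfitting.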

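After this long walkthrough, I admit that I have not explained things in much detail. I hope you will support the original author's article and learn from it, because I am honestly not very good at explaining. That said, do try typing the code yourself: both the data files and the code are provided, and you will learn a lot by working through them. I hope everyone grows along the way.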

Thanks, everyone. That's it for this unprofessional introduction to AI. See you in the next post!

Reference: https://www.kaggle.com/eeckhaut/coursera-predict-future-sales

