Day 20 - SageMaker: From Model Training to Deployment and Evaluation

2023 iThome Ironman Contest

Cloud Native

AWS AI Trading Room in Practice series, part 20

15th Ironman Contest

slinbb

2023-09-25 09:41:01

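~ Predicting a prediction we already know ~

Picking up from yesterday, we now move on to training the model.

Choosing the right algorithm usually means evaluating several models to find the one that best fits the data. To keep things simple, we use the built-in SageMaker XGBoost algorithm and skip the model comparison step.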

import sagemaker

region = sagemaker.Session().boto_region_name
print("AWS Region: {}".format(region))

role = sagemaker.get_execution_role()
print("RoleArn: {}".format(role))

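Create the XGBoost estimator

The main parameters are:

image_uri - the URI of the training container image. In this example, sagemaker.image_uris.retrieve returns the URI of the SageMaker XGBoost training container.
role - the execution role that SageMaker assumes to run tasks on your behalf.
instance_count and instance_type - the number and type of Amazon EC2 ML compute instances used for training. This exercise uses a single ml.m4.xlarge instance, which has 4 vCPUs, 16 GiB of memory, Amazon Elastic Block Store (Amazon EBS) storage, and high network performance.
volume_size - the size, in GB, of the EBS storage volume attached to the training instance.
output_path - the S3 bucket path where SageMaker stores the model artifact and training results.
sagemaker_session - the session object that manages interactions with SageMaker API operations and the other AWS services used by the training job.
rules - a list of SageMaker Debugger built-in rules. In this example, the create_xgboost_report() rule creates an XGBoost report with insights into training progress and results, and the ProfilerReport() rule creates a report on EC2 compute resource utilization.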

from sagemaker.debugger import Rule, ProfilerRule, rule_configs
from sagemaker.session import TrainingInput

s3_output_location='s3://{}/{}/{}'.format(bucket, prefix, 'xgboost_model')

container=sagemaker.image_uris.retrieve("xgboost", region, "1.2-1")
print(container)

xgb_model=sagemaker.estimator.Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type='ml.m4.xlarge',
    volume_size=5,
    output_path=s3_output_location,
    sagemaker_session=sagemaker.Session(),
    rules=[
        Rule.sagemaker(rule_configs.create_xgboost_report()),
        ProfilerRule.sagemaker(rule_configs.ProfilerReport())
    ]
)

Set the hyperparameters

xgb_model.set_hyperparameters(
    max_depth = 5,
    eta = 0.2,
    gamma = 4,
    min_child_weight = 6,
    subsample = 0.7,
    objective = "binary:logistic",
    num_round = 1000
)

Set the training data sources

from sagemaker.session import TrainingInput

train_input = TrainingInput(
    "s3://{}/{}/{}".format(bucket, prefix, "data/train.csv"), content_type="csv"
)
validation_input = TrainingInput(
    "s3://{}/{}/{}".format(bucket, prefix, "data/validation.csv"), content_type="csv"
)

Start training the model. The complete output should land in S3 after roughly 10 minutes.

xgb_model.fit({"train": train_input, "validation": validation_input}, wait=True)

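Use the commands below to check whether the output has been generated. Make sure xgboost_report.html exists in the end; this may take a little while.
Then copy the output into the notebook directory.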

rule_output_path = xgb_model.output_path + "/" + \
										xgb_model.latest_training_job.job_name + "/rule-output"
! aws s3 ls {rule_output_path} --recursive
! aws s3 cp {rule_output_path} ./ --recursive
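
As a hedged alternative to eyeballing the `aws s3 ls` output, the check for the report can be scripted. The sketch below assumes the layout used in this post (`<output_path>/<job_name>/rule-output/CreateXgboostReport/xgboost_report.html`). The helper names, and the bucket and job names in the usage note, are illustrative and not part of the SageMaker SDK. It relies on the boto3 S3 client's `list_objects_v2` call and only inspects the first page of results.

```python
from urllib.parse import urlparse

# Relative location of the report inside the rule-output folder (as seen in this post).
REPORT_SUFFIX = "CreateXgboostReport/xgboost_report.html"


def rule_output_uri(output_path, job_name):
    """Build the S3 URI of the Debugger rule output for one training job."""
    return output_path.rstrip("/") + "/" + job_name + "/rule-output"


def split_s3_uri(uri):
    """Split 's3://bucket/key/prefix' into (bucket, key_prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: " + uri)
    return parsed.netloc, parsed.path.lstrip("/")


def report_is_ready(uri, s3_client=None):
    """Return True if xgboost_report.html is listed under the rule-output URI."""
    if s3_client is None:
        import boto3  # imported lazily so the helpers above work without AWS access
        s3_client = boto3.client("s3")
    bucket_name, key_prefix = split_s3_uri(uri)
    response = s3_client.list_objects_v2(Bucket=bucket_name, Prefix=key_prefix)
    keys = [obj["Key"] for obj in response.get("Contents", [])]
    return any(key.endswith(REPORT_SUFFIX) for key in keys)
```

In the notebook this could be called as `report_is_ready(rule_output_uri(xgb_model.output_path, xgb_model.latest_training_job.job_name))` and repeated until it returns True.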

Generate a link to the training report.
The link can be clicked to open the report. To see all of the images in it, also click 'Trust HTML' at the top left.

from IPython.display import FileLink, FileLinks
display("Click link below to view the XGBoost Training report", \
					FileLink("CreateXgboostReport/xgboost_report.html"))

The report includes a Loss vs Step graph (the official AWS documentation says this graph shows an overfitting problem, but does not explain further).

Check where the trained model artifact is stored

xgb_model.model_data

Deploy the trained model to an EC2 instance. In practice this instance shows up in the SageMaker service as an endpoint.

import sagemaker
from sagemaker.serializers import CSVSerializer
xgb_predictor=xgb_model.deploy(
    initial_instance_count=1,
    instance_type='ml.t2.medium',
    serializer=CSVSerializer()
)

Once the deployment above has finished, we can print the endpoint name.

From then on, we only need to send a payload to this endpoint through the AWS SDK to get predictions.

xgb_predictor.endpoint_name
# output: 'sagemaker-xgboost-{timestamp}'
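
Sending a payload through the AWS SDK can be sketched as follows. This is a minimal example, assuming the endpoint accepts `text/csv` input and returns comma- or newline-separated scores, which should be verified against the actual response. `rows_to_csv` and `invoke_csv` are hypothetical helper names; `invoke_endpoint` is the boto3 `sagemaker-runtime` client call.

```python
def rows_to_csv(rows):
    """Serialize feature rows (no label column, no header) as CSV text."""
    return "\n".join(",".join(str(value) for value in row) for row in rows)


def invoke_csv(endpoint_name, rows, runtime_client=None):
    """Send rows to a SageMaker endpoint and return the scores as floats."""
    if runtime_client is None:
        import boto3  # imported lazily; requires AWS credentials at call time
        runtime_client = boto3.client("sagemaker-runtime")
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=rows_to_csv(rows).encode("utf-8"),
    )
    text = response["Body"].read().decode("utf-8")
    # Assumption: scores are separated by commas and/or newlines.
    return [float(item) for item in text.replace("\n", ",").split(",") if item.strip()]
```

With the predictor from this post, a call would look like `invoke_csv(xgb_predictor.endpoint_name, feature_rows)`, where `feature_rows` is a placeholder for rows in the same column order as the training data minus the label.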

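Next, define a predict function that feeds data to the xgb_predictor created earlier.
The inference task is to predict, from the existing adult census data, whether annual income exceeds 50K.
When we prepared the test data, we also added a label column: 1 if income > 50K, else 0.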

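import numpy as np
def predict(data, rows=1000):
    # Split the input into batches of roughly `rows` rows each
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ''
    for array in split_array:
        # Append each batch's comma-separated scores to one string
        predictions = ','.join([predictions, xgb_predictor.predict(array).decode('utf-8')])
    # Drop the leading comma and parse the scores into a float array
    return np.fromstring(predictions[1:], sep=',')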
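
To make the batching in predict concrete, here is a small standard-library sketch of how many batches `int(n / rows + 1)` produces and how `np.array_split` sizes them (the first `n % k` batches get one extra row). `chunk_sizes` is a hypothetical helper written for illustration; it mirrors the documented behaviour of `np.array_split` rather than calling it.

```python
def chunk_sizes(n_rows, rows=1000):
    """Return the batch sizes predict() would send for n_rows input rows."""
    n_chunks = int(n_rows / float(rows) + 1)  # same expression as in predict()
    base, extra = divmod(n_rows, n_chunks)
    # The first `extra` batches carry one more row than the rest.
    return [base + 1] * extra + [base] * (n_chunks - extra)


# The test set in this post has 6513 rows, so 7 batches are sent.
print(chunk_sizes(6513))  # [931, 931, 931, 930, 930, 930, 930]
```

Note that the `+ 1` adds a batch even when the row count is an exact multiple of `rows`: 2000 rows are sent as three batches, not two. That is harmless, since every batch still stays below `rows`.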

Next we evaluate the model, using the 'test' data created earlier as input.

import matplotlib.pyplot as plt

predictions=predict(test.to_numpy()[:,1:])
plt.hist(predictions)
plt.show()

Figure: histogram of the predictions on the test set

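Because the predictions are floating-point scores, we have to split them into two classes: values above 0.5 count as true (1) and the rest as false (0).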

import sklearn.metrics  # import the submodule explicitly; a bare "import sklearn" may not load it

cutoff=0.5
print(sklearn.metrics.confusion_matrix(test.iloc[:, 0], \
			np.where(predictions > cutoff, 1, 0)))
print(sklearn.metrics.classification_report(test.iloc[:, 0], \
			np.where(predictions > cutoff, 1, 0)))

[[4670  356]
 [ 480 1007]]
              precision    recall  f1-score   support

           0       0.91      0.93      0.92      5026
           1       0.74      0.68      0.71      1487

    accuracy                           0.87      6513
   macro avg       0.82      0.80      0.81      6513
weighted avg       0.87      0.87      0.87      6513
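
The numbers in this report can be re-derived by hand from the confusion matrix, which helps when reading it. The sketch below uses only the four counts printed above, with rows as actual classes and columns as predicted classes (the scikit-learn convention). `per_class_metrics` is a helper name introduced here for illustration.

```python
# Confusion matrix from the output above: rows = actual class, columns = predicted class.
cm = [[4670, 356],
      [480, 1007]]


def per_class_metrics(matrix, k):
    """Return (precision, recall, f1) for class k."""
    true_positive = matrix[k][k]
    predicted_as_k = sum(row[k] for row in matrix)  # column sum
    actually_k = sum(matrix[k])                     # row sum
    precision = true_positive / predicted_as_k
    recall = true_positive / actually_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(sum(row) for row in cm)

for k in (0, 1):
    p, r, f = per_class_metrics(cm, k)
    print(k, round(p, 2), round(r, 2), round(f, 2))
print("accuracy", round(accuracy, 2))
```

For class 1, for example, precision is 1007 / (356 + 1007) and recall is 1007 / (480 + 1007), which round to the 0.74 and 0.68 shown in the report.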

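From this we can see that the precision is 0.91 for predictions of 0 (income < 50K) and 0.74 for predictions of 1 (income > 50K).
However, 0.5 is not necessarily the best cutoff, so we look for a better one by computing the log loss function of the logistic regression
(sweeping the cutoff as a variable produces the curve below).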

import matplotlib.pyplot as plt

cutoffs = np.arange(0.01, 1, 0.01)
log_loss = []
for c in cutoffs:
    log_loss.append(
        sklearn.metrics.log_loss(test.iloc[:, 0], np.where(predictions > c, 1, 0))
    )

plt.figure(figsize=(15,10))
plt.plot(cutoffs, log_loss)
plt.xlabel("Cutoff")
plt.ylabel("Log loss")

Figure: log loss versus cutoff

Print the cutoff that gives the minimum log loss

print(
    'Log loss is minimized at a cutoff of ', cutoffs[np.argmin(log_loss)], 
    ', and the log loss value at the minimum is ', np.min(log_loss)
)
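
To see what this sweep measures, here is a standard-library sketch on a tiny made-up dataset (the labels and scores below are invented for illustration and are not the census data). Because the predictions are thresholded to hard 0/1 values before the log loss is computed, each misclassified row contributes roughly -log(eps) and each correct row contributes almost nothing, so the curve is approximately the error rate times a constant. The sketch assumes a clipping constant eps = 1e-15, which may differ from the one your scikit-learn version uses.

```python
import math


def hard_log_loss(y_true, scores, cutoff, eps=1e-15):
    """Log loss of predictions thresholded at `cutoff`, clipped to [eps, 1 - eps]."""
    total = 0.0
    for y, score in zip(y_true, scores):
        p = 1.0 if score > cutoff else 0.0
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -math.log(p) if y == 1 else -math.log(1.0 - p)
    return total / len(y_true)


# Toy data (assumption): six labelled rows with model scores.
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.6, 0.9]

cutoffs = [i / 100 for i in range(1, 100)]
losses = [hard_log_loss(y_true, scores, c) for c in cutoffs]
best = min(range(len(cutoffs)), key=losses.__getitem__)  # first minimum
print("best cutoff:", cutoffs[best], "log loss:", losses[best])
```

On this toy data the minimum corresponds to the cutoff range with a single misclassified row, so choosing the cutoff this way is close to choosing the cutoff with the fewest errors.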

When you are done, don't forget to clean up:

Delete the SageMaker endpoint
SageMaker left-hand menu: Inference → Endpoints → select the endpoint → Actions → Delete
Delete the endpoint configuration
SageMaker left-hand menu: Inference → Endpoint configurations → select the endpoint configuration → Actions → Delete
Delete the model
SageMaker left-hand menu: Inference → Models → select the model → Actions → Delete
Delete the notebook instance
SageMaker left-hand menu: Notebook → Notebook instances → select the instance → Actions → Stop
Once the instance has stopped, choose Actions → Delete
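
The first three console steps above can also be scripted. The following is a minimal sketch using the boto3 `sagemaker` client, assuming the endpoint was created by the deploy call in this post; `delete_endpoint_resources` is a hypothetical helper name. It looks up the endpoint configuration and model names before deleting anything, because they can no longer be looked up once the endpoint is gone. The notebook instance is left out, since this code would normally run from it.

```python
def delete_endpoint_resources(endpoint_name, sm_client=None):
    """Delete an endpoint, its endpoint configuration, and the models it references."""
    if sm_client is None:
        import boto3  # imported lazily; requires AWS credentials at call time
        sm_client = boto3.client("sagemaker")

    # Resolve the dependent resource names first.
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [variant["ModelName"] for variant in config["ProductionVariants"]]

    # Delete in the same order as the console steps: endpoint, configuration, model.
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for model_name in model_names:
        sm_client.delete_model(ModelName=model_name)
    return config_name, model_names
```

In the notebook this would be called as `delete_endpoint_resources(xgb_predictor.endpoint_name)`.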