[Day 27] 3rd: Playground - Training the Model (by TensorFlow) #2

11th Ironman Contest, machine learning, google, machinelearning

eyelash*睫毛

2019-10-12 23:55:51

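Now we will use TensorFlow to train our model.

The steps are: getting ready, building features, using age as a categorical feature, defining the model features, training the deep neural network model, evaluating the neural network, building the confusion matrix, and observing subgroups.

Using: Playground

Getting Ready

This time we use TensorFlow's Estimator API to run the DNNClassifier class. First we have to convert our dataset, a pandas DataFrame, into Tensors (using tf.estimator.inputs.pandas_input_fn()) so that the later steps can consume it.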

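import tensorflow as tf  # TensorFlow 1.x API (in 2.x, look under tf.compat.v1)

def csv_to_pandas_input_fn(data, batch_size=100, num_epochs=1, shuffle=False):
  # x: every column except the label.
  # y: 1 if income_bracket contains ">50K", otherwise 0.
  return tf.estimator.inputs.pandas_input_fn(
      x=data.drop('income_bracket', axis=1),
      y=data['income_bracket'].apply(lambda x: ">50K" in x).astype(int),
      batch_size=batch_size,
      num_epochs=num_epochs,
      shuffle=shuffle,
      num_threads=1)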
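
The label expression inside csv_to_pandas_input_fn can be tried on its own. The sketch below uses a tiny made-up DataFrame (the rows are hypothetical, not records from the dataset) to show how `">50K" in x` turns the income_bracket strings into 0/1 labels, and why a substring test also tolerates variants such as a trailing period.

```python
import pandas as pd

# Hypothetical rows for illustration only; not real records from the dataset.
toy_df = pd.DataFrame({
    "age": [39, 50, 28, 44],
    "income_bracket": ["<=50K", ">50K", ">50K.", "<=50K."],
})

# Same expression as in csv_to_pandas_input_fn: True where the string
# contains ">50K", then cast the booleans to integers (1 = over 50K).
labels = toy_df["income_bracket"].apply(lambda x: ">50K" in x).astype(int)

# The features are everything except the label column.
features = toy_df.drop("income_bracket", axis=1)

print(labels.tolist())         # [0, 1, 1, 0]
print(list(features.columns))  # ['age']
```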

Building Features

TensorFlow needs the data mapped into the model as features, which is done with the tf.feature_column module; numeric features are created with tf.feature_column.numeric_column().

# Since we don't know the full range of possible values with occupation and
# native_country, we'll use categorical_column_with_hash_bucket() to help map
# each feature string into an integer ID.
occupation = tf.feature_column.categorical_column_with_hash_bucket(
    "occupation", hash_bucket_size=1000)
native_country = tf.feature_column.categorical_column_with_hash_bucket(
    "native_country", hash_bucket_size=1000)

# For the remaining categorical features, since we know what the possible values
# are, we can be more explicit and use categorical_column_with_vocabulary_list()
gender = tf.feature_column.categorical_column_with_vocabulary_list(
    "gender", ["Female", "Male"])
race = tf.feature_column.categorical_column_with_vocabulary_list(
    "race", [
        "White", "Asian-Pac-Islander", "Amer-Indian-Eskimo", "Other", "Black"
    ])
education = tf.feature_column.categorical_column_with_vocabulary_list(
    "education", [
        "Bachelors", "HS-grad", "11th", "Masters", "9th",
        "Some-college", "Assoc-acdm", "Assoc-voc", "7th-8th",
        "Doctorate", "Prof-school", "5th-6th", "10th", "1st-4th",
        "Preschool", "12th"
    ])
marital_status = tf.feature_column.categorical_column_with_vocabulary_list(
    "marital_status", [
        "Married-civ-spouse", "Divorced", "Married-spouse-absent",
        "Never-married", "Separated", "Married-AF-spouse", "Widowed"
    ])
relationship = tf.feature_column.categorical_column_with_vocabulary_list(
    "relationship", [
        "Husband", "Not-in-family", "Wife", "Own-child", "Unmarried",
        "Other-relative"
    ])
workclass = tf.feature_column.categorical_column_with_vocabulary_list(
    "workclass", [
        "Self-emp-not-inc", "Private", "State-gov", "Federal-gov",
        "Local-gov", "?", "Self-emp-inc", "Without-pay", "Never-worked"
    ])
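
To get a feel for the two kinds of categorical columns used above, here is a plain-Python sketch of the idea behind each. It is not TensorFlow's implementation: the helper names are made up, and MD5 is only a stand-in for TensorFlow's own hash function, so the bucket IDs will not match what the real column produces.

```python
import hashlib

def hash_bucket_id(value, hash_bucket_size=1000):
    """Map any string to a bucket ID in [0, hash_bucket_size).

    Stand-in only: MD5 is used here just to get a stable hash.
    """
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size

def vocabulary_id(value, vocabulary, default_value=-1):
    """Map a string to its position in a known vocabulary list."""
    return vocabulary.index(value) if value in vocabulary else default_value

# Any string gets a bucket, even one never seen before; different strings
# may end up sharing a bucket.
for occupation_name in ["Sales", "Tech-support", "Some-new-job"]:
    print(occupation_name, "->", hash_bucket_id(occupation_name))

# With a vocabulary list, IDs follow the list order; unknown values get -1.
gender_vocab = ["Female", "Male"]
print(vocabulary_id("Male", gender_vocab))     # 1
print(vocabulary_id("Unknown", gender_vocab))  # -1
```

Hashing trades occasional collisions for not having to list every value up front, which is why it suits occupation and native_country here.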

Set up the numeric feature columns:

age = tf.feature_column.numeric_column("age")
fnlwgt = tf.feature_column.numeric_column("fnlwgt")
education_num = tf.feature_column.numeric_column("education_num")
capital_gain = tf.feature_column.numeric_column("capital_gain")
capital_loss = tf.feature_column.numeric_column("capital_loss")
hours_per_week = tf.feature_column.numeric_column("hours_per_week")

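Using Age as a Categorical Feature

After converting the data into the format TensorFlow needs, we create a categorical feature. This time we use age: we group ages into ranges, turning the numeric value into an ordered categorical feature (that is, an ordinal feature).

We use the following bucket boundaries: 18, 25, 30, 35, 40, 45, 50, 55, 60, 65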

age_buckets = tf.feature_column.bucketized_column(
    age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
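
The bucketing can be mimicked in plain Python to see which bucket a given age lands in. This is a sketch of the idea rather than TensorFlow code; the helper names are made up, and it assumes each boundary value belongs to the bucket on its right, which is how bucketized_column is documented to behave.

```python
import bisect

BOUNDARIES = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]

def age_bucket(age, boundaries=BOUNDARIES):
    """Return the bucket index for an age.

    Bucket 0 is age < 18, bucket 1 is 18 <= age < 25, and so on; the last
    bucket (index 10) is age >= 65.
    """
    return bisect.bisect_right(boundaries, age)

def one_hot(index, depth):
    """One-hot encode a bucket index."""
    return [1 if i == index else 0 for i in range(depth)]

num_buckets = len(BOUNDARIES) + 1  # 10 boundaries give 11 buckets
for age in [17, 18, 24, 25, 64, 65, 90]:
    idx = age_bucket(age)
    print(age, idx, one_hot(idx, num_buckets))
```

Ages 18 and 24 share bucket 1, while 25 moves to bucket 2, so the model sees age ranges instead of raw years.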

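Defining the Model Features

Before training the model, we choose which features it will use. Here we pick gender as the subgroup and store it in the subgroup_variables list.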

variables = [native_country, education, occupation, workclass,
             relationship, age_buckets]
subgroup_variables = [gender]
feature_columns = variables + subgroup_variables

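Training the Deep Neural Network Model

Defining the hidden layers

We define a feed-forward neural network with two hidden layers. High-dimensional categorical features are converted into low-dimensional, dense, real-valued vectors, which are called embedding vectors.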

deep_columns = [
    tf.feature_column.indicator_column(workclass),
    tf.feature_column.indicator_column(education),
    tf.feature_column.indicator_column(age_buckets),
    tf.feature_column.indicator_column(gender),
    tf.feature_column.indicator_column(relationship),
    tf.feature_column.embedding_column(native_country, dimension=8),
    tf.feature_column.embedding_column(occupation, dimension=8),
]
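
The difference between indicator_column and embedding_column can be sketched with NumPy. This is only an illustration under simple assumptions: the helper names are made up, and the embedding table is random here, whereas in the real model its values are learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

gender_vocab = ["Female", "Male"]

def indicator(value, vocabulary):
    """Indicator (one-hot) encoding: one slot per category."""
    vec = np.zeros(len(vocabulary), dtype=np.float32)
    vec[vocabulary.index(value)] = 1.0
    return vec

# Embedding: a (num_buckets x dimension) table, one dense row per bucket ID.
NUM_BUCKETS, DIMENSION = 1000, 8
embedding_table = rng.normal(size=(NUM_BUCKETS, DIMENSION)).astype(np.float32)

def embed(bucket_id):
    """Look up the dense vector for a bucket ID."""
    return embedding_table[bucket_id]

print(indicator("Male", gender_vocab))  # 2 values, a single 1
print(embed(42).shape)                  # (8,)
```

For a column with 1000 hash buckets, one-hot encoding would feed 1000 mostly-zero inputs to the network, while the embedding feeds only 8 dense values.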

Defining the deep neural network model

The hyperparameters are set as follows:

HIDDEN_UNITS: [1024, 512]
LEARNING_RATE: 0.1
L1_REGULARIZATION_STRENGTH: 0.0001
L2_REGULARIZATION_STRENGTH: 0.0001

HIDDEN_UNITS = [1024, 512] #@param
LEARNING_RATE = 0.1 #@param
L1_REGULARIZATION_STRENGTH = 0.0001 #@param
L2_REGULARIZATION_STRENGTH = 0.0001 #@param

# Checkpoints go to a fresh temporary directory (requires `import tempfile`).
model_dir = tempfile.mkdtemp()
single_task_deep_model = tf.estimator.DNNClassifier(
    feature_columns=deep_columns,
    hidden_units=HIDDEN_UNITS,
    optimizer=tf.train.ProximalAdagradOptimizer(
      learning_rate=LEARNING_RATE,
      l1_regularization_strength=L1_REGULARIZATION_STRENGTH,
      l2_regularization_strength=L2_REGULARIZATION_STRENGTH),
    model_dir=model_dir)

Next, we train on the dataset's training set for 1000 steps:

STEPS = 1000 #@param

single_task_deep_model.train(
    input_fn=csv_to_pandas_input_fn(train_df, num_epochs=None, shuffle=True),
    steps=STEPS);
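
Note that steps counts batches, not passes over the data. Because num_epochs=None lets the input function keep cycling through the training set, training stops only when the step limit is reached. The arithmetic below shows roughly how many examples that covers; the training-set size is a placeholder, since this post does not state it.

```python
BATCH_SIZE = 100  # default batch_size of csv_to_pandas_input_fn
STEPS = 1000      # value used above

# Hypothetical size for illustration; substitute len(train_df).
NUM_TRAIN_EXAMPLES = 30000

examples_seen = BATCH_SIZE * STEPS
approx_epochs = examples_seen / NUM_TRAIN_EXAMPLES

print(examples_seen)            # 100000
print(round(approx_epochs, 2))  # 3.33
```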

Finally, the Playground prints output like the following, which shows that training has finished and reports the loss:

INFO:tensorflow:global_step/sec: 44.7529
INFO:tensorflow:loss = 31.369343, step = 901 (2.237 sec)
INFO:tensorflow:Saving checkpoints for 1000 into /tmp/tmptq25uz56/model.ckpt.
INFO:tensorflow:Loss for final step: 39.133175.
Deep neural net model is done fitting.

Evaluating the Neural Network

Use the following code to run the evaluation:

results = single_task_deep_model.evaluate(
    input_fn=csv_to_pandas_input_fn(test_df, num_epochs=1, shuffle=False),
    steps=None)
print("model directory = %s" % model_dir)
print("---- Results ----")
for key in sorted(results):
  print("%s: %s" % (key, results[key]))

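The results are shown below, and they look fairly good: accuracy is about 0.833, compared with a baseline accuracy of about 0.754.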

---- Results ----
accuracy: 0.8331341
accuracy_baseline: 0.7543161
auc: 0.8845352
auc_precision_recall: 0.70381314
average_loss: 0.35597283
global_step: 1000
label/mean: 0.24568394
loss: 35.502983
precision: 0.6960687
prediction/mean: 0.24560678
recall: 0.56945944
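
Two of these numbers can be cross-checked by hand. As far as I understand it, accuracy_baseline is the accuracy of always predicting the majority class, which follows directly from label/mean. The F1 score is not reported by evaluate(), but it can be derived from precision and recall. The variable names below are my own.

```python
# Values copied from the evaluation output above.
label_mean = 0.24568394
precision = 0.6960687
recall = 0.56945944
accuracy = 0.8331341

# Always predicting the majority class is right for max(p, 1 - p) of examples.
majority_baseline = max(label_mean, 1.0 - label_mean)

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(round(majority_baseline, 4))             # 0.7543
print(round(accuracy - majority_baseline, 4))  # gain over the baseline
print(round(f1, 4))
```

The recomputed baseline matches the reported accuracy_baseline, so the model's real gain is about 8 percentage points over always answering "not over 50K".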

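Building the Confusion Matrix

We build a confusion matrix to evaluate our results.

First we define what the matrix contains, that is, its cells (TN, TP, FN, FP) and the metrics derived from them: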

def compute_eval_metrics(references, predictions):
  tn, fp, fn, tp = confusion_matrix(references, predictions).ravel()
  precision = tp / float(tp + fp)
  recall = tp / float(tp + fn)
  false_positive_rate = fp / float(fp + tn)
  return precision, recall, false_positive_rate
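
To make the three formulas concrete, here is a small worked example that counts the four cells by hand instead of calling sklearn. The labels are made up for illustration, and confusion_counts is a hypothetical helper that returns the cells in the same (tn, fp, fn, tp) order as above.

```python
def confusion_counts(references, predictions):
    """Count (tn, fp, fn, tp) for binary labels, where 1 is the positive class."""
    tn = fp = fn = tp = 0
    for ref, pred in zip(references, predictions):
        if ref == 1 and pred == 1:
            tp += 1
        elif ref == 0 and pred == 1:
            fp += 1
        elif ref == 1 and pred == 0:
            fn += 1
        else:
            tn += 1
    return tn, fp, fn, tp

# Hypothetical labels for illustration: 1 means income over 50K.
references = [1, 1, 1, 0, 0, 0, 0, 1]
predictions = [1, 0, 1, 0, 1, 0, 0, 1]

tn, fp, fn, tp = confusion_counts(references, predictions)
precision = tp / (tp + fp)            # 3 / 4
recall = tp / (tp + fn)               # 3 / 4
false_positive_rate = fp / (fp + tn)  # 1 / 4

print(tn, fp, fn, tp)                          # 3 1 1 3
print(precision, recall, false_positive_rate)  # 0.75 0.75 0.25
```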

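Once the matrix is built, of course we want to look at it! So we reshape the data and set up the visualization (reusing the earlier chart style), then read the results off the chart: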

def plot_confusion_matrix(confusion_matrix, class_names, figsize = (8,6)):
    # We're taking our calculated binary confusion matrix that's already in form 
    # of an array and turning it into a Pandas DataFrame because it's a lot 
    # easier to work with when visualizing a heat map in Seaborn.
    df_cm = pd.DataFrame(
        confusion_matrix, index=class_names, columns=class_names, 
    )
    fig = plt.figure(figsize=figsize)
    
    # Combine the instance (numerical value) with its description
    strings = np.asarray([['True Positives', 'False Negatives'],
                          ['False Positives', 'True Negatives']])
    labels = (np.asarray(
        ["{0:d}\n{1}".format(value, string) for string, value in zip(
            strings.flatten(), confusion_matrix.flatten())])).reshape(2, 2)

    heatmap = sns.heatmap(df_cm, annot=labels, fmt="");
    heatmap.yaxis.set_ticklabels(
        heatmap.yaxis.get_ticklabels(), rotation=0, ha='right')
    heatmap.xaxis.set_ticklabels(
        heatmap.xaxis.get_ticklabels(), rotation=45, ha='right')
    plt.ylabel('References')
    plt.xlabel('Predictions')
    return fig

Observing Subgroups

Using the example code:

#@title Visualize Binary Confusion Matrix and Compute Evaluation Metrics Per Subgroup
CATEGORY  =  "gender" #@param {type:"string"}
SUBGROUP =  "Male" #@param {type:"string"}

# Given the defined subgroup, generate predictions and obtain the corresponding
# ground truth.
predictions_dict = single_task_deep_model.predict(input_fn=csv_to_pandas_input_fn(
    test_df.loc[test_df[CATEGORY] == SUBGROUP], num_epochs=1, shuffle=False))
predictions = []
for prediction_item in predictions_dict:
    predictions.append(prediction_item['class_ids'][0])
actuals = list(
    test_df.loc[test_df[CATEGORY] == SUBGROUP]['income_bracket'].apply(
        lambda x: '>50K' in x).astype(int))
classes = ['Over $50K', 'Less than $50K']

# To stay consistent, we have to flip the confusion 
# matrix around on both axes because sklearn's confusion matrix module by
# default is rotated.
rotated_confusion_matrix = np.fliplr(confusion_matrix(actuals, predictions))
rotated_confusion_matrix = np.flipud(rotated_confusion_matrix)

tb = widgets.TabBar(['Confusion Matrix', 'Evaluation Metrics'], location='top')

with tb.output_to('Confusion Matrix'):
  plot_confusion_matrix(rotated_confusion_matrix, classes);

with tb.output_to('Evaluation Metrics'):
  grid = widgets.Grid(2,3)

  p, r, fpr = compute_eval_metrics(actuals, predictions)

  with grid.output_to(0, 0):
    print('Precision ')
  with grid.output_to(1, 0):
    print(' %.4f ' % p)

  with grid.output_to(0, 1):
    print(' Recall ')
  with grid.output_to(1, 1):
    print(' %.4f ' % r)

  with grid.output_to(0, 2):
    print(' False Positive Rate ')
  with grid.output_to(1, 2):
    print(' %.4f ' % fpr)
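
The two flips in the code above are easy to lose track of, so here is a small NumPy check of what they do. The counts are hypothetical. For binary 0/1 labels sklearn puts actual classes on the rows and predicted classes on the columns, giving [[TN, FP], [FN, TP]]; flipping left-right and then up-down moves the true positives to the top-left corner, which is the layout plot_confusion_matrix labels.

```python
import numpy as np

# Hypothetical counts in sklearn's binary layout:
# rows are actual classes (0, 1), columns are predicted classes (0, 1).
tn, fp, fn, tp = 50, 10, 5, 35
sk_matrix = np.array([[tn, fp],
                      [fn, tp]])

# Same two flips as in the subgroup code.
rotated = np.flipud(np.fliplr(sk_matrix))

# Result is [[TP, FN], [FP, TN]].
print(rotated.tolist())  # [[35, 5], [10, 50]]
```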

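We find that using gender as the subgroup feature gives different results: males get 0.7490 and 0.4795, while females get 0.6787 and 0.3716 (presumably precision and recall, the first two metrics the code prints). The model therefore performs better for males than for females on these metrics.
This result is neither good nor bad in itself. It depends on whether we care about predictions for males, for females, or for the population as a whole; different viewpoints call for different adjustments, and everything depends on our goal.

Eyelash's voice:
After writing this up I am still a bit hazy on it all. It looks like I need a lot more practice to understand the subtleties.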