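Now we use TensorFlow to train our model.
The walkthrough covers these steps: setup, building features, turning age into a categorical feature, defining the model features, training a deep neural network, evaluating the network, building a confusion matrix, and examining subgroups.
Tool: Playground
This time we use the DNNClassifier class from TensorFlow's Estimator API. First the dataset, held in a pandas DataFrame, has to be turned into tensors with tf.estimator.inputs.pandas_input_fn(); only then can the rest of the pipeline consume it.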
def csv_to_pandas_input_fn(data, batch_size=100, num_epochs=1, shuffle=False):
    return tf.estimator.inputs.pandas_input_fn(
        x=data.drop('income_bracket', axis=1),
        y=data['income_bracket'].apply(lambda x: ">50K" in x).astype(int),
        batch_size=batch_size,
        num_epochs=num_epochs,
        shuffle=shuffle,
        num_threads=1)
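To make the label expression in the y= argument concrete, here is a minimal sketch in plain Python. The helper name to_binary_label and the sample strings are illustrative assumptions, not part of the original notebook; the variants with leading whitespace or a trailing period are hypothetical examples of why a substring test is more forgiving than an exact comparison.

```python
def to_binary_label(income_bracket):
    """Map an income_bracket string to 1 (over 50K) or 0 (otherwise)."""
    # Same substring test as the lambda passed to .apply() in the input function.
    return int(">50K" in income_bracket)


# Hypothetical raw values; the real column contents may differ.
samples = ["<=50K", ">50K", " >50K.", " <=50K."]
labels = [to_binary_label(s) for s in samples]
print(labels)
```

The model therefore learns a binary target in which 1 means "earns more than 50K".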
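TensorFlow needs the data mapped onto the model, so every feature has to be declared as a feature column through the tf.feature_column module. The categorical features are declared first; the numeric ones follow, using tf.feature_column.numeric_column().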
# Since we don't know the full range of possible values with occupation and
# native_country, we'll use categorical_column_with_hash_bucket() to help map
# each feature string into an integer ID.
occupation = tf.feature_column.categorical_column_with_hash_bucket(
    "occupation", hash_bucket_size=1000)
native_country = tf.feature_column.categorical_column_with_hash_bucket(
    "native_country", hash_bucket_size=1000)
# For the remaining categorical features, since we know what the possible values
# are, we can be more explicit and use categorical_column_with_vocabulary_list()
gender = tf.feature_column.categorical_column_with_vocabulary_list(
    "gender", ["Female", "Male"])
race = tf.feature_column.categorical_column_with_vocabulary_list(
    "race", [
        "White", "Asian-Pac-Islander", "Amer-Indian-Eskimo", "Other", "Black"
    ])
education = tf.feature_column.categorical_column_with_vocabulary_list(
    "education", [
        "Bachelors", "HS-grad", "11th", "Masters", "9th",
        "Some-college", "Assoc-acdm", "Assoc-voc", "7th-8th",
        "Doctorate", "Prof-school", "5th-6th", "10th", "1st-4th",
        "Preschool", "12th"
    ])
marital_status = tf.feature_column.categorical_column_with_vocabulary_list(
    "marital_status", [
        "Married-civ-spouse", "Divorced", "Married-spouse-absent",
        "Never-married", "Separated", "Married-AF-spouse", "Widowed"
    ])
relationship = tf.feature_column.categorical_column_with_vocabulary_list(
    "relationship", [
        "Husband", "Not-in-family", "Wife", "Own-child", "Unmarried",
        "Other-relative"
    ])
workclass = tf.feature_column.categorical_column_with_vocabulary_list(
    "workclass", [
        "Self-emp-not-inc", "Private", "State-gov", "Federal-gov",
        "Local-gov", "?", "Self-emp-inc", "Without-pay", "Never-worked"
    ])
Set up the numeric feature columns:
age = tf.feature_column.numeric_column("age")
fnlwgt = tf.feature_column.numeric_column("fnlwgt")
education_num = tf.feature_column.numeric_column("education_num")
capital_gain = tf.feature_column.numeric_column("capital_gain")
capital_loss = tf.feature_column.numeric_column("capital_loss")
hours_per_week = tf.feature_column.numeric_column("hours_per_week")
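The two kinds of categorical column used above turn strings into integer IDs in different ways. The following is a rough sketch of the idea in plain Python, not TensorFlow's implementation: hash_bucket_id and vocabulary_id are hypothetical helper names, and md5 is only a stand-in for the hash function TensorFlow actually uses.

```python
import hashlib


def hash_bucket_id(value, hash_bucket_size=1000):
    """Map any string to an integer in [0, hash_bucket_size)."""
    # md5 is a stand-in; TensorFlow uses its own fingerprint function.
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size


def vocabulary_id(value, vocabulary, default_value=-1):
    """Map a string to its position in a known vocabulary list."""
    return vocabulary.index(value) if value in vocabulary else default_value


print(hash_bucket_id("Sales"))                       # some ID below 1000
print(vocabulary_id("Male", ["Female", "Male"]))     # position in the list
print(vocabulary_id("Unknown", ["Female", "Male"]))  # falls back to the default
```

Hashing needs no list of values up front, but two different strings can land in the same bucket; a vocabulary list avoids collisions at the cost of having to know every value in advance.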
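With the data in the format TensorFlow expects, the next step is to build a categorical feature. Here we use age: instead of feeding it in as a raw number, we bucket it so that it becomes a categorical (more precisely, ordinal) feature.
The bucket boundaries are 18, 25, 30, 35, 40, 45, 50, 55, 60, and 65: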
age_buckets = tf.feature_column.bucketized_column(
    age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
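As a small illustration of what bucketing does, the sketch below reproduces the mapping with the standard library. It assumes each bucket includes its lower boundary and excludes its upper one, so ten boundaries give eleven buckets; age_bucket is a hypothetical helper name.

```python
from bisect import bisect_right

AGE_BOUNDARIES = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]


def age_bucket(age, boundaries=AGE_BOUNDARIES):
    """Return the bucket index for an age.

    Bucket 0 holds ages below 18, bucket 1 holds 18 to 24, and the last
    bucket (index 10) holds ages 65 and above.
    """
    return bisect_right(boundaries, age)


for age in (17, 18, 24, 25, 64, 65):
    print(age, age_bucket(age))
```

The model then sees only the bucket index, so a 26-year-old and a 29-year-old are treated identically.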
Before training, we have to choose which features the model will use. Here gender is picked as the subgroup and stored in the subgroup_variables list.
variables = [native_country, education, occupation, workclass,
             relationship, age_buckets]
subgroup_variables = [gender]
feature_columns = variables + subgroup_variables
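Next we define a feed-forward neural network with two hidden layers. The high-dimensional categorical features (native_country and occupation) are converted into low-dimensional, dense, real-valued vectors known as embedding vectors; the remaining categorical features are wrapped in indicator_column, which one-hot encodes them.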
deep_columns = [
    tf.feature_column.indicator_column(workclass),
    tf.feature_column.indicator_column(education),
    tf.feature_column.indicator_column(age_buckets),
    tf.feature_column.indicator_column(gender),
    tf.feature_column.indicator_column(relationship),
    tf.feature_column.embedding_column(native_country, dimension=8),
    tf.feature_column.embedding_column(occupation, dimension=8),
]
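The difference between the two wrappers in deep_columns can be sketched in plain Python. This is a conceptual illustration only: one_hot and make_embedding_table are hypothetical helpers, the random starting values are an assumption, and in the real model the embedding rows are learned during training.

```python
import random


def one_hot(index, depth):
    """Indicator-style encoding: zeros everywhere except a single 1."""
    return [1.0 if i == index else 0.0 for i in range(depth)]


def make_embedding_table(num_buckets, dimension, seed=0):
    """One row of `dimension` floats per category ID (random start values)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.1) for _ in range(dimension)]
            for _ in range(num_buckets)]


relationship_vector = one_hot(2, depth=6)          # 6 known relationship values
occupation_table = make_embedding_table(1000, 8)   # 1000 hash buckets, dimension 8
occupation_vector = occupation_table[417]          # 417 is an arbitrary bucket ID

print(len(relationship_vector), len(occupation_vector))
```

A one-hot vector grows with the number of categories, whereas an embedding keeps the input at a fixed small width (8 here) no matter how many hash buckets there are.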
The model is defined as follows:
HIDDEN_UNITS = [1024, 512] #@param
LEARNING_RATE = 0.1 #@param
L1_REGULARIZATION_STRENGTH = 0.0001 #@param
L2_REGULARIZATION_STRENGTH = 0.0001 #@param
model_dir = tempfile.mkdtemp()
single_task_deep_model = tf.estimator.DNNClassifier(
    feature_columns=deep_columns,
    hidden_units=HIDDEN_UNITS,
    optimizer=tf.train.ProximalAdagradOptimizer(
        learning_rate=LEARNING_RATE,
        l1_regularization_strength=L1_REGULARIZATION_STRENGTH,
        l2_regularization_strength=L2_REGULARIZATION_STRENGTH),
    model_dir=model_dir)
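To get a feel for the size of this network, the sketch below adds up the input width and the number of trainable weights. The vocabulary sizes are counted from the lists defined earlier; the single output logit for the binary classifier is an assumption about how DNNClassifier builds its head, and dense_parameter_count is a hypothetical helper.

```python
# Width contributed by each one-hot (indicator) column.
indicator_sizes = {
    "workclass": 9,      # entries in the workclass vocabulary list
    "education": 16,     # entries in the education vocabulary list
    "age_buckets": 11,   # 10 boundaries give 11 buckets
    "gender": 2,
    "relationship": 6,
}
# Width contributed by each embedding column.
embedding_dims = {"native_country": 8, "occupation": 8}
HASH_BUCKET_SIZE = 1000
HIDDEN_UNITS = [1024, 512]
N_LOGITS = 1  # assumption: one logit for a two-class problem

input_width = sum(indicator_sizes.values()) + sum(embedding_dims.values())


def dense_parameter_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


dense_params = dense_parameter_count([input_width] + HIDDEN_UNITS + [N_LOGITS])
embedding_params = sum(HASH_BUCKET_SIZE * d for d in embedding_dims.values())

print("input width:", input_width)
print("dense parameters:", dense_params)
print("embedding parameters:", embedding_params)
```

Almost all of the weights sit between the two hidden layers, which is where the L1 and L2 regularization settings have the most to act on.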
Next we train on the training split for 1,000 steps:
STEPS = 1000 #@param
single_task_deep_model.train(
    input_fn=csv_to_pandas_input_fn(train_df, num_epochs=None, shuffle=True),
    steps=STEPS);
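The interplay between num_epochs=None and steps=STEPS is easy to miss, so here is a simplified sketch of it. The batches generator is a hypothetical stand-in for the input function, not how pandas_input_fn is implemented, and the 250-row toy dataset is made up for the example.

```python
import itertools
import random


def batches(rows, batch_size=100, num_epochs=1, shuffle=False, seed=0):
    """Yield batches; num_epochs=None repeats the data indefinitely."""
    rng = random.Random(seed)
    epochs = itertools.count() if num_epochs is None else range(num_epochs)
    for _ in epochs:
        order = list(rows)
        if shuffle:
            rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            yield order[start:start + batch_size]


toy_rows = range(250)  # hypothetical dataset of 250 rows

# Training: endless, shuffled stream that is cut off after 1000 steps.
train_batches = list(itertools.islice(
    batches(toy_rows, num_epochs=None, shuffle=True), 1000))

# Evaluation: a single ordered pass that stops when the data runs out.
eval_batches = list(batches(toy_rows, num_epochs=1, shuffle=False))

print(len(train_batches), len(eval_batches))
```

In other words, during training it is the step count, not the dataset size, that decides when to stop, while evaluation with num_epochs=1 visits each row once.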
When training finishes, the playground prints output like the following, showing that the run completed and what the loss was:
INFO:tensorflow:global_step/sec: 44.7529
INFO:tensorflow:loss = 31.369343, step = 901 (2.237 sec)
INFO:tensorflow:Saving checkpoints for 1000 into /tmp/tmptq25uz56/model.ckpt.
INFO:tensorflow:Loss for final step: 39.133175.
Deep neural net model is done fitting.
Evaluate the model with the following code:
results = single_task_deep_model.evaluate(
    input_fn=csv_to_pandas_input_fn(test_df, num_epochs=1, shuffle=False),
    steps=None)
print("model directory = %s" % model_dir)
print("---- Results ----")
for key in sorted(results):
    print("%s: %s" % (key, results[key]))
The results are shown below and look reasonably good: accuracy is about 0.833 against a baseline of about 0.754.
---- Results ----
accuracy: 0.8331341
accuracy_baseline: 0.7543161
auc: 0.8845352
auc_precision_recall: 0.70381314
average_loss: 0.35597283
global_step: 1000
label/mean: 0.24568394
loss: 35.502983
precision: 0.6960687
prediction/mean: 0.24560678
recall: 0.56945944
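Two of these numbers can be cross-checked by hand. The sketch below recomputes the baseline as the accuracy of always predicting the majority class, and derives an F1 score from the reported precision and recall. The F1 score is an addition for illustration; it is not part of the original output.

```python
# Values copied from the evaluation output above.
results = {
    "accuracy": 0.8331341,
    "label/mean": 0.24568394,
    "precision": 0.6960687,
    "recall": 0.56945944,
}

# Always predicting the majority class is right for this share of examples.
label_mean = results["label/mean"]
majority_baseline = max(label_mean, 1.0 - label_mean)

# How much the trained model improves on that trivial strategy.
accuracy_gain = results["accuracy"] - majority_baseline

# Harmonic mean of precision and recall.
p, r = results["precision"], results["recall"]
f1 = 2 * p * r / (p + r)

print("majority baseline: %.7f" % majority_baseline)
print("accuracy gain:     %.4f" % accuracy_gain)
print("F1:                %.4f" % f1)
```

The recomputed baseline agrees with the reported accuracy_baseline, and the recall of roughly 0.57 shows that the model still misses many of the over-50K cases despite the decent accuracy.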
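Next we build a confusion matrix to assess the results.
We start with the contents of the matrix, that is, its four cells (TN, TP, FN, FP), and compute precision, recall, and the false positive rate from them: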
def compute_eval_metrics(references, predictions):
    # confusion_matrix comes from sklearn.metrics; for binary labels, ravel()
    # returns the cells in the order tn, fp, fn, tp.
    tn, fp, fn, tp = confusion_matrix(references, predictions).ravel()
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    false_positive_rate = fp / float(fp + tn)
    return precision, recall, false_positive_rate
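For readers who want to see the arithmetic without scikit-learn, here is a dependency-free sketch of the same three metrics on a made-up toy example. The function names confusion_counts and eval_metrics and the eight sample labels are hypothetical.

```python
def confusion_counts(references, predictions):
    """Count the four confusion-matrix cells for 0/1 labels."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(references, predictions):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def eval_metrics(references, predictions):
    """Return precision, recall, and false positive rate."""
    tn, fp, fn, tp = confusion_counts(references, predictions)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    false_positive_rate = fp / (fp + tn)
    return precision, recall, false_positive_rate


# Toy data: 4 actual positives, 4 actual negatives.
references = [1, 1, 1, 0, 0, 0, 0, 1]
predictions = [1, 0, 1, 0, 1, 0, 0, 0]

counts = confusion_counts(references, predictions)
precision, recall, fpr = eval_metrics(references, predictions)
print(counts, precision, recall, fpr)
```

Like the version above, this sketch divides by zero if a subgroup has no predicted positives or no actual positives, so very small subgroups would need a guard.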
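With the metrics in place, the next step is of course to look at them. The function below reshapes the data and renders the matrix as a heat map (reusing the charting setup from earlier), so the results can be read straight off the chart: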
def plot_confusion_matrix(confusion_matrix, class_names, figsize = (8,6)):
    # We're taking our calculated binary confusion matrix that's already in form
    # of an array and turning it into a Pandas DataFrame because it's a lot
    # easier to work with when visualizing a heat map in Seaborn.
    df_cm = pd.DataFrame(
        confusion_matrix, index=class_names, columns=class_names,
    )
    fig = plt.figure(figsize=figsize)
    # Combine the instance (numerical value) with its description
    strings = np.asarray([['True Positives', 'False Negatives'],
                          ['False Positives', 'True Negatives']])
    labels = (np.asarray(
        ["{0:d}\n{1}".format(value, string) for string, value in zip(
            strings.flatten(), confusion_matrix.flatten())])).reshape(2, 2)
    heatmap = sns.heatmap(df_cm, annot=labels, fmt="")
    heatmap.yaxis.set_ticklabels(
        heatmap.yaxis.get_ticklabels(), rotation=0, ha='right')
    heatmap.xaxis.set_ticklabels(
        heatmap.xaxis.get_ticklabels(), rotation=45, ha='right')
    plt.ylabel('References')
    plt.xlabel('Predictions')
    return fig
Now run the example code:
#@title Visualize Binary Confusion Matrix and Compute Evaluation Metrics Per Subgroup
CATEGORY = "gender" #@param {type:"string"}
SUBGROUP = "Male" #@param {type:"string"}
# Given the defined subgroup, generate predictions and obtain the corresponding
# ground truth.
predictions_dict = single_task_deep_model.predict(input_fn=csv_to_pandas_input_fn(
    test_df.loc[test_df[CATEGORY] == SUBGROUP], num_epochs=1, shuffle=False))
predictions = []
for prediction_item in predictions_dict:
    predictions.append(prediction_item['class_ids'][0])
actuals = list(
    test_df.loc[test_df[CATEGORY] == SUBGROUP]['income_bracket'].apply(
        lambda x: '>50K' in x).astype(int))
classes = ['Over $50K', 'Less than $50K']
# To stay consistent, we have to flip the confusion
# matrix around on both axes because sklearn's confusion matrix module by
# default is rotated.
rotated_confusion_matrix = np.fliplr(confusion_matrix(actuals, predictions))
rotated_confusion_matrix = np.flipud(rotated_confusion_matrix)
# widgets is assumed to be google.colab.widgets (available only in Colab).
tb = widgets.TabBar(['Confusion Matrix', 'Evaluation Metrics'], location='top')
with tb.output_to('Confusion Matrix'):
    plot_confusion_matrix(rotated_confusion_matrix, classes);
with tb.output_to('Evaluation Metrics'):
    grid = widgets.Grid(2,3)
    p, r, fpr = compute_eval_metrics(actuals, predictions)
    with grid.output_to(0, 0):
        print('Precision ')
    with grid.output_to(1, 0):
        print(' %.4f ' % p)
    with grid.output_to(0, 1):
        print(' Recall ')
    with grid.output_to(1, 1):
        print(' %.4f ' % r)
    with grid.output_to(0, 2):
        print(' False Positive Rate ')
    with grid.output_to(1, 2):
        print(' %.4f ' % fpr)
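The double flip in the code above is what lines the matrix up with the cell labels used in plot_confusion_matrix. The sketch below shows the effect on a plain nested list; rotate_confusion_matrix is a hypothetical helper and the four counts are made up.

```python
def rotate_confusion_matrix(matrix):
    """Reverse the column order, then the row order (fliplr, then flipud)."""
    flipped_left_right = [row[::-1] for row in matrix]
    return flipped_left_right[::-1]


# Made-up counts for illustration.
tn, fp, fn, tp = 3, 1, 2, 2

# For labels 0 and 1, scikit-learn puts actual classes on the rows and
# predicted classes on the columns, with the negative class first.
sklearn_layout = [[tn, fp],
                  [fn, tp]]

rotated = rotate_confusion_matrix(sklearn_layout)
print(rotated)
```

After the rotation the top-left cell holds the true positives and the bottom-right cell the true negatives, which matches the order of the labels ('True Positives', 'False Negatives', 'False Positives', 'True Negatives') in the plotting function.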
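Using gender as the subgroup feature gives different results for each group: for men the values are 0.7490 and 0.4795, while for women they are 0.6787 and 0.3716. On these numbers the model performs better on the male subgroup than on the female one.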
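A result like this is not good or bad in itself. It depends on whether we care about predictions for men, for women, or for the population as a whole; different viewpoints call for different adjustments, and all of it comes down to our goal.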
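Eyelash's note:
Even after writing this up I still only half understand it. It looks like more practice is needed before the finer points sink in.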