Google ML Course Notes - Building an ML Model with Neural Networks

11th iThome Ironman Contest

DAY 9

Google Developers Machine Learning

Part 9 of the "Google Machine Learning" study notes series


Jason Hung

2019-09-25 19:53:16


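I only finished the lab walkthrough (Using Neural Networks to Build a ML Model) this afternoon and haven't had time to digest it yet, so I'm looking things up as I write. The notebook used in the course, c_neuralnetwork.ipynb, can be found on GitHub. I also put a copy in Colab, which you can copy from here and play with.

I have to finish the course by the end of this month, so I'll come back and fill in the details when I have time.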

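This tutorial uses the DNNRegressor class in TensorFlow to predict median house values.
The data comes from the 1990 California census. Each feature describes a city block, for example the total number of rooms in the block or the block's total population.

The program looks fairly short, so let's try to read through it.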

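First, import the libraries we need.

shutil handles high-level file operations: moving and copying files or directories, including their permissions.
numpy handles multi-dimensional arrays and matrix operations.
pandas handles data analysis and processing.
TensorFlow handles machine learning for a wide range of tasks.

If a library is going to be used in many places later, you can give it a shorter name with as.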

import shutil
import numpy
import pandas
import tensorflow as tf

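Set up a few options first

We want to follow what TensorFlow reports while the model trains, so we set TensorFlow's log level to INFO (there are five levels: DEBUG, INFO, WARN, ERROR, and FATAL). We also want pandas to display no more than 10 rows (when there are more, the middle rows are left out and replaced with ...) and to show values to one decimal place.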

tf.logging.set_verbosity(tf.logging.INFO)
pandas.options.display.max_rows = 10
pandas.options.display.float_format = '{:.1f}'.format
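
To see what the two pandas options do, here is a small sketch on a made-up DataFrame. The column name `value` and the data are hypothetical and not part of the course notebook. It uses `pandas.option_context` so the settings only apply inside the `with` block instead of globally.

```python
import pandas

# Hypothetical data: 20 rows of floats with many decimal places.
demo = pandas.DataFrame({"value": [i / 3 for i in range(20)]})

# Apply the same two display settings, but only inside this block.
with pandas.option_context("display.max_rows", 10,
                           "display.float_format", "{:.1f}".format):
    text = repr(demo)

# The middle rows are elided and the values are shown to one decimal place.
print(text)
```

Setting `pandas.options.display...` directly, as the notebook does, has the same effect for the rest of the session.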

Load the dataset we need

df = pandas.read_csv("https://storage.googleapis.com/ml_universities/california_housing_train.csv", sep=",")

It has 17,000 rows in total and 9 columns:

longitude
latitude
housing_median_age
total_rooms
total_bedrooms
population
households
median_income
median_house_value

After loading, check the data first to see what it looks like:

df.head()

You should see a table; by default it lists the first 5 rows.

Next, let's look at the summary statistics.

df.describe()

It shows, for each column:

count
mean
standard deviation
minimum
25th percentile
50th percentile
75th percentile
maximum
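
Here is a minimal sketch of those eight statistics on a toy column. The DataFrame `toy` and its values are made up so that the numbers are easy to check by hand.

```python
import pandas

# Hypothetical toy column; values chosen so the statistics are easy to verify.
toy = pandas.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})
stats = toy.describe()

# The row labels correspond to the eight statistics listed above.
print(list(stats.index))
print(stats)
```

Note that pandas reports the sample standard deviation (dividing by n - 1), and the 50th percentile is the median.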

Next, add three columns that we will use later.

df['num_rooms'] = df['total_rooms'] / df['households']
df['num_bedrooms'] = df['total_bedrooms'] / df['households']
df['persons_per_house'] = df['population'] / df['households']
df.describe()

Drop the columns we no longer need.

df.drop(['total_rooms', 'total_bedrooms', 'population', 'households'], axis = 1, inplace = True)
df.describe()
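
The idea behind the new columns is to turn block-level totals into per-household averages, so that blocks of different sizes become comparable. A small sketch on a hypothetical two-row DataFrame (the numbers are made up):

```python
import pandas

# Hypothetical two-block dataset; the numbers are made up for illustration.
toy = pandas.DataFrame({
    "total_rooms": [1000.0, 600.0],
    "households": [250.0, 200.0],
})

# Block total divided by household count gives a per-household average.
toy["num_rooms"] = toy["total_rooms"] / toy["households"]

# Without inplace=True, drop returns a new DataFrame and leaves `toy` as it was.
per_household = toy.drop(["total_rooms", "households"], axis=1)
print(per_household)
```

The notebook uses `inplace = True` instead, which modifies `df` directly rather than returning a copy.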

# Plain numeric feature columns, one per name in the comma-separated list.
featcols = {
  colname : tf.feature_column.numeric_column(colname)
    for colname in 'housing_median_age,median_income,num_rooms,num_bedrooms,persons_per_house'.split(',')
}

# Bucketize longitude: split its range at 5 evenly spaced boundaries.
featcols['longitude'] = tf.feature_column.bucketized_column(
        tf.feature_column.numeric_column('longitude'),
        numpy.linspace(-124.3, -114.3, 5).tolist())

# Bucketize latitude: split its range at 10 evenly spaced boundaries.
featcols['latitude'] = tf.feature_column.bucketized_column(
        tf.feature_column.numeric_column('latitude'),
        numpy.linspace(32.5, 42, 10).tolist())

featcols.keys()
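
The notes don't explain bucketizing, so here is a rough sketch of the idea in plain numpy. `numpy.digitize` is only a stand-in to illustrate how a value maps to a bucket; it is not what TensorFlow calls internally, and the sample longitudes are hypothetical.

```python
import numpy

# Same boundaries as the longitude column: 5 evenly spaced points.
boundaries = numpy.linspace(-124.3, -114.3, 5).tolist()
print(boundaries)

# Hypothetical longitudes, from the west of the state to the east.
longitudes = [-122.4, -118.2, -114.5]

# For each value, find the index of the bucket it falls into.
# 5 boundaries produce 6 buckets (indices 0 to 5), counting both open ends.
bucket_ids = numpy.digitize(longitudes, boundaries)
print(bucket_ids.tolist())
```

Treating longitude and latitude as buckets rather than raw numbers lets the model learn a separate effect for each region instead of assuming that price changes smoothly with the coordinate.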

# Split into train and eval
msk = numpy.random.rand(len(df)) < 0.8
traindf = df[msk]
evaldf = df[~msk]

SCALE = 100000
BATCH_SIZE= 100
OUTDIR = './housing_trained'
train_input_fn = tf.estimator.inputs.pandas_input_fn(
        x = traindf[list(featcols.keys())],
        y = traindf["median_house_value"] / SCALE,
        num_epochs = None,
        batch_size = BATCH_SIZE,
        shuffle = True)
eval_input_fn = tf.estimator.inputs.pandas_input_fn(
        x = evaldf[list(featcols.keys())],
        y = evaldf["median_house_value"] / SCALE,  # note the scaling
        num_epochs = 1, 
        batch_size = len(evaldf), 
        shuffle=False)
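
The train/eval split above relies on a random boolean mask. A minimal sketch of that trick on its own, with a fixed seed and a hypothetical row count so that it is repeatable:

```python
import numpy

rng = numpy.random.RandomState(0)  # fixed seed so the sketch is repeatable
n_rows = 1000                      # hypothetical dataset size

# Each row draws a uniform number in [0, 1); about 80% fall below 0.8.
mask = rng.rand(n_rows) < 0.8
values = numpy.arange(n_rows)

train = values[mask]        # rows where the mask is True
evaluation = values[~mask]  # the remaining rows

print(len(train), len(evaluation))
```

Because the mask is random, the split is only approximately 80/20, and every row lands in exactly one of the two sets.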

Here the notebook builds two models for prediction (a Linear Regressor and a DNN Regressor), but we'll only look at the DNN Regressor for now.

# DNN Regressor
def train_and_evaluate(output_dir, num_train_steps):
  myopt = tf.train.FtrlOptimizer(learning_rate = 0.01) # note the learning rate
  estimator = tf.estimator.DNNRegressor(
        model_dir = output_dir,
        hidden_units = [100, 50, 20],
        feature_columns = featcols.values(),
        optimizer = myopt,
        dropout = 0.1)
  
  #Add rmse evaluation metric
  def rmse(labels, predictions):
    pred_values = tf.cast(predictions['predictions'],tf.float64)
    return {'rmse': tf.metrics.root_mean_squared_error(labels*SCALE,pred_values*SCALE)}
  
  estimator = tf.contrib.estimator.add_metrics(estimator,rmse)
  
  train_spec=tf.estimator.TrainSpec(
        input_fn = train_input_fn,
        max_steps = num_train_steps)
  
  eval_spec=tf.estimator.EvalSpec(
        input_fn = eval_input_fn,
        steps = None,
        start_delay_secs = 1, # start evaluating after N seconds
        throttle_secs = 10,  # evaluate every N seconds
        )
        
  tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
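
The custom rmse metric multiplies both labels and predictions by SCALE, so the reported error is in dollars rather than in units of 100,000. Here is a rough numpy sketch of that arithmetic. It is a plain stand-in for `tf.metrics.root_mean_squared_error`, and the house values and predictions are made up.

```python
import numpy

SCALE = 100000  # same constant as in the notebook

# Hypothetical house values in dollars and model outputs in scaled units.
true_dollars = numpy.array([200000.0, 350000.0, 500000.0])
pred_scaled = numpy.array([2.1, 3.3, 5.2])

labels_scaled = true_dollars / SCALE  # what the model is trained on

# RMSE in scaled units.
rmse_scaled = numpy.sqrt(numpy.mean((labels_scaled - pred_scaled) ** 2))

# RMSE in dollars: scale both sides back up before taking the difference.
rmse_dollars = numpy.sqrt(
    numpy.mean((labels_scaled * SCALE - pred_scaled * SCALE) ** 2))

print(rmse_scaled, rmse_dollars)
```

The two numbers differ exactly by a factor of SCALE, which is why the metric can be reported in dollars even though training uses the scaled labels.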

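Before running, clear out the data from the previous run.

Use shutil's rmtree to delete the output directory (OUTDIR) left by the previous run.
Clear the FileWriter cache so that nothing left unwritten from the previous run ends up in the TensorBoard event files.

Call train_and_evaluate to start training and evaluation.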

# Run training    
shutil.rmtree(OUTDIR, ignore_errors = True) 
tf.summary.FileWriterCache.clear() # ensure filewriter cache is clear for TensorBoard events file
# Integer division keeps the step count an int (about 100 passes over the training data).
train_and_evaluate(OUTDIR, num_train_steps = (100 * len(traindf)) // BATCH_SIZE)
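
The step count passed to train_and_evaluate is just "how many batches does it take to go over the training data 100 times". A small sketch of that arithmetic, where the helper name `steps_for_epochs` and the row count are hypothetical:

```python
def steps_for_epochs(num_rows, batch_size, epochs):
    """Return how many batches cover the data `epochs` times (rounded down)."""
    return (epochs * num_rows) // batch_size

# Hypothetical size: roughly 80% of 17,000 rows ends up in the training split.
num_train_rows = 13600
num_steps = steps_for_epochs(num_train_rows, batch_size=100, epochs=100)
print(num_steps)
```

With a batch size of 100 and 100 epochs the two factors cancel, so the number of steps equals the number of training rows.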

OK, day 9, still working at it.