## [Day 8] TensorFlow programming exercises

Ref.: TensorFlow exercises

## Quick Introduction to pandas

• Get to know DataFrame and Series
• Manipulate DataFrame and Series
• Import data from CSV
• Shuffle data with reindex

``````from __future__ import print_function

import pandas as pd
pd.__version__
``````

A DataFrame is made up of rows and named columns, while a Series is a single column. In other words, a DataFrame consists of one or more Series plus a name for each Series.

``````city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
population = pd.Series([852469, 1015785, 485199]) # What happens if one value is missing?

pd.DataFrame({ 'City name': city_names, 'Population': population })
``````
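The comment in the block above asks what happens when one Series has fewer values. A minimal sketch, using a deliberately shortened population Series (the names `short_population` and `incomplete` are illustrative, not from the original exercise): pandas aligns the Series on their index and fills the missing entry with NaN instead of raising an error, which also turns the integer column into a float column.

```python
import pandas as pd

city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
# Deliberately one value short (illustrative only).
short_population = pd.Series([852469, 1015785])

# The Series are aligned on their integer index; index 2 has no population.
incomplete = pd.DataFrame({'City name': city_names, 'Population': short_population})
print(incomplete)
print(incomplete['Population'].dtype)
```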

``````california_housing_dataframe = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv", sep=",")
california_housing_dataframe.describe()
``````

#### Next, some operations on a DataFrame

``````# I modified parts of the code to highlight some subtle differences
cities = pd.DataFrame({ 'City name': city_names, 'Population': population })

# sub-dataframe
print(type(cities[['City name']]))
print(cities[['City name']])

# Series
print(type(cities['City name']))
print(cities['City name'])
``````

As you can see, `[[]]` returns a DataFrame, while `[]` returns a Series.

``````# The second element of the 'City name' Series in the DataFrame
print(type(cities['City name'][1]))
cities['City name'][1]

# Take 2 rows of the DataFrame, starting from row 0
print(type(cities[0:2]))
cities[0:2]
``````

``````import numpy as np

# Take the natural log of the population Series
np.log(population)

# Divide every element of the Series by 1000
population / 1000.

# Apply an element-wise operation (val > 1000000)
population.apply(lambda val: val > 1000000)
``````

``````cities['Area square miles'] = pd.Series([46.87, 176.53, 97.92])
# 'Population density' does not exist in the original cities DataFrame; assigning to it creates the column
cities['Population density'] = cities['Population'] / cities['Area square miles']
cities
``````

Exercise 1: Add a new column that indicates whether the city name starts with "San" and the area is greater than 50.
Solution: open it to see the answer.
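One possible solution, sketched on a rebuilt copy of the `cities` frame; the new column name `Is wide and has saint name` is just an illustrative label. Note that boolean Series are combined with `&`, not with the Python `and` keyword.

```python
import pandas as pd

cities = pd.DataFrame({
    'City name': pd.Series(['San Francisco', 'San Jose', 'Sacramento']),
    'Area square miles': pd.Series([46.87, 176.53, 97.92]),
})

# Element-wise test on the name, then combine the two boolean Series with &.
starts_with_san = cities['City name'].apply(lambda name: name.startswith('San'))
cities['Is wide and has saint name'] = starts_with_san & (cities['Area square miles'] > 50)
print(cities)
```

Only San Jose satisfies both conditions: San Francisco is too small, and Sacramento starts with "Sac", not "San".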

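``````# print is added everywhere here; the original example omits it because a notebook displays the value of a cell's last expression (function definitions excepted)
print(city_names.index) # A Series has an index
print(cities.index)     # A DataFrame has an index too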
cities.reindex(np.random.permutation(cities.index))
``````

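Exercise 2: Experiment with what happens if reindex is given more entries than there are rows, or if the array passed in contains elements that are not in the original index.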
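A minimal sketch of one such experiment, assuming the same three-row `cities` frame: index values that do not exist in the original index are allowed, and `reindex` creates rows filled with NaN for them.

```python
import pandas as pd

cities = pd.DataFrame({
    'City name': pd.Series(['San Francisco', 'San Jose', 'Sacramento']),
    'Population': pd.Series([852469, 1015785, 485199]),
})

# 4 and 5 are not in the original index (0, 1, 2).
reindexed = cities.reindex([0, 4, 5, 2])
print(reindexed)
```

Rows 0 and 2 keep their data, while rows 4 and 5 contain only missing values.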

## First Steps with TensorFlow

• LinearRegressor
• Predict
• Evaluate with Root Mean Squared Error
• Improve hyperparameters

``````from __future__ import print_function

import math

from IPython import display
from matplotlib import cm
from matplotlib import gridspec
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import metrics
import tensorflow as tf
from tensorflow.python.data import Dataset

tf.logging.set_verbosity(tf.logging.ERROR)
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format

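# Import data
california_housing_dataframe = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv", sep=",")

# Prepare to run SGD: shuffle the rows
california_housing_dataframe = california_housing_dataframe.reindex(
    np.random.permutation(california_housing_dataframe.index))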
# Scale median_house_value values
california_housing_dataframe["median_house_value"] /= 1000.0
``````

1. Categorical Data: data that falls into discrete categories
2. Numerical Data: data that is numeric

In TensorFlow, a feature's definition is first stored in a feature column construct, such as the `tf.feature_column.numeric_column` used below.

``````# Define the input feature: total_rooms.
my_feature = california_housing_dataframe[["total_rooms"]] #dataframe

# Configure a numeric feature column for total_rooms.
feature_columns = [tf.feature_column.numeric_column("total_rooms")]
print(feature_columns) # output: [_NumericColumn(key='total_rooms', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)]
``````

``````targets = california_housing_dataframe["median_house_value"]
# Use gradient descent (TF 1.x API) as the optimizer for training the model.
# Set a learning rate of 0.0000001 for Gradient Descent.
my_optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.0000001)

# Configure the linear regression model with our feature columns and optimizer.
linear_regressor = tf.estimator.LinearRegressor(
    feature_columns=feature_columns,
    optimizer=my_optimizer
)
``````

``````def my_input_fn(features, targets, batch_size=1, shuffle=True, num_epochs=None):
    """Input function that feeds data to a linear regression model of one feature.

    Args:
      features: pandas DataFrame of features
      targets: pandas DataFrame of targets
      batch_size: Size of batches to be passed to the model
      shuffle: True or False. Whether to shuffle the data.
      num_epochs: Number of epochs for which data should be repeated. None = repeat indefinitely
    Returns:
      Tuple of (features, labels) for next data batch
    """

    # Convert pandas data into a dict of np arrays.

    # print(dict(features).items())
    # print(dict(features).items()[0])
    # print(dict(features).items()[0][1])
    # print(np.array(dict(features).items()[0][1]))

    features = {key: np.array(value) for key, value in dict(features).items()}
    # Construct a dataset, and configure batching/repeating.
    ds = Dataset.from_tensor_slices((features, targets))  # warning: 2GB limit
    ds = ds.batch(batch_size).repeat(num_epochs)

    # Shuffle the data, if specified.
    if shuffle:
        ds = ds.shuffle(buffer_size=10000)

    # Return the next batch of data.
    features, labels = ds.make_one_shot_iterator().get_next()
    return features, labels
``````
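The commented-out `print` calls in `my_input_fn` hint at what the dict comprehension produces. A small sketch with a made-up three-row frame (the values are illustrative stand-ins, not read from the dataset); in Python 3, `.items()` has to be wrapped in `list(...)` before it can be indexed:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for my_feature.
features = pd.DataFrame({'total_rooms': [5612.0, 7650.0, 720.0]})

# dict(DataFrame) maps each column name to its Series;
# the comprehension then turns every Series into a NumPy array.
feature_dict = {key: np.array(value) for key, value in dict(features).items()}

first_key, first_value = list(feature_dict.items())[0]
print(first_key, type(first_value), first_value)
```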

``````_ = linear_regressor.train(
    input_fn=lambda: my_input_fn(my_feature, targets),
    steps=100
)

prediction_input_fn = lambda: my_input_fn(my_feature, targets, num_epochs=1, shuffle=False)
predictions = linear_regressor.predict(input_fn=prediction_input_fn)
predictions = np.array([item['predictions'][0] for item in predictions])

# Print Mean Squared Error and Root Mean Squared Error.
mean_squared_error = metrics.mean_squared_error(predictions, targets)
root_mean_squared_error = math.sqrt(mean_squared_error)
print("Mean Squared Error (on training data): %0.3f" % mean_squared_error)
print("Root Mean Squared Error (on training data): %0.3f" % root_mean_squared_error)
``````
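To make the metric concrete, here is RMSE computed by hand with NumPy on made-up numbers (both arrays are illustrative, not model output). RMSE is expressed in the same units as the target, which is why it is easier to interpret than MSE.

```python
import math
import numpy as np

# Illustrative values only.
targets = np.array([100.0, 200.0, 300.0])
predictions = np.array([110.0, 190.0, 330.0])

# Square the per-example errors, average them, then take the square root.
errors = predictions - targets
mean_squared_error = np.mean(errors ** 2)
root_mean_squared_error = math.sqrt(mean_squared_error)
print("MSE: %0.3f, RMSE: %0.3f" % (mean_squared_error, root_mean_squared_error))
```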

``````sample = california_housing_dataframe.sample(n=300)
# Get the min and max total_rooms values.
x_0 = sample["total_rooms"].min()
x_1 = sample["total_rooms"].max()

# Retrieve the final weight and bias generated during training.
weight = linear_regressor.get_variable_value('linear/linear_model/total_rooms/weights')[0]
bias = linear_regressor.get_variable_value('linear/linear_model/bias_weights')

# Get the predicted median_house_values for the min and max total_rooms values.
y_0 = weight * x_0 + bias
y_1 = weight * x_1 + bias

# Plot our regression line from (x_0, y_0) to (x_1, y_1).
plt.plot([x_0, x_1], [y_0, y_1], c='r')

# Label the graph axes.
plt.ylabel("median_house_value")
plt.xlabel("total_rooms")

# Plot a scatter plot from our data sample.
plt.scatter(sample["total_rooms"], sample["median_house_value"])

# Display graph.
plt.show()
``````