DAY 10
AI & Data

## Policy Iteration

(Adapted from Sutton's book)

• Step 1 - Initialization
Set up the actions, the state-value function, the reward function, the transition matrix, and so on, along with the parameters the algorithm will use.
```python
import numpy as np

## environment setting
# action
BestAction = np.random.randint(0, 4, 16)   # random initial policy: one action per state
ProbAction = np.zeros([16, 4])
ProbAction[1:15, :] = 0.25                 # uniform action probabilities for non-terminal states
# value function
FuncValue = np.zeros(16)
# reward function
FuncReward = np.full(16, -1)               # -1 per step
FuncReward[0] = 0                          # terminal states
FuncReward[15] = 0
# transition matrix T (construction omitted)

# parameters
gamma = 0.99            # discount factor
theta = 0.05            # convergence threshold for policy evaluation
counter_total = 0
PolicyStable = False
```
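The transition matrix `T` itself is not shown above. As a point of reference, here is one plausible way to build it for a 4x4 gridworld with deterministic moves, indexed as `T[next_state, state, action]` so that `T[:, state, action]` is a distribution over next states (the function name and the action ordering 0=up, 1=down, 2=left, 3=right are my assumptions, not from the original post):

```python
import numpy as np

def build_transition_matrix():
    # T[next_state, state, action] on a 4x4 grid; deterministic moves.
    # Assumed action order: 0=up, 1=down, 2=left, 3=right.
    T = np.zeros([16, 16, 4])
    for s in range(16):
        r, c = divmod(s, 4)
        moves = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
        for a, (nr, nc) in enumerate(moves):
            if 0 <= nr < 4 and 0 <= nc < 4:
                ns = nr * 4 + nc
            else:
                ns = s  # moving into a wall leaves the agent in place
            T[ns, s, a] = 1.0
    # terminal states (0 and 15) are absorbing
    for t in (0, 15):
        T[:, t, :] = 0.0
        T[t, t, :] = 1.0
    return T
```

Every column `T[:, s, a]` sums to 1, as a transition distribution must.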
• Step 2 - Policy Evaluation
This is the state-value computation from earlier; we wrap it in a function so it can be reused later.
```python
def PolicyEvalution(func_value, best_action, func_reward, trans_mat, gamma):
    func_value_now = func_value.copy()
    for state in range(1, 15):  # skip the two terminal states (0 and 15)
        # distribution over next states under the current greedy action
        next_state = trans_mat[:, state, best_action[state]]
        future_reward = func_reward + func_value_now * gamma
        # Bellman backup: expected reward plus discounted value of the next state
        func_value[state] = np.sum(next_state * future_reward)
    delta = np.max(np.abs(func_value - func_value_now))
    return func_value, delta
```
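Each sweep performs the backup V(s) ← Σ_next P(next | s, π(s)) · [R(next) + γ·V(next)]. A minimal standalone check of one sweep on a toy 3-state chain (my own made-up example, not the gridworld) shows the mechanics; note that because the post's convention uses the reward of the *next* state, a state whose successor is terminal gets value 0, which is exactly why 0 appears next to the terminals in the final table:

```python
import numpy as np

# Toy chain: state 0 -> 1 -> 2; state 2 is absorbing.
trans = np.array([[0., 0., 0.],    # trans[next_state, state]
                  [1., 0., 0.],
                  [0., 1., 1.]])
reward = np.array([-1., -1., 0.])  # reward attached to the *next* state
gamma = 0.99
V = np.zeros(3)

# one evaluation sweep, same form as PolicyEvalution above
V_old = V.copy()
for s in range(2):                 # skip the terminal state
    V[s] = np.sum(trans[:, s] * (reward + gamma * V_old))
print(V)  # -> [-1.  0.  0.]
```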
• Step 3 - Policy Improvement
Pick the best action in each state; again, we define this as a function for convenience.
```python
def PolicyImprovement(func_value, best_action, prob_action, func_reward, trans_mat, gamma):
    policy_stable = False
    best_action_now = best_action.copy()
    for state in range(1, 15):  # skip terminal states
        # next-state distribution for every action in this state
        prob_next_state = prob_action[state] * trans_mat[:, state, :]
        future_reward = func_reward + func_value * gamma
        # greedy improvement: pick the action with the largest expected return
        best_action[state] = np.argmax(np.matmul(np.transpose(prob_next_state), future_reward))
    if np.all(best_action == best_action_now):
        policy_stable = True
    return best_action, policy_stable
```
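The improvement step is just a greedy argmax over per-action expected returns, q(s, a) = Σ_next P(next | s, a) · [R(next) + γ·V(next)]. A standalone illustration for a single state with two actions (made-up numbers, hypothetical names):

```python
import numpy as np

# trans_sa[next_state, action]: action 0 goes to state 0, action 1 to state 2.
trans_sa = np.array([[1.0, 0.0],
                     [0.0, 0.0],
                     [0.0, 1.0]])
reward = np.array([-1., -1., 0.])
gamma = 0.99
V = np.array([-5., -1., 0.])

future_reward = reward + gamma * V
q = np.matmul(np.transpose(trans_sa), future_reward)  # q-value per action
best = int(np.argmax(q))
print(q, best)  # -> [-5.95  0.  ] 1
```

Action 1 wins because it reaches the higher-valued successor; this is the same matmul-then-argmax pattern used in `PolicyImprovement`.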
• Run in a while loop
Repeat Steps 2 and 3 above until the best action no longer changes (tracked by the variable PolicyStable in the code).
```python
import os
import time
import numpy as np

def main():
    ## environment setting
    # action
    BestAction = np.random.randint(0, 4, 16)
    ProbAction = np.zeros([16, 4])
    ProbAction[1:15, :] = 0.25
    # value function
    FuncValue = np.zeros(16)
    # reward function
    FuncReward = np.full(16, -1)
    FuncReward[0] = 0
    FuncReward[15] = 0
    # transition matrix T (construction omitted)

    # parameters
    gamma = 0.99
    theta = 0.05
    counter_total = 0
    PolicyStable = False

    # iteration
    while not PolicyStable:
        delta = theta + 0.001  # force at least one evaluation sweep
        counter = 1
        counter_total += 1
        while delta > theta:
            FuncValue, delta = PolicyEvalution(FuncValue, BestAction, FuncReward, T, gamma)
            counter += 1
            os.system('cls' if os.name == 'nt' else 'clear')
            ShowValue(delta, theta, gamma, counter_total, counter, FuncValue)  # display helper
            time.sleep(2)
        BestAction, PolicyStable = PolicyImprovement(FuncValue, BestAction, ProbAction, FuncReward, T, gamma)
        PolicyString = ShowPolicy(counter_total, BestAction)  # display helper
        time.sleep(2)

    os.system('cls' if os.name == 'nt' else 'clear')
    print('=' * 60)
    print('Final Result')
    print('=' * 60)
    print('[State-value]')
    print(FuncValue.reshape(4, 4))
    print('=' * 60)
    print('[Policy]')
    print(PolicyString.reshape(4, 4))
    print('=' * 60)
```

``````============================================================
Final Result
============================================================
[State-value]
[[ 0.    0.   -1.   -1.99]
[ 0.   -1.   -1.99 -1.  ]
[-1.   -1.99 -1.    0.  ]
[-1.99 -1.    0.    0.  ]]
============================================================
[Policy]
[['*' '<' '<' '<']
['^' '^' '^' 'v']
['^' '^' 'v' 'v']
['^' '>' '>' '*']]
============================================================
``````