
Day 8: How to Implement Q-learning

Today we are going to look at how to implement Q-learning!
The code is based on the notebook Q* Learning with FrozenLakev2.ipynb.

Setup

We use Colab as our platform and OpenAI Gym for the environment.
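Gym is usually preinstalled on Colab; if it is missing, one install cell is enough. The exact version pin below is only an assumption, chosen so that FrozenLake-v0 and the old reset/step return values used in this post are still available:

!pip install gym==0.17.3   # assumed version; any pre-0.26 release keeps the old API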

FrozenLake

Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you'll fall into the freezing water. At this time, there's an international frisbee shortage, so it's absolutely imperative that you navigate across the lake and retrieve the disc. However, the ice is slippery, so you won't always move in the direction you intend.
Excerpted from https://gym.openai.com/envs/FrozenLake-v0/

In short, the player wins by reaching the goal.

Map

A 4×4 grid

S is the starting point, F is frozen surface that can be walked on, H is a hole (stepping on one ends the episode), and G is the goal (reaching it wins the game).
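For reference, the default 4×4 layout that FrozenLake-v0 generates (you can print it yourself with env.render()) is:

SFFF
FHFH
FFFH
HFFG

Gym numbers these cells 0 to 15 from left to right, top to bottom, and that index is the state the agent observes.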

Actions

Up, down, left, right

States

There are 16 cells in total, so there are 16 states.

Q table

Since the 4×4 map gives 16 states and each state allows 4 actions, the Q-table has
$16 \times 4 = 64$ entries.

Initialization

First create a Q-table filled with zeros.
env.action_space.n and env.observation_space.n give the number of columns (actions) and rows (states) of the Q-table.

import numpy as np
import gym
import random
env = gym.make("FrozenLake-v0")
action_size = env.action_space.n
state_size = env.observation_space.n
# Create our Q table with state_size rows and action_size columns (16x4)
qtable = np.zeros((state_size, action_size))
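A quick sanity check on the shape, matching the 16 states × 4 actions worked out above:

print(qtable.shape)  # (16, 4)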

Parameter settings

total_episodes = 20000       # Total episodes
learning_rate = 0.8          # Learning rate
max_steps = 50               # Max steps per episode
gamma = 0.95                 # Discounting rate

# Exploration parameters
epsilon = 1.0                 # Exploration rate
max_epsilon = 1.0             # Exploration probability at start
min_epsilon = 0.01            # Minimum exploration probability 
decay_rate = 0.005            # Exponential decay rate for exploration prob

Algorithm

# List of rewards
rewards = []

# Run every episode until training is finished
for episode in range(total_episodes):
    # Reset the environment
    state = env.reset()
    step = 0
    done = False
    total_rewards = 0
    
    for step in range(max_steps):
        # 3. Choose an action a in the current world state (s)
        ## First we randomize a number
        exp_exp_tradeoff = random.uniform(0, 1)
        
        ## If this number is greater than epsilon --> exploitation (take the biggest Q value for this state)
        if exp_exp_tradeoff > epsilon:
            action = np.argmax(qtable[state,:])
            #print(exp_exp_tradeoff, "action", action)

        # Else doing a random choice --> exploration
        else:
            action = env.action_space.sample()
            #print("action random", action)
            
        
        # Take the action (a) and observe the outcome state(s') and reward (r)
        new_state, reward, done, info = env.step(action)

        # Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
        # qtable[new_state,:] : all the actions we can take from new state
        qtable[state, action] = qtable[state, action] + learning_rate * (reward + gamma * np.max(qtable[new_state, :]) - qtable[state, action])
        total_rewards += reward
        
        # Move on to the new state
        state = new_state
        
        # If done (we fell into a hole or reached the goal): finish the episode
        if done:
            break
        
    # Reduce epsilon (because we need less and less exploration)
    epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode) 
    rewards.append(total_rewards)
    

print ("Score over time: " +  str(sum(rewards)/total_episodes))
print(qtable)

In the action-selection step, when exp_exp_tradeoff is greater than epsilon the action is chosen from the Q-table (exploitation); otherwise a random action is taken (exploration).

The Q-table update is the essence of Q-learning. As a refresher, the update rule is
$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$
where $\alpha$ is the learning rate and $\gamma$ is the discount rate.

Finally, at the end of each episode epsilon is decayed, so the agent explores less and less as training goes on.
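To get a feel for the schedule, here is a small sketch (not from the original post) that evaluates the same decay formula at a few episode counts:

# Sketch: how fast epsilon decays with decay_rate = 0.005
for ep in [0, 100, 500, 1000, 5000]:
    eps = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * ep)
    print(ep, round(eps, 3))   # 1.0, ~0.61, ~0.091, ~0.017, ~0.01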

# Run this line to render the current state of the environment
env.render()
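The original notebook also replays the game with the learned Q-table. A minimal sketch of such an evaluation loop, reusing the env, qtable and max_steps defined above, looks like this:

# Play a few episodes greedily with the learned Q-table (no more exploration)
for episode in range(3):
    state = env.reset()
    print("****** EPISODE", episode, "******")
    for step in range(max_steps):
        # Always take the action with the highest Q value in this state
        action = np.argmax(qtable[state, :])
        new_state, reward, done, info = env.step(action)
        if done:
            env.render()                      # show the final board
            print("Number of steps:", step)
            break
        state = new_state
env.close()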

Results

With the learned Q-table, the agent can reach the goal in as few as 13 steps!

Conclusion

Today we walked through an implementation of Q-learning and saw how each piece works.

References

https://gym.openai.com/envs/FrozenLake-v0/
Q* Learning with FrozenLakev2.ipynb

