[Day 30] Deep Q-Network — Solving Real Problems

Yesterday, we walked through Deep Q-Network with a hands-on tutorial.


Today, let's explore Deep Deterministic Policy Gradient (DDPG).

What is the relation to DQN?

Deep Deterministic Policy Gradient (DDPG) and Deep Q-Network (DQN) are both reinforcement learning algorithms, but they differ in a few key ways:

  • DDPG (Deep Deterministic Policy Gradient) focuses on learning a policy, i.e. a mapping from states to actions. It directly predicts the best action to take in a given state and is suited to problems with continuous action spaces.
  • DQN (Deep Q-Network) focuses on learning a value function. It estimates the value of taking a particular action in a particular state and is typically used for problems with discrete action spaces.

The comparisons below spell out the differences between DQN and DDPG.

Action Space

DDPG is designed for environments with continuous action spaces, where actions are real-valued and can take on a wide range of values.

DQN, in contrast, is designed for discrete action spaces with a finite set of possible actions.
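
For a concrete feel of the difference, here is a minimal sketch using Gym (Pendulum-v1 and CartPole-v1 are just example environments):

import gym

# Pendulum-v1 has a continuous (Box) action space, the kind of problem DDPG targets.
print(gym.make('Pendulum-v1').action_space)   # Box(-2.0, 2.0, (1,), float32)

# CartPole-v1 has a discrete action space, the kind of problem DQN targets.
print(gym.make('CartPole-v1').action_space)   # Discrete(2)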

Exploration

DDPG learns a deterministic policy, so it can struggle in environments that require substantial exploration to discover the optimal policy; in practice, exploration comes from adding noise to the chosen actions.

DQN typically uses an epsilon-greedy exploration strategy, which lets the agent try different actions during training.
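
For contrast, a minimal epsilon-greedy sketch on CartPole-v1 (the Linear layer is a hypothetical stand-in for a trained Q-network); the DDPG-style alternative, adding Ornstein-Uhlenbeck noise to a deterministic action and clipping it, is exactly what select_action and OUNoise implement later in this post:

import random
import gym
import torch

env = gym.make('CartPole-v1')
state = torch.Tensor(env.reset())
epsilon = 0.1
q_net = torch.nn.Linear(env.observation_space.shape[0], env.action_space.n)  # stand-in for a trained Q-network

if random.random() < epsilon:
    action = env.action_space.sample()        # explore: pick a random discrete action
else:
    action = q_net(state).argmax().item()     # exploit: pick the action with the highest Q estimate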

Target Networks

Both DDPG and DQN use target networks to stabilize training.

In DDPG, both the actor (policy) and the critic (value function) have target networks, whereas DQN keeps a single target network for the Q-value function.
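
In both cases the target network tracks the online network slowly. The soft (Polyak) update, which the soft_update helper later in this post implements, is

$$\theta_{\text{target}} \leftarrow \tau\,\theta + (1 - \tau)\,\theta_{\text{target}}$$

where $\tau$ is a small constant (e.g. 0.002 in the training code below).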

Experience Replay

Both algorithms typically use experience replay, storing past experiences (state, action, reward, next state) and sampling from them, which breaks the temporal correlation between samples and improves learning stability.

Deep Deterministic Policy Gradient

Deep Deterministic Policy Gradient (DDPG) is a reinforcement learning algorithm aimed primarily at problems with continuous action spaces. In brief:

DDPG uses deep neural networks to approximate and optimize both the policy and the value function, which lets it handle high-dimensional state and action spaces.

Unlike many other reinforcement learning methods, DDPG learns a deterministic policy: for a given state it directly predicts the action to take, rather than producing a probability distribution over actions.

DDPG is a policy-gradient method: it maximizes the expected cumulative return by gradient ascent, adjusting the policy to make high long-term returns as likely as possible. The gradients are used during training to update the parameters of the policy network so that it adapts better to the environment.

In addition, DDPG uses a value function (the critic) to estimate the expected return of each state-action pair, which is what lets it judge how good the current policy is.
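
Written out, the standard DDPG objectives (which the update_parameters method below follows) are: the critic regresses onto a TD target computed with the target networks, and the actor maximizes the critic's estimate of its own actions:

$$y = r + \gamma\,(1 - d)\,Q_{\theta^-}\!\big(s',\, \mu_{\phi^-}(s')\big),\qquad
L_{\text{critic}} = \big(Q_\theta(s, a) - y\big)^2,\qquad
L_{\text{actor}} = -\,Q_\theta\!\big(s,\, \mu_\phi(s)\big)$$

where $\mu_\phi$ is the actor, $Q_\theta$ is the critic, $\phi^-$ and $\theta^-$ are the target-network parameters, and $d$ indicates a terminal transition.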

Implementation

Import Library

import sys
import gym
import numpy as np
import os
import time
import random
from collections import namedtuple
import torch
import torch.nn as nn
from torch.optim import Adam
from torch.autograd import Variable
import torch.nn.functional as F
from torch.utils.tensorboard import SummaryWriter

Define the TensorBoard writer, update helpers, and transition tuple

writer = SummaryWriter("./tb_record_1")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def soft_update(target, source, tau):
    for target_param, param in zip(target.parameters(), source.parameters()):
        target_param.data.copy_(target_param.data * (1.0 - tau) + param.data * tau)

def hard_update(target, source):
    for target_param, param in zip(target.parameters(), source.parameters()):
        target_param.data.copy_(param.data)

Transition = namedtuple(
    'Transition', ('state', 'action', 'mask', 'next_state', 'reward'))

ReplayMemory

class ReplayMemory(object):

    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = []
        self.position = 0

    def push(self, *args):
        if len(self.memory) < self.capacity:
            self.memory.append(None)
        self.memory[self.position] = Transition(*args)
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
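
A quick usage check of the buffer (a minimal sketch with dummy values; note that the push order must match the Transition fields: state, action, mask, next_state, reward):

memory = ReplayMemory(capacity=1000)
memory.push(np.zeros(3), np.zeros(1), False, np.zeros(3), 0.0)
print(len(memory))        # 1
print(memory.sample(1))   # [Transition(state=..., action=..., mask=False, next_state=..., reward=0.0)]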

Ornstein-Uhlenbeck Noise

class OUNoise:

    def __init__(self, action_dimension, scale=0.1, mu=0, theta=0.15, sigma=0.2):
        self.action_dimension = action_dimension
        self.scale = scale
        self.mu = mu
        self.theta = theta
        self.sigma = sigma
        self.state = np.ones(self.action_dimension) * self.mu
        self.reset()

    def reset(self):
        self.state = np.ones(self.action_dimension) * self.mu

    def noise(self):
        x = self.state
        dx = self.theta * (self.mu - x) + self.sigma * np.random.randn(len(x))
        self.state = x + dx
        return self.state * self.scale    

This class generates what is known as Ornstein-Uhlenbeck noise.

This kind of noise is commonly added during policy exploration in reinforcement learning to inject randomness and help the agent learn a better policy.
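
Concretely, the noise() method implements the discretized Ornstein-Uhlenbeck update

$$x_{t+1} = x_t + \theta\,(\mu - x_t) + \sigma\,\varepsilon_t,\qquad \varepsilon_t \sim \mathcal{N}(0, I)$$

and returns the state scaled by scale. The mean-reverting term $\theta(\mu - x_t)$ keeps the noise temporally correlated yet bounded around $\mu$, which tends to suit physical control tasks better than independent Gaussian noise.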

Actor

class Actor(nn.Module):
    def __init__(self, hidden_size, num_inputs, action_space):
        super(Actor, self).__init__()
        self.action_space = action_space
        num_outputs = action_space.shape[0]

        # Actor network: maps a state to a continuous action in [-1, 1]
        # (note: the hidden_size argument is unused; layer widths are hardcoded to 400 and 300)
        self.actor_layer = nn.Sequential(
            nn.Linear(num_inputs, 400, device=device),
            nn.ReLU(),
            nn.Linear(400, 300, device=device),
            nn.ReLU(),
            nn.Linear(300, num_outputs, device=device),
            nn.Tanh()
        )

        '''
        Actor network structure (for Pendulum-v1: 3-dim state, 1-dim action)
        Layer (type)               Output Shape         Param #
        =========================================================
        Linear-1                   [-1, 400]            1,600
        ReLU-2                     [-1, 400]            0
        Linear-3                   [-1, 300]            120,300
        ReLU-4                     [-1, 300]            0
        Linear-5                   [-1, 1]              301
        Tanh-6                     [-1, 1]              0
        =========================================================
        '''

        
    def forward(self, inputs):
      
        # Forward pass of the actor network

        out = self.actor_layer(inputs)
        
        return out

Here we define a neural network made of several linear layers and activation functions (ReLU and Tanh) that approximates the policy (the actor).

The actor network takes the environment state (inputs) as input and outputs continuous action values.
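
A quick shape check (a minimal sketch, assuming the Pendulum-v1 environment that the training code below uses):

env = gym.make('Pendulum-v1')
actor = Actor(hidden_size=128, num_inputs=env.observation_space.shape[0], action_space=env.action_space)

state = torch.rand(1, env.observation_space.shape[0], device=device)
print(actor(state).shape)   # torch.Size([1, 1]): a single continuous action, squashed into [-1, 1] by Tanh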

Critic

class Critic(nn.Module):
    def __init__(self, hidden_size, num_inputs, action_space):
        super(Critic, self).__init__()
        self.action_space = action_space
        num_outputs = action_space.shape[0]

        # Critic network: maps a (state, action) pair to a scalar Q-value

        # Shared layer: state
        self.state_layer = nn.Sequential(
            nn.Linear(num_inputs, 400, device=device),
            nn.ReLU(),
        )

        # Shared layer: state and action
        self.shared_layer = nn.Sequential(
            nn.Linear(num_outputs + 400, 300, device=device),
            nn.ReLU(),
            nn.Linear(300, 1, device=device),
        )

        '''
        Critic network structure (for Pendulum-v1: 3-dim state, 1-dim action;
        the action is concatenated with the 400-dim state features)
        Layer (type)               Output Shape         Param #
        =========================================================
        Linear-1                   [-1, 400]            1,600
        ReLU-2                     [-1, 400]            0
        Linear-3                   [-1, 300]            120,600
        ReLU-4                     [-1, 300]            0
        Linear-5                   [-1, 1]              301
        =========================================================
        '''


    def forward(self, inputs, actions):

        # Forward pass of the critic network: encode the state, then score the (state, action) pair
        
        out = self.state_layer(inputs)
        out = self.shared_layer(torch.cat([out, actions], dim=1))
        
        return out

Here we define a neural network made of several linear layers and ReLU activations that approximates the value function (the critic).

The critic network takes the environment state (inputs) and the actor's action (actions) as input and outputs an estimated value that reflects how good that state-action pair is.
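
And the same kind of check for the critic (same Pendulum-v1 assumption):

env = gym.make('Pendulum-v1')
critic = Critic(hidden_size=128, num_inputs=env.observation_space.shape[0], action_space=env.action_space)

states = torch.rand(64, env.observation_space.shape[0], device=device)
actions = torch.rand(64, env.action_space.shape[0], device=device)
print(critic(states, actions).shape)   # torch.Size([64, 1]): one Q-value per (state, action) pair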

DDPG

class DDPG(object):
    def __init__(self, num_inputs, action_space, gamma=0.995, tau=0.0005, hidden_size=128, lr_a=1e-4, lr_c=1e-3):

        self.num_inputs = num_inputs
        self.action_space = action_space

        self.actor = Actor(hidden_size, self.num_inputs, self.action_space)
        self.actor_target = Actor(hidden_size, self.num_inputs, self.action_space)
        self.actor_perturbed = Actor(hidden_size, self.num_inputs, self.action_space)
        self.actor_optim = Adam(self.actor.parameters(), lr=lr_a)

        self.critic = Critic(hidden_size, self.num_inputs, self.action_space)
        self.critic_target = Critic(hidden_size, self.num_inputs, self.action_space)
        self.critic_optim = Adam(self.critic.parameters(), lr=lr_c)

        self.gamma = gamma
        self.tau = tau

        hard_update(self.actor_target, self.actor) 
        hard_update(self.critic_target, self.critic)


    def select_action(self, state, action_noise=None):
        self.actor.eval()
        mu = self.actor((Variable(state.to(device))))
        mu = mu.data

        self.actor.train()

        # add exploration noise to the action (cast to the action's dtype and device)
        if action_noise is not None:
            mu += torch.tensor(action_noise, dtype=mu.dtype, device=device)

        # clip action, set action between -1 and 1
        return torch.clamp(mu, -1, 1).cpu()


    def update_parameters(self, batch):
        state_batch = Variable(batch.state)
        action_batch = Variable(batch.action)
        reward_batch = Variable(batch.reward)
        mask_batch = Variable(batch.mask)
        next_state_batch = Variable(batch.next_state)

        # Calculate policy loss and value loss
        # Update the actor and the critic

        # predict next action and Q-value in next state
        next_action_batch = self.actor_target(next_state_batch)
        next_state_action_values = self.critic_target(next_state_batch, next_action_batch)

        # compute the TD target; zero out the bootstrapped term when the next state is terminal
        Q_targets = reward_batch + (self.gamma * next_state_action_values * (1 - mask_batch))

        # predict Q-value in current state
        state_action_batch = self.critic(state_batch, action_batch)
        
        # compute critic loss (MSE loss)
        value_loss = F.mse_loss(state_action_batch, Q_targets)

        self.critic_optim.zero_grad()
        value_loss.backward()
        self.critic_optim.step()

        # predict action in current state
        actions_pred = self.actor(state_batch)
        
        # compute actor loss (policy gradient)
        policy_loss = -self.critic(state_batch, actions_pred).mean()

        self.actor_optim.zero_grad()
        policy_loss.backward()
        self.actor_optim.step()

        soft_update(self.actor_target, self.actor, self.tau)
        soft_update(self.critic_target, self.critic, self.tau)

        return value_loss.item(), policy_loss.item()


    def save_model(self, env_name, suffix="", actor_path=None, critic_path=None):
        local_time = time.localtime()
        timestamp = time.strftime("%m%d%Y_%H%M%S", local_time)
        if not os.path.exists('preTrained/'):
            os.makedirs('preTrained/')

        if actor_path is None:
            actor_path = "preTrained/ddpg_actor_{}_{}_{}".format(env_name, timestamp, suffix) 
        if critic_path is None:
            critic_path = "preTrained/ddpg_critic_{}_{}_{}".format(env_name, timestamp, suffix) 
        print('Saving models to {} and {}'.format(actor_path, critic_path))
        torch.save(self.actor.state_dict(), actor_path)
        torch.save(self.critic.state_dict(), critic_path)

    def load_model(self, actor_path, critic_path):
        print('Loading models from {} and {}'.format(actor_path, critic_path))
        if actor_path is not None:
            self.actor.load_state_dict(torch.load(actor_path))
        if critic_path is not None: 
            self.critic.load_state_dict(torch.load(critic_path))

This is the DDPG agent itself, built for reinforcement learning problems with continuous action spaces.

It contains the two neural networks, the actor and the critic, which approximate the policy and the value function.

DDPG trains the agent by repeatedly selecting actions and optimizing the policy to maximize the expected cumulative return.

It also handles noise injection for exploration and saving/loading of model parameters, which makes it a complete deep reinforcement learning building block.
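
Before wiring everything into the training loop, a quick check of the agent's interface (a minimal sketch, assuming Pendulum-v1 and the classic Gym API used throughout this post, where env.reset() returns the observation):

env = gym.make('Pendulum-v1')
agent = DDPG(env.observation_space.shape[0], env.action_space)
noise = OUNoise(env.action_space.shape[0])

state = torch.Tensor(env.reset())
print(agent.select_action(state))                 # deterministic action in [-1, 1]
print(agent.select_action(state, noise.noise()))  # the same action with exploration noise added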

Train the DDPG

def train():    
    num_episodes = 200
    gamma = 0.995
    tau = 0.002
    hidden_size = 128
    noise_scale = 0.3
    replay_size = 100000
    batch_size = 128
    updates_per_step = 1
    print_freq = 1
    ewma_reward = 0
    rewards = []
    ewma_reward_history = []
    total_numsteps = 0
    updates = 0

    
    agent = DDPG(env.observation_space.shape[0], env.action_space, gamma, tau, hidden_size)
    ounoise = OUNoise(env.action_space.shape[0])
    memory = ReplayMemory(replay_size)
    
    for i_episode in range(num_episodes):
        
        ounoise.scale = noise_scale
        ounoise.reset()
        
        state = torch.Tensor(env.reset())

        episode_reward = 0
        value_loss, policy_loss = 0, 0
        while True:
            # 1. Interact with the env to get new (s,a,r,s') samples 
            # 2. Push the sample to the replay buffer
            # 3. Update the actor and the critic

            # select action and interact with the environment
            # add noise to action for exploration
            action = agent.select_action(state, ounoise.noise() * noise_scale)
            next_state, reward, done, _ = env.step(action.numpy())
            
            # add sample to replay buffer
            # convert to numpy array, since replay buffer only accepts numpy array
            memory.push(state.numpy(), action.numpy(), done, next_state, reward)

            # update the actor and the critic
            if len(memory) > batch_size:
                experiences_batch = memory.sample(batch_size)

                # convert to Transition object
                # Since the replay buffer stores numpy array, we need to convert them to torch tensor
                # and move them to GPU
                experiences_batch = Transition(state=torch.from_numpy(np.vstack([i.state for i in experiences_batch])).to(torch.float32).to(device),
                                               action=torch.from_numpy(np.vstack([i.action for i in experiences_batch])).to(torch.float32).to(device),
                                               mask=torch.from_numpy(np.vstack([i.mask for i in experiences_batch])).to(torch.uint8).to(device),
                                               next_state=torch.from_numpy(np.vstack([i.next_state for i in experiences_batch])).to(torch.float32).to(device),
                                               reward=torch.from_numpy(np.vstack([i.reward for i in experiences_batch])).to(torch.float32).to(device))
                
                # update the actor and the critic
                value_loss, policy_loss = agent.update_parameters(experiences_batch)
            
            # update the state
            state = torch.Tensor(next_state).clone()
            episode_reward += reward

            if done:
                break           

        rewards.append(episode_reward)
        t = 0
        if i_episode % print_freq == 0:
            state = torch.Tensor([env.reset()])
            episode_reward = 0
            while True:
                action = agent.select_action(state)

                next_state, reward, done, _ = env.step(action.numpy()[0])
                
                env.render()
                
                episode_reward += reward

                next_state = torch.Tensor([next_state])

                state = next_state
                
                t += 1
                if done:
                    break

            rewards.append(episode_reward)
            # update EWMA reward and log the results
            ewma_reward = 0.05 * episode_reward + (1 - 0.05) * ewma_reward
            ewma_reward_history.append(ewma_reward)           
            print("Episode: {}, length: {}, reward: {:.2f}, ewma reward: {:.2f}".format(i_episode, t, rewards[-1], ewma_reward))

            # write results to tensorboard
            writer.add_scalar('Reward/ewma', ewma_reward, i_episode)
            writer.add_scalar('Reward/ep_reward', rewards[-1], i_episode)
            writer.add_scalar('Loss/value', value_loss, i_episode)
            writer.add_scalar('Loss/policy', policy_loss, i_episode)

Here we implement the training procedure, using DDPG to train an agent to solve the reinforcement learning problem.

  1. Set the hyperparameters and initialize the environment, agent, noise process, and other components (where env actually comes from is shown in the entry-point sketch below).
  2. Start the training loop; each iteration corresponds to one training episode.
  3. In each episode, the agent interacts with the environment according to the current policy, selecting actions and receiving feedback (rewards).
  4. Store each interaction (state, action, reward, next_state, done) in the replay memory.
  5. Once the replay memory holds enough samples, randomly draw a batch from it.
  6. Use that batch to update the parameters of the actor and critic networks so as to maximize the expected total return.
  7. Record the reward of each episode, compute its exponentially weighted moving average (EWMA), and write the results to TensorBoard for visualization.
  8. Every few episodes, run the current agent in the environment and render it to observe its behavior.
  9. Repeat until the specified number of episodes (num_episodes) is reached.

Finally, at the end of train() we save the trained agent model:

    agent.save_model(env_name='Pendulum-v1', suffix="DDPG")
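
The snippets above never show where env is created; train() reads it as a module-level variable. A minimal entry point, assuming everything lives in one script and the classic Gym API, could be:

env = gym.make('Pendulum-v1')   # matches the env_name passed to save_model
train()
writer.close()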


For the complete code, see DDPG.


