Natural Language Understanding

2018 iT 邦幫忙 Ironman Contest

DAY 19

AI & Machine Learning

Exploring the Microsoft CNTK Machine Learning Toolkit series, part 19

HO-HSUN

2018-01-07 23:01:18

Introduction

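Natural language understanding (NLU) is a complex field. Every stage is hard: reducing text to a lower-dimensional representation, extracting keywords, and finally understanding what the words mean.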

The ATIS (Air Travel Information Services) dataset is used here for slot tagging: an LSTM network is trained to assign a slot label to each word of a query.

A recurrent neural network (RNN) built from LSTM cells learns this kind of sequence task well.

Word embedding represents each word as a vector, which makes the words easier to analyze and process.
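
To make the idea concrete, here is a minimal NumPy sketch of an embedding lookup. The toy sizes and the random matrix `E` are assumptions for illustration only; in the model later in this article the matrix is learned and has 943 rows and 150 columns.

```python
import numpy as np

# Hypothetical toy sizes (the article's model uses 943 and 150).
vocab_size, emb_dim = 5, 3

rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, emb_dim))  # embedding matrix, one row per word

word_index = 2
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

# Multiplying a one-hot vector by E selects the matching row of E.
embedded = one_hot @ E
print(np.allclose(embedded, E[word_index]))  # True
```

A linear embedding layer is therefore equivalent to a table lookup: each word index picks out one dense vector.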

Tasks

Import the libraries and initialize CNTK.

from __future__ import print_function
import requests
import os

import math
import numpy as np

import cntk as C
import cntk.tests.test_utils
cntk.tests.test_utils.set_device_from_pytest_env()
C.cntk_py.set_fixed_random_seed(1)

1. Data reading:

Download the ATIS dataset.

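def download(url, filename):
    """Download url and save it to filename."""
    response = requests.get(url, stream=True)
    response.raise_for_status()  # fail early instead of saving an error page
    with open(filename, "wb") as handle:
        # Read in 8 KiB chunks; the default chunk size is a single byte
        for data in response.iter_content(chunk_size=8192):
            handle.write(data)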

locations = ['Tutorials/SLUHandsOn', 'Examples/LanguageUnderstanding/ATIS/BrainScript']

data = {
  'train': { 'file': 'atis.train.ctf', 'location': 0 },
  'test': { 'file': 'atis.test.ctf', 'location': 0 },
  'query': { 'file': 'query.wl', 'location': 1 },
  'slots': { 'file': 'slots.wl', 'location': 1 },
  'intent': { 'file': 'intent.wl', 'location': 1 }  
}

for item in data.values():
    location = locations[item['location']]
    path = os.path.join('..', location, item['file'])
    if os.path.exists(path):
        print("Reusing locally cached:", item['file'])
        # Update path
        item['file'] = path
    elif os.path.exists(item['file']):
        print("Reusing locally cached:", item['file'])
    else:
        print("Starting download:", item['file'])
        url = "https://github.com/Microsoft/CNTK/blob/release/2.3.1/%s/%s?raw=true"%(location, item['file'])
        download(url, item['file'])
        print("Download completed")

2. Data preprocessing:

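The ATIS dataset contains air-travel queries, such as requests to show flights. The model is trained to tag each word with the slot label it belongs to.

Each row holds up to seven fields separated by the | character. The goal of training is to predict the S2 slot label from the S0 field.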

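sequence id: 19 in the sample below. It identifies the data sample; rows that share an id belong to the same sample.
column S0: the word, stored as an index into the 943-word vocabulary.
#: comment. BOS marks the beginning of a sample and EOS marks its end.
column S1: the intent label.
#: comment describing the intent label.
column S2: the slot label.
#: comment describing the slot label. O means no label, B- marks the beginning of a slot, and I- continues the slot of the previous word.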
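
The O / B- / I- scheme in the last item can be sketched in plain Python. The helper `group_slots` is hypothetical (it is not part of CNTK); it merges each B- tag with the I- tags that follow it, using the words and tags of the sample sentence from atis.test.ctf.

```python
def group_slots(words, tags):
    """Collect (slot_name, phrase) pairs from parallel word and tag lists."""
    slots = []
    current_name, current_words = None, []
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # A new slot begins; store the one in progress, if any.
            if current_name is not None:
                slots.append((current_name, " ".join(current_words)))
            current_name, current_words = tag[2:], [word]
        elif tag.startswith("I-") and current_name == tag[2:]:
            # Continuation of the slot opened by the previous B- tag.
            current_words.append(word)
        else:
            # "O" (or a stray I- tag) closes whatever slot is open.
            if current_name is not None:
                slots.append((current_name, " ".join(current_words)))
            current_name, current_words = None, []
    if current_name is not None:
        slots.append((current_name, " ".join(current_words)))
    return slots


words = ["BOS", "show", "flights", "from", "burbank", "to",
         "st.", "louis", "on", "monday", "EOS"]
tags = ["O", "O", "O", "O", "B-fromloc.city_name", "O",
        "B-toloc.city_name", "I-toloc.city_name", "O",
        "B-depart_date.day_name", "O"]

slots = group_slots(words, tags)
for name, phrase in slots:
    print(name, "=", phrase)
```

Here "st." and "louis" end up in a single toloc.city_name slot because the second word carries an I- tag.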

A data sample from atis.test.ctf:
19 |S0 178:1 |# BOS |S1 14:1 |# flight |S2 128:1 |# O
19 |S0 770:1 |# show |S2 128:1 |# O
19 |S0 429:1 |# flights |S2 128:1 |# O
19 |S0 444:1 |# from |S2 128:1 |# O
19 |S0 272:1 |# burbank |S2 48:1 |# B-fromloc.city_name
19 |S0 851:1 |# to |S2 128:1 |# O
19 |S0 789:1 |# st. |S2 78:1 |# B-toloc.city_name
19 |S0 564:1 |# louis |S2 125:1 |# I-toloc.city_name
19 |S0 654:1 |# on |S2 128:1 |# O
19 |S0 601:1 |# monday |S2 26:1 |# B-depart_date.day_name
19 |S0 179:1 |# EOS |S2 128:1 |# O
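
To show how one of these rows breaks down, here is a small hand-written parser. It is only an illustration of the layout: `parse_ctf_line` is a hypothetical helper, it assumes each field holds a single `index:value` pair as in the rows above, and it is not how CNTK itself reads the file.

```python
def parse_ctf_line(line):
    """Return (sequence_id, {field_name: (index, comment)}) for one row."""
    parts = [p.strip() for p in line.split("|")]
    seq_id = int(parts[0])
    fields = {}
    last_name = None
    for part in parts[1:]:
        name, _, value = part.partition(" ")
        if name == "#":
            # A comment describes the field immediately before it.
            if last_name is not None:
                index, _ = fields[last_name]
                fields[last_name] = (index, value)
        else:
            # Sparse one-hot entry "index:value", for example "178:1".
            index = int(value.split(":")[0])
            fields[name] = (index, None)
            last_name = name
    return seq_id, fields


line = "19 |S0 272:1 |# burbank |S2 48:1 |# B-fromloc.city_name"
seq_id, fields = parse_ctf_line(line)
print(seq_id, fields["S0"], fields["S2"])
```

For this row the word index is 272 ("burbank") and the slot label index is 48 ("B-fromloc.city_name").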

3. Model creation:

# Number of words in the vocabulary, slot labels, and intent labels
vocab_size = 943 ; num_labels = 129 ; num_intents = 26    

# Model dimensions
input_dim  = vocab_size
label_dim  = num_labels
emb_dim    = 150
hidden_dim = 300

# x: features
# y: labels
x = C.sequence.input_variable(vocab_size)
y = C.sequence.input_variable(num_labels)
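
Each sample fed to x is a variable-length sequence of vocab_size-dimensional vectors, one vector per word. As a rough picture, the sketch below builds the dense one-hot form of the sample sentence in NumPy; the CTF file stores the same information sparsely as index:1 pairs. The array name `sequence` is only for illustration.

```python
import numpy as np

vocab_size = 943
# S0 indices of the sample sentence:
# BOS show flights from burbank to st. louis on monday EOS
word_indices = [178, 770, 429, 444, 272, 851, 789, 564, 654, 601, 179]

# One row per word, one column per vocabulary entry.
sequence = np.zeros((len(word_indices), vocab_size), dtype=np.float32)
sequence[np.arange(len(word_indices)), word_indices] = 1.0

print(sequence.shape)  # (11, 943)
```

The labels for y have the same layout, with num_labels columns instead of vocab_size.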

Define a function that builds the model:
Embedding: linear embedding layer
Recurrence: recurrent layer
Dense: fully connected layer

def create_model():
    with C.layers.default_options(initial_state=0.1):
        return C.layers.Sequential([
            C.layers.Embedding(emb_dim, name='embed'),
            C.layers.Recurrence(C.layers.LSTM(hidden_dim), go_backwards=False),
            C.layers.Dense(num_labels, name='classify')
        ])

Build the model.

z = create_model()
print(z.embed.E.shape)
print(z.classify.b.value)
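
As a sanity check on the layer sizes, the number of trainable parameters can be worked out by hand. This is a sketch under two assumptions: the model has been bound to the 943-dimensional input, and the LSTM is a standard four-gate cell without peephole connections.

```python
vocab_size, num_labels = 943, 129
emb_dim, hidden_dim = 150, 300

# Embedding: one emb_dim vector per vocabulary word.
embedding_params = vocab_size * emb_dim

# LSTM (assumed: 4 gates, no peepholes): input weights, recurrent
# weights, and a bias, each covering all 4 gates.
lstm_params = 4 * hidden_dim * (emb_dim + hidden_dim + 1)

# Dense: weight matrix plus bias.
dense_params = hidden_dim * num_labels + num_labels

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 141450 541200 38829 721479
```

Most of the parameters sit in the recurrent layer, so hidden_dim is the setting with the largest effect on model size.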
