Day 28: An Introduction to Natural Language Processing (NLP)

2019 iT 邦幫忙 Ironman Contest (鐵人賽)

DAY 28

AI & Data

Part 28 of the series 大數據的世代需學會的幾件事 (A Few Things to Learn in the Big Data Era)

queenawu

2018-11-12 23:52:52

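Many online social platforms collect user data and analyze user behavior. Facebook, for example, began doing sentiment analysis a few years ago. Sentiment analysis uses text analysis and natural language processing (NLP) to identify users' opinions and emotions, which can then be used to predict user behavior and inform business decisions. The commercial value of this chain of analysis is considerable. Today's example uses the IMDb online movie review dataset.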

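As explained earlier, a model has to go through extensive training before it can make predictions. As shown in the figure below, during training we collect the features and attributes of both positive and negative examples to build a deep learning model. At prediction time, the new data is preprocessed to extract its features and fed to the trained model. Finally, we compare the predictions with the true values to measure how accurate the finished model is.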
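
The last step of that workflow, comparing predictions with the true labels, comes down to a simple ratio. The sketch below is only an illustration: the function name `accuracy` and the sample labels are made up for this example and do not come from the IMDb model.

```python
def accuracy(predicted, actual):
    """Return the fraction of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


# Hypothetical labels: 1 = positive review, 0 = negative review
true_labels = [1, 1, 0, 0, 1]
predictions = [1, 0, 0, 0, 1]
print(accuracy(predictions, true_labels))  # 4 of the 5 predictions match
```

With a real model, `predictions` would be the model's output on the test reviews and `true_labels` the labels read from the dataset.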

First, get the IMDb dataset this example needs: import the modules, download the archive, check that the file exists, and extract it.

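import os
import tarfile
import urllib.request

url = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
filepath = "IMDb/aclImdb_v1.tar.gz"

# Create the target directory first; urlretrieve does not create it
os.makedirs("IMDb", exist_ok=True)

# Download the archive only if it is not already present
if not os.path.isfile(filepath):
    result = urllib.request.urlretrieve(url, filepath)
    print('downloaded:', result)

# Extract the archive only if it has not been extracted yet
if not os.path.exists("IMDb/aclImdb"):
    with tarfile.open(filepath, 'r:gz') as tfile:
        tfile.extractall('IMDb/')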

Reading the IMDb data

import os
import re

from keras.datasets import imdb
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer

# Remove HTML tags from the review text
def rm_htmltags(text):
    re_tag = re.compile(r'<[^>]+>')
    return re_tag.sub('', text)

# Read every review under IMDb/aclImdb/<filetype>/ (filetype is "train" or "test")
def read_files(filetype):
    path = "IMDb/aclImdb/"
    file_list = []

    # Positive reviews are listed first ...
    positive_path = path + filetype + "/pos/"
    for f in os.listdir(positive_path):
        file_list += [positive_path + f]

    # ... followed by the negative reviews
    negative_path = path + filetype + "/neg/"
    for f in os.listdir(negative_path):
        file_list += [negative_path + f]

    print('read', filetype, 'files:', len(file_list))

    # Labels in the same order as file_list: 1 = positive, 0 = negative
    # (each split holds 12,500 positive and 12,500 negative reviews)
    all_labels = ([1] * 12500 + [0] * 12500)

    all_texts = []
    for fi in file_list:
        with open(fi, encoding='utf8') as file_input:
            all_texts += [rm_htmltags(" ".join(file_input.readlines()))]

    return all_labels, all_texts
    
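# Read the IMDb training directory; train_text is used by the Tokenizer step
# (y_train is an assumed name for the labels, it does not appear in the original)
y_train, train_text = read_files("train")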
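
To see what the tag-stripping step does, here is a small standalone sketch. The sample sentence is made up; `rm_htmltags_spaced` is a hypothetical variant added only for comparison and is not part of the original code.

```python
import re

def rm_htmltags(text):
    """Strip anything that looks like an HTML tag, as the function above does."""
    return re.sub(r'<[^>]+>', '', text)

def rm_htmltags_spaced(text):
    """Hypothetical variant: replace each tag with a space so words stay separated."""
    return re.sub(r'<[^>]+>', ' ', text)

# Made-up review text; IMDb reviews commonly contain <br /> line breaks
sample = "A great film.<br /><br />I loved it."
print(rm_htmltags(sample))
print(rm_htmltags_spaced(sample))
```

Removing a tag outright can fuse the words on either side of it ("film.I"), which may matter once the text is split into tokens. Substituting a space is one possible way around that.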

Building tokens with Tokenizer

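# Read all the training reviews to build the dictionary, limited to the 2,500 most frequent words (num_words=2500)
token = Tokenizer(num_words=2500)
token.fit_on_texts(train_text)
# Inspect the word index attribute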
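
Conceptually, this step builds a dictionary that ranks words by how often they appear and keeps only the most common ones. The pure-Python sketch below imitates that idea. It is not Keras's implementation: `build_word_index`, `texts_to_indices`, and the tiny corpus are hypothetical names and data for illustration, and details such as punctuation filtering or exactly how `num_words` is applied may differ in Keras.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[a-z0-9']+")

def build_word_index(texts, num_words):
    """Map the num_words most frequent words to integer indices starting at 1."""
    counts = Counter()
    for text in texts:
        counts.update(WORD_RE.findall(text.lower()))
    # most_common returns words ordered from most to least frequent
    top_words = counts.most_common(num_words)
    return {word: i for i, (word, _) in enumerate(top_words, start=1)}

def texts_to_indices(texts, word_index):
    """Replace each known word with its index and drop words outside the dictionary."""
    return [
        [word_index[w] for w in WORD_RE.findall(text.lower()) if w in word_index]
        for text in texts
    ]

# Made-up mini corpus standing in for train_text
corpus = ["good movie", "good good plot", "bad movie", "good"]
word_index = build_word_index(corpus, num_words=2)
print(word_index)
print(texts_to_indices(corpus, word_index))
```

Here "good" appears four times and "movie" twice, so with `num_words=2` only those two words get an index. "plot" and "bad" fall outside the dictionary and disappear from the converted sequences, which is the trade-off a dictionary size limit makes.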
