[Day 20] Neural Network: Solving a Real Problem - iT 邦幫忙

2023 iThome Ironman Contest

DAY 20

AI & Data

ML From Scratch series, part 20

15th Ironman Contest, machine learning, python

whoami

2023-09-20 16:43:32

Today we use a neural network to tackle the Kaggle competition Natural Language Processing with Disaster Tweets.

Dataset

`train.csv`

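id: a unique identifier for each tweet
text: the text of the tweet
location: the location the tweet was sent from (may be blank)
keyword: a keyword from the tweet (may be blank)
target: whether the tweet is about a real disaster; 1 means yes, 0 means no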

`test.csv`

id: a unique identifier for each tweet
text: the text of the tweet
location: the location the tweet was sent from (may be blank)
keyword: a keyword from the tweet (may be blank)

`sample_submission.csv`

id: a unique identifier for each tweet
target: whether the tweet is about a real disaster; 1 means yes, 0 means no

Kaggle Notebook

Import Library

import re
import string
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score,classification_report,f1_score
from sklearn.model_selection import train_test_split
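from sklearn.feature_extraction.text import CountVectorizer
from tensorflow import keras                 # keras and layers are used to build the model below
from tensorflow.keras import layers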

import warnings
warnings.filterwarnings('ignore')

Load Dataset

#loading files
train_df=pd.read_csv('../input/nlp-getting-started/train.csv')
test_df=pd.read_csv('../input/nlp-getting-started/test.csv')

Data cleaning

train_df['keyword']=train_df['keyword'].fillna(train_df['keyword'].mode()[0])   #replacing NaN in keyword with mode values
test_df['keyword']=test_df['keyword'].fillna(test_df['keyword'].mode()[0])

train_df['location']=train_df['location'].fillna(train_df['location'].mode()[0])  #replacing NaN in location with mode values
test_df['location']=test_df['location'].fillna(test_df['location'].mode()[0])

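First, we deal with the columns that contain blanks, filling each missing value with that column's most frequent value (the mode).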
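To make the mode-filling step concrete, here is a minimal sketch on a toy frame. The frame `toy_df` and its values are invented for illustration and are not taken from the competition data. It assigns the result back to the column rather than using `inplace=True` on a column selection, which newer pandas versions warn about.

```python
import pandas as pd

# Toy frame standing in for train_df; the values are invented for illustration.
toy_df = pd.DataFrame({"keyword": ["fire", None, "flood", "fire", None]})

# mode() ignores missing values and returns the most frequent entries; [0] takes the first.
most_common = toy_df["keyword"].mode()[0]

# Assign the filled column back instead of modifying a selection in place.
toy_df["keyword"] = toy_df["keyword"].fillna(most_common)

print(toy_df["keyword"].tolist())
```

Filling with the mode keeps every row usable, at the cost of overstating how often the most common keyword or location appears.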

def remove_html(text):
    html=re.compile(r'<.*?>')    
    return html.sub(r'',text)   #removing html texts

Second, we strip HTML tags.

def remove_url(text):
    url=re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'',text)

Next, we remove URLs from the tweets.

def remove_emoji(text):
    emoji_pattern = re.compile('['
                                u"\U0001F600-\U0001F64F"  # emoticons
                                u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                                u"\U0001F680-\U0001F6FF"  # transport & map symbols
                                u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                                u"\U00002702-\U000027B0"
                                u"\U000024C2-\U0001F251"
                                ']+',flags=re.UNICODE)
    return emoji_pattern.sub(r'',text)

Here we remove emoji from the tweets.

def remove_punct(text):
    table = str.maketrans('','',string.punctuation)
    return text.translate(table)

Finally, we remove punctuation from the tweets.
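The post does not show how the cleaning helpers are applied, so the following is only a sketch of one way to chain the same four steps. The function name `clean_tweet`, the shortened emoji pattern, the final whitespace collapse, and the sample tweet are all assumptions made for illustration.

```python
import re
import string

# Shortened emoji pattern (an assumption): emoticons, pictographs, transport symbols, flags.
EMOJI_PATTERN = re.compile(
    "[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF\U0001F1E0-\U0001F1FF]+"
)
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_tweet(text):
    """Strip HTML tags, URLs, emoji, and punctuation, in that order."""
    text = re.sub(r'<.*?>', '', text)                  # HTML tags
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # URLs, while '://' and '.' still exist
    text = EMOJI_PATTERN.sub('', text)                 # emoji
    text = text.translate(PUNCT_TABLE)                 # punctuation
    return ' '.join(text.split())                      # collapse leftover whitespace

sample = "<b>Forest fire</b> 🔥 near La Ronge, Sask. http://t.co/abc123 !!"
cleaned = clean_tweet(sample)
print(cleaned)
```

The order matters: if punctuation were removed first, the URL would lose its `://` and dots and the URL pattern would no longer match it. On a DataFrame the same function could be applied with something like `train_df['text'].apply(clean_tweet)`.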

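# `training` is the feature matrix built from the cleaned tweets (that step is not shown in this post)
X_train,X_test,y_train,y_test=train_test_split(training,train_df['target'])  #creating validation and train set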

This splits the data into a training set and a validation set.
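The step that turns the cleaned text into the numeric matrix being split is not shown in the post. Since `CountVectorizer` and `train_test_split` are both imported, one plausible reading is a bag-of-words matrix followed by a random split. The sketch below follows that assumption; the sample texts, labels, `test_size`, and `random_state` are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Invented stand-ins for the cleaned tweets and their target labels.
texts = ["forest fire near town", "i love this song", "flood warning issued",
         "what a great day", "earthquake hits city", "new phone who dis",
         "wildfire spreads fast", "pizza for dinner"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Bag of words: one row per tweet, one column per vocabulary word, values are counts.
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(texts).toarray()

# Hold out 25% of the rows for validation, keeping the class balance the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=42, stratify=labels)

print(features.shape, X_train.shape, X_test.shape)
```

The number of columns in `features` is the vocabulary size, which is also the width of the input the network has to accept.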

Neural network

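input_shape=X_train.shape[1]   # number of input features (assumed: one column per feature in X_train)
model= keras.Sequential([layers.Dense(units=5,activation='relu',input_shape=[input_shape]),
                        layers.Dense(units=1,activation='sigmoid')])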

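This builds a neural network with the Keras framework. The model stacks two layers, each with its own settings.

The first layer is a hidden layer with 5 neurons and a ReLU activation function.

Its input_shape argument gives the shape of one input sample, which tells Keras the size of the model's input layer.

The second layer is the output layer: a single neuron with a sigmoid activation, which outputs a value between 0 and 1 and so suits binary classification.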
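To show what these two layers compute, here is a minimal NumPy sketch of the forward pass only. It is not how Keras is implemented internally; the weights are random, and the input width `n_features` and the batch `x` are hypothetical values chosen for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Dense(5, relu) followed by Dense(1, sigmoid), for a batch x of shape (n, d)."""
    hidden = relu(x @ W1 + b1)        # shape (n, 5)
    return sigmoid(hidden @ W2 + b2)  # shape (n, 1), each value strictly between 0 and 1

rng = np.random.default_rng(0)
n_features = 4                                    # hypothetical input width
W1, b1 = rng.normal(size=(n_features, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)

x = rng.normal(size=(3, n_features))              # a batch of 3 made-up inputs
probs = forward(x, W1, b1, W2, b2)
n_params = W1.size + b1.size + W2.size + b2.size  # d*5 + 5 + 5 + 1 trainable values
print(probs.shape, n_params)
```

With an input width of d, the model has d*5 + 5 weights and biases in the hidden layer and 5 + 1 in the output layer, so almost all of its parameters sit in the first layer when d is a large vocabulary.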

model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['binary_accuracy'])

This call configures how the model will be trained; the settings are the usual ones for a binary classification problem. Its arguments are:

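optimizer='adam': sets the optimizer, here Adam, the algorithm that updates the weights from the gradients of the loss.

loss='binary_crossentropy': specifies the type of loss function.

Binary cross entropy is the usual loss function for binary classification problems.

It measures the gap between the model's predicted probabilities and the actual targets.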
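As a rough illustration of what this loss measures, the sketch below computes the mean binary cross entropy in NumPy. The function name and the clipping value `eps` are choices made for this example, not taken from the post.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all samples."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip predictions away from 0 and 1 so log() never receives 0.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float(np.mean(losses))

confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])  # about 0.105
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])  # about 2.303
print(confident_right, confident_wrong)
```

Confident correct predictions cost little, while confident wrong ones are penalized heavily, which is what pushes the sigmoid output toward the true label during training.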

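metrics=['binary_accuracy']: defines the evaluation metrics to track during training.

In this example we track binary accuracy, the fraction of samples whose thresholded prediction matches the true label.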
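Binary accuracy can be sketched in a few lines of NumPy. The 0.5 threshold matches the Keras default; the function name and sample values here are invented for illustration.

```python
import numpy as np

def binary_accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of samples whose thresholded prediction equals the true label."""
    predicted_labels = (np.asarray(y_pred) > threshold).astype(int)
    return float(np.mean(predicted_labels == np.asarray(y_true)))

# Predictions 0.8 and 0.3 land on the right side of 0.5; 0.4 and 0.6 do not.
acc = binary_accuracy([1, 0, 1, 0], [0.8, 0.3, 0.4, 0.6])
print(acc)
```

Unlike the loss, this metric ignores how confident a prediction is, so accuracy can stay flat while the loss is still improving.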

X_train, y_train, X_test, y_test = X_train.astype('float64'), y_train.astype('float64'), X_test.astype('float64'), y_test.astype('float64')

history= model.fit(X_train,y_train,
                  validation_data=(X_test,y_test),
                  batch_size=500,
                  epochs=20,
                  )
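To get a feel for what `batch_size=500` and `epochs=20` imply, the sketch below counts weight updates. The sample count of 5200 is a hypothetical figure, not the actual size of the training split.

```python
import math

def updates_per_run(n_samples, batch_size, epochs):
    """Return (batches per epoch, total weight updates) for mini-batch training."""
    # The last batch of an epoch may be smaller than batch_size, hence the ceiling.
    steps_per_epoch = math.ceil(n_samples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs

steps, total = updates_per_run(n_samples=5200, batch_size=500, epochs=20)
print(steps, total)
```

A large batch size means few updates per epoch, so each of the 20 epochs changes the weights only a handful of times.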

The training process consists of the following steps: