[NLP Fundamentals] Regular Expression (II): Text Cleaning

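2021 iThome Ironman Contest

DAY 3

AI & Data

"When Natural Language Processing Meets Deep Learning" series, part 3

Tags: 13th Ironman Contest, natural language processing, regular expression, python, text cleaning

Friedrich1942

2021-09-11 21:06:38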

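Introduction

Today we continue with regular expressions, this time focusing on text cleaning, one step in the natural language processing pipeline. You have probably heard the saying "Garbage in, garbage out": erroneous or meaningless data leads a computer to wrong decisions, and the consequences are often more destructive than having no data and doing nothing. In the era of big data, we no longer build an expert system by hand-crafting a finite set of rules as we did in the past. Instead, we rely on the characteristics of the data and let the computer learn the rules before making decisions. For this reason, whether the data is representative (and clean) becomes a crucial issue.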

Garbage In, Garbage Out
Image source: David Dibert Photos

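Cleaning Data with Regular Expressions

In the prologue of this series we saw that a corpus can contain HTML tags. Likewise, raw text obtained through web scraping is often riddled with large numbers of meaningless characters:

Image source: Cleaning Web-Scraped Data With Pandas and Regex!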

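Common categories of meaningless characters:

Punctuation: e.g. , . ! ' "
Special characters: e.g. Unicode characters, @ & #
Numbers: e.g. citation reference markers such as [1]
Whitespace: at the start of a line, at the end of a line, and between lines
HTML tags: e.g. <html>, </body>, <img src="http://www.TheSiteDoesNotExist/images/my-puppies-01.jpg" alt="my puppies" />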
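As a rough illustration of the categories above, here is a minimal sketch that strips each kind of noise in turn. The sample sentence and the specific patterns (`\[\d+\]` for citation markers, `[^\w\s]` for punctuation and special characters) are assumptions chosen for this example, not rules from the original article. Real data usually needs patterns tuned to the corpus.

```python
import re

# Hypothetical sample text containing several kinds of noise
sample = "Tomatoes are fruits [1], not vegetables!   Visit <b>our site</b> @home & enjoy."

# 1. Citation markers such as [1] or [23]
no_citations = re.sub(r"\[\d+\]", "", sample)

# 2. HTML tags (non-greedy); done before punctuation removal,
#    otherwise "<" and ">" would be stripped and tag names left behind
no_tags = re.sub(r"<.*?>", "", no_citations)

# 3. Punctuation and special characters: anything that is neither
#    a word character nor whitespace
no_punct = re.sub(r"[^\w\s]", "", no_tags)

# 4. Collapse runs of whitespace and trim both ends
cleaned = re.sub(r"\s+", " ", no_punct).strip()

print(cleaned)
# Tomatoes are fruits not vegetables Visit our site home enjoy
```

The order of the steps matters: removing punctuation first would destroy the angle brackets that the tag pattern depends on.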

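Let's see how to clean up the following HTML code:

Source: HTML Basics – Tags for Beginners

To build a pattern with regular expressions, we first import the re module. We then use the re.sub() function and pass it three required arguments:

pattern: the regular expression; here we can use r"<.*?>"
replacement_text: the string that replaces whatever matches the pattern (called repl in the re documentation); here we simply use the empty string ''
input: the string to be searched (called string in the re documentation)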
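The `?` in `<.*?>` makes the `*` quantifier non-greedy, so each match stops at the first closing `>` instead of the last one on the line. The short sketch below contrasts the two behaviours on a made-up one-line snippet (`snippet` is an illustrative string, not part of the article's example):

```python
import re

# Hypothetical one-line HTML fragment for illustration
snippet = "<p>Here's a <a href='#'>link</a>!</p>"

# Greedy: ".*" runs from the first "<" to the LAST ">", swallowing the text too
greedy = re.sub(r"<.*>", "", snippet)

# Non-greedy: ".*?" stops at the FIRST ">", so only the tags are removed
lazy = re.sub(r"<.*?>", "", snippet)

print(repr(greedy))  # ''
print(repr(lazy))    # "Here's a link!"
```

Note that `.` does not match a newline by default, so a tag that spans several lines would need the `re.DOTALL` flag. The tags in the example below each sit on a single line.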

import re

raw_text = """
<html>
   <head>
      <title>My Garden - Tomatoes</title>
   </head>
   <body>
   <h1>Garden Tomatoes</h1>
   <p>I decided to plant some tomatoes this spring. They're really taking off and I hope to have lots of tomatoes to give to all my friends and family this summer!</p>
   <p>Here are a few things I like about tomatoes:</p>
   <ol>
      <li>They taste great.</li>
      <li>They're good for me.</li>
      <li>They're easy to grow!</li>
   </ol>
   <p>Here's a picture of my garden:</p>
   <img src="http://www.mygardensite.com/images/my-garden-001.jpg" alt="a picture of my garden" />
   <p>Here's a <a href="http://www.welovetomatoes.com">link</a> to check out more interesting things about tomatoes!</p>
   </body>
</html>
"""


text_no_tags = re.sub(r"<.*?>", '', raw_text)
print(text_no_tags)

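Let's look at the output:

Have all the tags been removed? Next, let's also strip the redundant whitespace from the string!
Here we use the metacharacter \s, which matches spaces, tabs, and newlines. Since each run of meaningless whitespace is at least two characters long, the pattern can be written as \s{2,}. The code is as follows: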

# to remove redundant whitespaces
text_no_whitespace = re.sub(r"\s{2,}", ' ', text_no_tags)
print(text_no_whitespace)

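Let's examine the text after the whitespace has been removed:

That's it for cleaning data with regular expressions. Tomorrow we'll cover the steps of text preprocessing. Bis morgen!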
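To reproduce the two cleaning steps end to end, here is a minimal sketch on a shortened version of the HTML above. The helper name `clean_html_text` is hypothetical, and the final `.strip()` is an addition not in the original code. It removes the single space that `\s{2,}` leaves at the start and end of the text.

```python
import re

def clean_html_text(raw: str) -> str:
    """Remove HTML tags, then collapse runs of two or more whitespace characters."""
    no_tags = re.sub(r"<.*?>", "", raw)
    collapsed = re.sub(r"\s{2,}", " ", no_tags)
    # Assumption: leading/trailing spaces are unwanted, so trim them
    return collapsed.strip()

# Shortened excerpt of the example document
html = """
<ol>
   <li>They taste great.</li>
   <li>They're good for me.</li>
</ol>
"""

result = clean_html_text(html)
print(result)
# They taste great. They're good for me.
```

Because `\s{2,}` only matches runs of at least two characters, a lone newline between two lines of text would be left in place. Using `\s+` instead would normalise those as well.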
