[自然語言處理基礎] Regular Expression (I): 「Ctrl + F」立馬找出關鍵字

2021 iThome 鐵人賽

DAY 2

AI & Data

當自然語言處理遇上深度學習系列第 2 篇

13th鐵人賽 nlp 自然語言處理 regular expression python

Friedrich1942

2021-09-10 19:35:58

3633 瀏覽

分享至

關鍵字搜尋與自然語言處理的關聯

在正式介紹標題所提到的regular expression之前，我們先來聊聊為什麼搜尋關鍵字與自然語言處理有關。在瀏覽網頁時，我們會習慣性地「Ctrl+F」下指令（這是在Windows作業系統，若是在MacOS則使用「⌘+F」）然後輸入我們想要搜尋的關鍵字來找到目標。你或許會問，是什麼神奇的機制讓我們得以在成百上千的字海中不費吹灰立就能定位目標字詞呢？正則表達式(regular expression, RegEx)就是推動快捷搜尋的核心！
昨天我們提到在自然語言處理的一般流程當中，我們從原始文本或語料庫獲取資料，但其往往夾雜對於文義毫無意義的字符（如下圖中的字句帶有HTML標籤 <s>和</s>等）。這時我們就需要先將字句進行文本清理(text cleaning)與前處理(pre-processing)，才能進行下一步的特徵提取，而正則表達式正好協助我們進行資料清理！

Sketch Engine語料庫中關於Covid-19的詞條使用排行
圖片來源：https://www.sketchengine.eu/corpora-and-languages/english-text-corpora/

什麼是正則表達式

在維基百科中，正則表達式是這麼被定義的：

A regular expression is a sequence of characters that specifies a search pattern. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation. It is a technique developed in theoretical computer science and formal language theory.

也就是說它是由特定字符所構成，被用來搜尋符合其指定規則的其他字串。在很多程式語言中（如C++、Perl、Python）都能夠利用正則表達式來進行字串的描述與比對。

正則表達式字符與其意涵
資料來源：Computational Melody Analysis by Klaus Frieler

如何搜尋關鍵字

我們以Python來說明RegEx如何比對字串。首先引入Python標準函示庫中的模組re：

import re

我們可以透過字面常數（literal）表示符合特定規則的模式（pattern），例如我們想要在段落裡找出所有為Musk或Trump的文字，可以使用 re.findall(pattern, text) 函式來進行比對：

text = \
"Tesla doesn't have a press office. Its CEO, Elon Musk, says the company doesn't need one. Instead, in a similar way to Donald Trump, he uses Twitter rather than press releases to communicate."

# Finds every word that is either "Musk" or "Trump"
found = re.findall(r"Musk|Trump", text)
print(found) # ['Musk', 'Trump']

pattern可被理解成模板，所有text當中與之吻合的字串（字串text的不重疊子字串）都都會被收錄在列表(list)當中。值得注意的是，pattern以r"Regular_Expression"來表示，其中使用了前綴r表示原始字串（raw string）。而上述的正則表達式使用了 | 來表示替代（alternation），相當於邏輯符號中的OR。
（在正則表達式運算模組re中還定義了其他用來搜尋關鍵字的函式，如 re.match() 和 re.search() 都有各自的用法及含意，而本篇著重在正則表達式的初步描述，就不特別一一介紹了。）

我們可以用 ^ 來比對字串的開頭、用 $ 來比對結尾、也可以混用，這時候不妨加入括號 ( ) 以作為個別運算的單位：

print(re.findall(r"^Tes", text)) # ['Tes']
print(re.findall(r"cate.$", text)) # ['cate.']
print(re.findall(r"^(Tesla|Musk|Trump)", text)) # ['Tesla']
print(re.findall(r"(communicate|communicate.)$", text)) # ['communicate.']

在英文中，美英兩國習慣的拼寫方法不盡相同，如 honor v.s. honour、 realize v.s. realise 、 catalog v.s. catalogue 等等，我們可以由以下問題進行pattern的設計：

Q: 是否含有 honor 或 honour，連同開頭大小寫一並考慮？
A: 將 ? （optional quantifier）加在計算字符 u 之後，表示有無含 u 皆符合（只出現一次或未出現），再將所有符合的字元 H 和 h（並非字串）寫入[ ]（character set）。
程式碼及執行結果如下：

text = "On my honor, I will be there."

# Searches for "honor" or "honour"
# starting with a capital or lowercase letter
print(re.findall(r"[Hh]onou?r", text)) # ['honor']

時常在歌詞裡我們可以看見重複出現的字母：

Ohhhhhhh woahhhhhh When will you realize baby you're worth it Don't have to do anything to earn it. Baby you're perfect You deserve it ... ot something to say. Ohhhhhhh woahhhhhh When will you realize baby you're worth it Don't have to do anything to earn it. Baby you're perfect You deserve it ... ot something to say. Ohhhhhhh woahhhhhh When will you realize b

歌詞來源：You're Worth It - Cimorelli

若我們想要抓出所有充滿疊字的狀聲詞，可以使出 { } （fixed quantifier）。先選定重複出現的字符 \w （代指所有大小寫英文字母、阿拉伯數字0到9以及底線 _ 的字元，其表達式也可以記為 [A-Za-z0-9_]），再指定其重複次數（ {min,} : 至少出現 min 次、 {,Max} ：最多出現 Max 次、 {min,Max} ：介於兩者之間）來設計pattern：

lyrics = \
"Ohhhhhhh woahhhhhh When will you realize baby you're worth it Don't have to do anything to earn it. Baby you're perfect You deserve it ... ot something to say. Ohhhhhhh woahhhhhh When will you realize baby you're worth it Don't have to do anything to earn it. Baby you're perfect You deserve it ... ot something to say. Ohhhhhhh woahhhhhh When will you realize b"

print(re.findall(r"\w{1,3}h{3,}", lyrics)) # ['Ohhhhhhh', 'woahhhhhh', 'Ohhhhhhh', 'woahhhhhh', 'Ohhhhhhh', 'woahhhhhh']

有沒有開始感受到正則表達式的魔力了呢？關於正則表達式的基本概念就先寫到這裡，明天繼續努力！