[Day 19] Regular Expressions

2022 iThome Ironman Contest (14th)

Self-Challenge Group

Part 19 of the series "Learning Python in 30 Days from a Front-End Perspective"

allieschen

2022-10-03 19:42:24


re module
Match
Search
Find all
Substitute
Split
Case-insensitive matching

This post collects my study notes and takeaways from reading Asabeneh's 30 Days Of Python: Day 18 - Regular Expressions.

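The original covers this topic in a single day, but the more I looked things up while reading, the more I wanted to mention, so I am splitting it across two days.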

Day 19 focuses on re, Python's built-in module for regular expressions, and its functions.
Day 20 focuses on regular expression patterns.

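A regular expression, RegEx for short (in JavaScript, hereafter JS, it is the RegExp object), uses a special character syntax to find matching patterns in data. In practice I always pair it with a site like regex101, which explains the pattern and lets me quickly check the search results.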

The Python functions covered in this chapter are similar to String.prototype.match() or RegExp.prototype.exec() in JS.

`re` module

In Python, importing the re module gives access to the regular expression functions:

import re

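Here is a short summary of each function, based on the official documentation; they are easier to understand once you see them in practice:

re.match: matches only at the beginning of the string
re.search: scans the whole string and returns the first match
re.findall: scans the whole string and returns a list of all matches
re.sub: replaces the matched content with the specified replacement
re.split: splits the string at each match and returns a list of the pieces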

Match

Syntax: re.match(pattern, string, flags=0)
Returns: match object | None

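match() only compares at the beginning of the string: if the beginning does not match, the result is None. It is case-sensitive by default. For a multi-line string (one containing \n), it still only checks the very start of the string, even with re.MULTILINE; to look inside the string or on later lines, use search(), covered below: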

See: search() vs. match() | Python Docs

import re
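# "c" is in the string, but the string starts with "a", so there is no match
print(re.match("c", "abcdef")) # None
# "X" starts the last line, but match() only checks the start of the whole string
print(re.match("X", "A\nB\nX", re.MULTILINE)) # None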

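The third argument, re.MULTILINE, makes ^ and $ match at the start and end of every line rather than only at the start and end of the whole string, much like the m in string.match(/.../m) or new RegExp("...", "m") in JS. It does not change where match() looks, which is why the second call above still returns None.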
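To make the effect of re.MULTILINE more concrete, here is a minimal sketch that reuses the same three-line string. The variable names text and found are only for illustration.

```python
import re

text = "A\nB\nX"

# match() is anchored to the start of the whole string, with or without the flag
print(re.match("X", text, re.MULTILINE))  # None

# By default, ^ in search() only matches at the start of the string
print(re.search("^X", text))  # None

# With re.MULTILINE, ^ also matches right after each newline
found = re.search("^X", text, re.MULTILINE)
print(found.span())  # (4, 5)
```

So the flag matters for patterns that use ^ or $, and search() is the function that benefits from it here.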

When there is a match, span() on the match object returns the start and end indexes, which can be used with [start:end] to slice out the matched text:

import re

texts = "It was the best of times, it was the worst of times;"

match = re.match("It was the best of times", texts)

span = match.span()

start, end = span

print(texts[start: end]) # It was the best of times
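Because match() returns None when nothing matches, calling span() directly on the result can raise an AttributeError. The sketch below guards against that; first_match_text is a hypothetical helper name, not part of the re module. It also uses group(0), which returns the same text as slicing with the span.

```python
import re

texts = "It was the best of times, it was the worst of times;"

def first_match_text(pattern, string):
    """Return the text matched at the start of string, or None (hypothetical helper)."""
    m = re.match(pattern, string)
    if m is None:
        return None
    return m.group(0)  # same text as string[m.start():m.end()]

print(first_match_text("It was", texts))  # It was
print(first_match_text("it was", texts))  # None (case-sensitive, the string starts with "It")
```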

Search

Syntax: re.search(pattern, string, flags=0)
Returns: match object | None

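Unlike match(), search() scans the whole string by default, including later lines even without re.MULTILINE, but it returns only the first match. Adding ^ at the start of the pattern anchors the search to the beginning of the string, which (without re.MULTILINE) gives the same result as match():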

import re

print(re.search("c", "abcdef")) # <re.Match object; span=(2, 3), match='c'>

print(re.search("X", "A\nB\nX")) # <re.Match object; span=(4, 5), match='X'>

# "c" is in the string, but the string starts with "a", so there is no match
print(re.search("^c", "abcdef")) # None

texts = "It was the best of times, it was the worst of times;"

match = re.search("was the", texts)

span = match.span()
# The sentence contains "was the" twice; search() returns the first one and stops
print(span) # (3, 10)

start, end = span
print(texts[start: end]) # was the
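If the position of every occurrence is needed rather than only the first, one option is re.finditer(), which is not covered in the original notes. A minimal sketch with the same sentence:

```python
import re

texts = "It was the best of times, it was the worst of times;"

# finditer() yields one match object per non-overlapping match, left to right
spans = [m.span() for m in re.finditer("was the", texts)]
print(spans)  # [(3, 10), (29, 36)]
```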

Find all

Syntax: re.findall(pattern, string, flags=0)
Returns: list[str | tuple]

import re

texts = "It was the best of times, it was the worst of times;"

match = re.findall("was the", texts)

print(match) # ['was the', 'was the']
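The return type above mentions tuples, which only appear when the pattern contains more than one capture group. The patterns in this sketch are my own illustration on the same sentence:

```python
import re

texts = "It was the best of times, it was the worst of times;"

# No capture group: each item is the whole match
no_group = re.findall(r"the \w+", texts)
print(no_group)  # ['the best', 'the worst']

# One capture group: each item is the text of that group
one_group = re.findall(r"the (\w+)", texts)
print(one_group)  # ['best', 'worst']

# Two capture groups: each item is a tuple of the groups
two_groups = re.findall(r"(\w+) of (\w+)", texts)
print(two_groups)  # [('best', 'times'), ('worst', 'times')]
```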

Substitute

Syntax: re.sub(pattern, repl, string, count=0, flags=0)
Returns: str

Searches string for text matching pattern and replaces each match with repl, which can be a string or a function. If nothing matches, string is returned unchanged:

import re

texts = "It was the best of times, it was the worst of times;"

match = re.sub(" ", "_", texts)
print(match)
# It_was_the_best_of_times,_it_was_the_worst_of_times;

no_match = re.sub("BEST", "WORST", texts)
print(no_match)
# It was the best of times, it was the worst of times;

# Decide what each match should be replaced with
def repl(match):
    if match.group(0) == "best":
        return "BEST"
    return "WORST"
# Pass the function as the repl argument
repl_match = re.sub("best|worst", repl, texts)
print(repl_match)
# It was the BEST of times, it was the WORST of times;

The "|" in the pattern means "or"; here it looks for either "best" or "worst".
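The count parameter in the signature is not used in the examples above, so here is a small sketch of it, along with a backreference (\1) in a string repl. The replacement strings are arbitrary choices for illustration.

```python
import re

texts = "It was the best of times, it was the worst of times;"

# count=1 replaces only the first match
first_only = re.sub("times", "days", texts, count=1)
print(first_only)
# It was the best of days, it was the worst of times;

# \1 in repl refers to the text captured by the first group in the pattern
wrapped = re.sub(r"(best|worst)", r"<\1>", texts)
print(wrapped)
# It was the <best> of times, it was the <worst> of times;
```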

Split

Syntax: re.split(pattern, string, maxsplit=0, flags=0)
Returns: list

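Searches string for text matching pattern and splits the string at each match, returning a list of the pieces (without the separators). If maxsplit is given, at most that many splits are made and the rest of the string is returned as the last element of the list: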

import re

texts = "It was the best of times, it was the worst of times;"

match = re.split(r"\W+", texts)
print(match)
# ['It', 'was', 'the', 'best', 'of', 'times', 'it', 'was', 'the', 'worst', 'of', 'times', '']

no_match = re.split("BEST", texts)
print(no_match)
# ['It was the best of times, it was the worst of times;']

match_split = re.split(r"\W+", texts, maxsplit=1)
print(match_split)
# ['It', 'was the best of times, it was the worst of times;']

\W+ in the pattern matches one or more non-word characters, that is, characters outside [a-zA-Z0-9_] (for ASCII text).
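Two details of re.split() are easy to miss: the trailing '' in the first result comes from the final ";" being a separator with nothing after it, and a capture group in the pattern keeps the separators in the output. A minimal sketch, where the short string "a, b,c" is just a made-up example:

```python
import re

texts = "It was the best of times, it was the worst of times;"

# Drop empty strings such as the one produced by the trailing ";"
words = [w for w in re.split(r"\W+", texts) if w]
print(words[-3:])  # ['worst', 'of', 'times']

# A capture group in the pattern keeps the separators in the result
kept = re.split(r"(,)\s*", "a, b,c")
print(kept)  # ['a', ',', 'b', ',', 'c']
```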

Case-insensitive matching

Without a flags argument, match, search, and findall are all case-sensitive. There are a few ways to make a search case-insensitive:

Pass re.I or re.IGNORECASE as the third argument (flags):

import re

texts = "It was the best of times, it was the worst of times;"

match = re.findall("IT", texts, re.I)

print(match) # ['It', 'it']

Or spell out the case variations in the pattern itself:

import re

texts = "It was the best of times, it was the worst of times;"

match_1 = re.findall("[Ii][Tt]", texts)
print(match_1) # ['It', 'it']

match_2 = re.findall("IT|It|it", texts)
print(match_2) # ['It', 'it']
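Two related points, sketched below under the same example sentence. For re.sub() (and re.split()) the argument after string is count (or maxsplit), not flags, so flags is safest passed by keyword. The inline flag (?i) at the start of a pattern is another way to get the effect of re.I.

```python
import re

texts = "It was the best of times, it was the worst of times;"

# In re.sub the argument after string is count, so pass flags by keyword
replaced = re.sub("it", "IT", texts, flags=re.I)
print(replaced)
# IT was the best of times, IT was the worst of times;

# The inline flag (?i) at the start of the pattern behaves like re.I
inline = re.findall("(?i)IT", texts)
print(inline)  # ['It', 'it']
```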