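re module
This post collects my study notes and thoughts after reading Asabeneh's 30 Days Of Python: Day 18 - Regular Expressions.
The original spends only one day on this topic, but the more I looked things up while reading, the more I wanted to cover, so I am splitting it across two days.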
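re is Python's built-in module for working with regular expressions. A regular expression, RegEx for short (in JavaScript, hereafter JS, it is the RegExp object), uses a special character syntax to find matching patterns in data. In practice I always pair it with a site such as regex101, which explains the pattern and lets you check the search results quickly.
The Python methods covered in this chapter are similar to String.prototype.match() or RegExp.prototype.exec() in JS.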
re module
In Python, importing the re module gives access to the regular-expression methods:
import re
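Each method below is listed with its signature from the official documentation; they are easier to understand through working examples: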
re.match(pattern, string, flags=0)
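match() only matches at the very beginning of the string and returns None if nothing matches there; it is case-sensitive by default. With multi-line text (containing \n) it still anchors only at the start of the string, so it never starts a match on a later line. To find something in the middle of the string or on a later line, use search(), described later: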
import re
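# "c" is in the string, but the string starts with "a", so there is no match
print(re.match("c", "abcdef")) # None
# "X" is in the string, but on the last line rather than at the start, so there is no match
print(re.match("X", "A\nB\nX", re.MULTILINE)) # None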
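The third argument, re.MULTILINE, makes ^ and $ match at the start and end of every line instead of only at the start and end of the whole string. It plays the same role as the m in string.match(/.../m) or new RegExp("...", "m") in JS. It does not change where match() itself anchors, which is why the second call above still returns None.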
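To see what re.MULTILINE actually changes, here is a small sketch of my own (the "^X" and r"^\w" patterns are sample patterns I made up, not from the original article). It compares ^ with and without the flag on the same three-line string:

```python
import re

text = "A\nB\nX"

# match() anchors at the start of the whole string, with or without the flag.
print(re.match("X", text, re.MULTILINE))    # None

# Without re.MULTILINE, ^ matches only at the start of the string.
print(re.search("^X", text))                # None

# With re.MULTILINE, ^ also matches right after each newline.
multiline_hit = re.search("^X", text, re.MULTILINE)
print(multiline_hit.span())                 # (4, 5)

# findall() with ^ and re.MULTILINE picks up the first character of every line.
line_starts = re.findall(r"^\w", text, re.MULTILINE)
print(line_starts)                          # ['A', 'B', 'X']
```

So the flag affects how ^ and $ behave inside the pattern, not which part of the string a method is willing to look at.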
When there is a match, the match object's span() returns the start and end indexes, which you can combine with slicing ([start:end]) to extract the matched string:
import re
texts = "It was the best of times, it was the worst of times;"
match = re.match("It was the best of times", texts)
span = match.span()
start, end = span
print(texts[start: end]) # It was the best of times
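As a side note, the match object can also hand back the matched text directly. The sketch below is my own addition and uses group(), start() and end(), which are standard match object methods; it should give the same result as slicing with span():

```python
import re

texts = "It was the best of times, it was the worst of times;"
match = re.match("It was the best of times", texts)

# group(0) (or group() with no argument) is the whole matched text.
matched_text = match.group(0)
print(matched_text)                  # It was the best of times

# start() and end() are the two halves of span().
print(match.start(), match.end())    # 0 24
```

Slicing with span() is still handy when you need the text before or after the match rather than the match itself.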
re.search(pattern, string, flags=0)
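Unlike match(), search() scans the whole string by default, including later lines even without re.MULTILINE, but it returns only the first match. You can put ^ at the start of the search pattern to require the match to begin at the start of the string, which gives the same effect as match() (as long as re.MULTILINE is not set):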
import re
print(re.search("c", "abcdef")) # <re.Match object; span=(2, 3), match='c'>
print(re.search("X", "A\nB\nX")) # <re.Match object; span=(4, 5), match='X'>
# "c" is in the string, but the string starts with "a", so there is no match
print(re.search("^c", "abcdef")) # None
texts = "It was the best of times, it was the worst of times;"
match = re.search("was the", texts)
span = match.span()
# The sentence contains "was the" twice; search() returns the first one and stops
print(span) # (3, 10)
start, end = span
print(texts[start: end]) # was the
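One thing the examples above gloss over: calling span() on a failed search raises an AttributeError, because the result is None. A minimal sketch of a guard, where find_span is a hypothetical helper name of my own and not part of the re module:

```python
import re

texts = "It was the best of times, it was the worst of times;"


def find_span(pattern, string):
    """Return the span of the first match, or None when nothing matches."""
    found = re.search(pattern, string)
    if found is None:
        return None
    return found.span()


print(find_span("was the", texts))   # (3, 10)
print(find_span("^was", texts))      # None
```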
re.findall(pattern, string, flags=0)
findall() returns every non-overlapping match in the string as a list:
import re
texts = "It was the best of times, it was the worst of times;"
match = re.findall("was the", texts)
print(match) # ['was the', 'was the']
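A short sketch of my own for the two cases I find worth remembering with findall(): counting matches, and what comes back when nothing matches. The search words "times" and "spring" are just samples I picked:

```python
import re

texts = "It was the best of times, it was the worst of times;"

# Matches come back in order of appearance, so len() gives the count.
times_hits = re.findall("times", texts)
print(times_hits)        # ['times', 'times']
print(len(times_hits))   # 2

# No match gives an empty list rather than None.
no_hits = re.findall("spring", texts)
print(no_hits)           # []
```

Because the result is always a list, it can be looped over directly without a None check, unlike match() and search().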
re.sub(pattern, repl, string, count=0, flags=0)
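Searches the string argument for values matching pattern and, if any are found, replaces them with repl; repl can be a string or a function. If nothing matches, the string argument is returned unchanged: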
import re
texts = "It was the best of times, it was the worst of times;"
match = re.sub(" ", "_", texts)
print(match)
# It_was_the_best_of_times,_it_was_the_worst_of_times;
no_match = re.sub("BEST", "WORST", texts)
print(no_match)
# It was the best of times, it was the worst of times;
# Decide what each match should be replaced with
def repl(match):
    if match.group(0) == "best":
        return "BEST"
    else:
        return "WORST"
# Pass a function as the repl argument
repl_match = re.sub("best|worst", repl, texts)
print(repl_match)
# It was the BEST of times, it was the WORST of times;
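The count parameter in the signature is not used in the examples above, so here is a small sketch of my own. It assumes the same sample sentence; the replacement word "days" and the lambda are my own choices for illustration:

```python
import re

texts = "It was the best of times, it was the worst of times;"

# count=1 replaces only the first match; the default count=0 replaces all.
first_only = re.sub("times", "days", texts, count=1)
print(first_only)
# It was the best of days, it was the worst of times;

# A function repl can also derive the replacement from the matched text itself.
shouted = re.sub("best|worst", lambda m: m.group(0).upper(), texts)
print(shouted)
# It was the BEST of times, it was the WORST of times;
```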
The "|" symbol in the pattern argument means "or"; here it looks for either of the two words "best" or "worst".
re.split(pattern, string, maxsplit=0, flags=0)
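Searches the string argument for content matching pattern and uses it as the delimiter, returning a list of the pieces with the delimiters left out. If maxsplit is given, at most that many splits are made, and the rest of the string is placed in the returned list as a single final piece: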
import re
texts = "It was the best of times, it was the worst of times;"
match = re.split(r"\W+", texts)
print(match)
# ['It', 'was', 'the', 'best', 'of', 'times', 'it', 'was', 'the', 'worst', 'of', 'times', '']
no_match = re.split("BEST", texts)
print(no_match)
# ['It was the best of times, it was the worst of times;']
match_split = re.split(r"\W+", texts, maxsplit=1)
print(match_split)
# ['It', 'was the best of times, it was the worst of times;']
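The trailing empty string in the first result above comes from the ";" at the end of the sentence: the delimiter matches there and nothing follows it. A minimal sketch of my own for dropping such empty pieces, plus one more maxsplit value for comparison:

```python
import re

texts = "It was the best of times, it was the worst of times;"

# Keep only the non-empty pieces.
words = [piece for piece in re.split(r"\W+", texts) if piece]
print(len(words))   # 12
print(words[-1])    # times

# maxsplit=2 makes two splits, so the list has three pieces.
first_two = re.split(r"\W+", texts, maxsplit=2)
print(first_two)
# ['It', 'was', 'the best of times, it was the worst of times;']
```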
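\W+ in the pattern argument means one or more non-word characters; for ASCII text, \W is equivalent to [^a-zA-Z0-9_].
match, search, and findall are all case-sensitive when the third argument is not given; there are a few ways to make them case-insensitive: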
Pass re.I or re.IGNORECASE as the third argument (flags) to make the search case-insensitive:
import re
texts = "It was the best of times, it was the worst of times;"
match = re.findall("IT", texts, re.I)
print(match) # ['It', 'it']
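The same flag works for match() and search() too. A quick sketch of my own to check that, using sample patterns I picked ("it was" and "WORST"):

```python
import re

texts = "It was the best of times, it was the worst of times;"

# Case-sensitive by default: lowercase "it" does not match the leading "It".
print(re.match("it was", texts))           # None

# With re.I the same pattern matches at the start.
hit = re.match("it was", texts, re.I)
print(hit.group(0))                        # It was

# re.IGNORECASE is the long name for the same flag.
found = re.search("WORST", texts, re.IGNORECASE)
print(found.group(0))                      # worst
print(re.I is re.IGNORECASE)               # True
```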
Alternatively, spell out the accepted cases in the pattern itself, using character sets or "|":
import re
texts = "It was the best of times, it was the worst of times;"
match_1 = re.findall("[Ii][Tt]", texts)
print(match_1) # ['It', 'it']
match_2 = re.findall("IT|It|it", texts)
print(match_2) # ['It', 'it']
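[Ii] means the search accepts either I or i, and likewise [Tt] accepts either T or t; put together, the pattern can match any of the four forms IT, It, iT, and it. The "|" symbol, meaning "or", was already covered in the re.sub example above. Tomorrow's post will take a closer look at pattern syntax.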