[Day 20] RegEx Patterns

2022 iThome Ironman Contest

Self-Challenge Group

Learning Python in 30 Days from a Front-End Perspective, part 20 of the series

14th Ironman Contest

allieschen

2022-10-04 23:14:13

Patterns
Greedy vs Non-Greedy
re Flags
Compiling patterns to improve performance

This post is my study notes and reflections after reading Asabeneh's 30 Days Of Python: Day 18 - Regular Expressions.

Day 19 already looked quite full, so the topic continues here in Day 20.

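The original article mentions a prefix character, r, which marks a raw string literal. A RegEx pattern does not have to be declared with it; it simply keeps the pattern tidier because backslashes do not need to be doubled. Its role is similar to the two forward slashes that delimit a regex literal in JavaScript (JS from here on), /<regexp>/:

An explanation of raw string literals: see this answer

Google's Python tutorial recommends always using raw strings when writing RegEx patterns: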
"The 'r' at the start of the pattern string designates a python "raw" string which passes through backslashes without change which is very handy for regular expressions (Java needs this feature badly!). I recommend that you always write pattern strings with the 'r' just as a habit."

import re

regexp = "\\n"
regexp_raw = r"\\n"

print(f"regexp is: {regexp}\nregexp_raw is {regexp_raw}")
"""
regexp is: \n
regexp_raw is \\n
"""

texts = "may the force\nbe with you"
print(texts)
"""
may the force
be with you
"""

matches = re.search(regexp, texts)
matches_raw = re.search(regexp_raw, texts)

print(matches) # <re.Match object; span=(13, 14), match='\n'>
print(matches_raw) # None

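Without r, the variable regexp prints as \n: the first backslash escapes the second, so the string holds the two characters backslash and n, and re.search() then reads that pattern as a newline.
On the other hand, regexp_raw is a raw string (it has the r prefix), so both backslashes are kept and it prints \\n. As a pattern this means a literal backslash followed by n, which does not occur in the text, so the result is None.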
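To make the point above concrete, here is a minimal sketch (the variable names plain, raw, and found are my own, not from the original article). It shows that a normal literal with a doubled backslash and a raw literal with a single backslash hold the same two characters, and that the regex engine turns them into a newline match.

```python
import re

plain = "\\n"   # normal literal: the first backslash escapes the second
raw = r"\n"     # raw literal: the backslash is kept exactly as typed

same_string = plain == raw           # both hold backslash + n
lengths = (len(plain), len(raw))     # two characters each

text = "may the force\nbe with you"
found = re.search(raw, text)         # the regex engine reads \n as a newline
span = found.span()

print(same_string, lengths, span)    # True (2, 2) (13, 14)
```

In other words, r"..." changes only how Python reads the literal, not how re interprets the resulting pattern.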

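Here is another example, adapted from the one written by the asker of this question. I suggest reading the Patterns section below first and then coming back to it: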

import re

txt = 'hello    there    there'

# example 1 - raw strings
regex = r'(\w+)(\s+\1)+'
repl = r'\1'

print(f'regex is: {regex}\nrepl is: {repl}')
"""
regex is: (\w+)(\s+\1)+
repl is: \1
"""
print (re.sub(regex, repl, txt))
# hello    there


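# example 2 - not raw strings ('\1' becomes the control character \x01;
# newer Python versions also warn about the invalid escapes '\w' and '\s')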
regex_2 = '(\w+)(\s+\1)+'
repl_2 = '\1'

print(f'regex_2 is: {regex_2}\nrepl_2 is: {repl_2}')
"""
regex_2 is: (\w+)(\s+)+
repl_2 is: 
"""
print (re.sub(regex_2, repl_2, txt))
# hello    there    there

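\1 refers to group 1, that is, whatever (\w+) matched. Both hello and there are candidates for (\w+), but the pattern requires word A + whitespace + word A again, so only there qualifies.
In example 2, neither the r prefix nor an extra escaping backslash is used, so Python itself interprets \1 in regex_2 and repl_2 as an octal escape (the control character \x01) before re ever sees it, and re.sub() finds no match. The pattern prints as (\w+)(\s+)+ only because that control character is invisible; the call still returns the original txt, which is not the same as actually writing the pattern like this: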

import re

txt = 'hello    there    there'

# the pattern exactly as it was printed above, written as a raw string
regex_2 = r'(\w+)(\s+)+'
repl_2 = ''

print(re.sub(regex_2, repl_2, txt))  # there
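The difference between the two examples comes down to what Python stores for \1. The following is a small sketch, with variable names of my own choosing, that inspects both literals and then repeats the two substitutions; the non-raw pattern doubles the backslashes in front of w and s so that only \1 is affected.

```python
import re

octal = "\1"       # normal literal: \1 is an octal escape, one character (U+0001)
backref = r"\1"    # raw literal: a backslash followed by the digit 1

print(len(octal), octal == "\x01")      # 1 True
print(len(backref), backref == "\\1")   # 2 True

txt = "hello    there    there"

# raw strings: \1 reaches re as a backreference to group 1
collapsed = re.sub(r"(\w+)(\s+\1)+", backref, txt)

# non-raw \1: the pattern asks for whitespace followed by U+0001, so nothing matches
unchanged = re.sub("(\\w+)(\\s+\1)+", octal, txt)

print(repr(collapsed))   # 'hello    there'
print(unchanged == txt)  # True
```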

Patterns

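Much like in JS, the characters below can be used in patterns, and each has its own meaning. I strongly recommend learning by doing with the lessons on RegexOne. Here are a few examples as a quick overview: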

import re

txt = "So we beat on, boats against the current, borne back ceaselessly into the past."

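# [] matches any one of the characters inside the brackets; a-z means any character from a to z
print(re.search(r"[a-z0-9]", txt))  # matches 'o'

# \ is the escape character; as mentioned above, escaping certain letters gives them a special meaning
# \w matches a word character; for ASCII text this is [a-zA-Z0-9_]
# (with str patterns it also matches Unicode letters and digits unless re.ASCII is set)
print(re.search(r"[a-c]\w", txt))  # matches 'be'

# . matches any single character (except a newline by default)
print(re.search(r".", txt))  # matches 'S'

# ^ anchors the match at the start of the string
print(re.findall(r"^So", txt))  # ['So']

# [^...] matches any character that is NOT listed after the ^
print(re.findall(r"[^a-z\s]", txt))  # ['S', ',', ',', '.']

# $ anchors the match at the end of the string; nothing may follow it
print(re.search(r"st\.$", txt))  # matches 'st.'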

# Look at * + ? together to compare them
# * means zero or more
print(re.findall(r"t\w*", txt))
# ['t', 'ts', 't', 'the', 't', 'to', 'the', 't']

# + means one or more; compared with *, a lone t is no longer included
print(re.findall(r"t\w+", txt))  # ['ts', 'the', 'to', 'the']

# ? means zero or one; compared with *, it matches th rather than the
print(re.findall(r"t\w?", txt))  # ['t', 'ts', 't', 'th', 't', 'to', 'th', 't']

# {5} repeats the preceding item exactly the number of times in the braces;
# here it finds runs of exactly 5 word characters
# ['boats', 'again', 'curre', 'borne', 'cease', 'lessl']
print(re.findall(r"\w{5}", txt))

# {5,} repeats the preceding item at least that many times
# ['boats', 'against', 'current', 'borne', 'ceaselessly']
print(re.findall(r"\w{5,}", txt))

# {5,9} repeats the preceding item between 5 and 9 times (inclusive)
# ['boats', 'against', 'current', 'borne', 'ceaseless']
print(re.findall(r"\w{5,9}", txt))

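# | means "or"
print(re.findall(r"beat|boat", txt))  # ['beat', 'boat']

# () creates a group; with .group(0) (the whole match) in this example,
# the result would also contain the leading whitespace (\s)
# to leave the whitespace out, pick group 1, i.e. the (c\w+) part
print(re.search(r"\s(c\w+)", txt).group(1))  # "current"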

Greedy vs Non-Greedy

Reference: Google's Python tutorial

The reference gives this example: suppose you have the string <b>foo</b><i>so on</i> and you search it with the pattern "<.*>". What do you get back?

import re

els = "<b>foo</b><i>so on</i>"

result = re.search(r"<.*>", els)

print(els[result.start():result.end()])
# <b>foo</b><i>so on</i>

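As you can see, the result is not <b> but the entire string (which also starts with < and ends with >, so it satisfies <.*>). This behavior of .* is called greedy.

Conversely, to stop at the first place where the pattern is satisfied, that is, at <b>, add ? after the quantifier so the pattern becomes <.*?>. Here ? does not mean zero-or-one: placed after * it makes the quantifier match as few characters as possible, which is called non-greedy.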
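A short sketch comparing the two forms on the same string (the names greedy, lazy, and all_tags are mine, chosen for illustration):

```python
import re

els = "<b>foo</b><i>so on</i>"

greedy = re.search(r"<.*>", els).group()   # .* runs to the last > in the string
lazy = re.search(r"<.*?>", els).group()    # .*? stops at the first > it can
all_tags = re.findall(r"<.*?>", els)       # every tag, one at a time

print(greedy)    # <b>foo</b><i>so on</i>
print(lazy)      # <b>
print(all_tags)  # ['<b>', '</b>', '<i>', '</i>']
```

The same ? suffix works on the other quantifiers too (+? and ??), though only *? is used here.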

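Fun fact: the article mentions that *? originated in the Perl language. Regular expressions that include Perl's extensions are called Perl Compatible Regular Expressions, abbreviated PCRE. On the RegEx tool sites regex101 and RegExr, you can see this name where you choose which engine to use.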

`re` Flags

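The re module methods covered in Day 19 (Regular Expressions) all take a flags parameter. Passing specific values to it changes how the pattern is matched, which helps keep the patterns themselves simpler.

Here are a few commonly used flags, most of which already came up with yesterday's methods (reference: this topic on RegexOne):

re.IGNORECASE: ignores case.
re.MULTILINE: when the string contains newline characters (\n), this makes "starts with" (^) and "ends with" ($) match at the start and end of every line, rather than only at the start and end of the whole string.
re.DOTALL: makes the dot (.) match every character, including the newline character (\n).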
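As a rough illustration of these three flags, the sketch below runs the same pattern with and without each one. The sample text and variable names are assumptions of mine, not taken from the referenced lesson.

```python
import re

text = "May the Force\nbe with you"

# re.IGNORECASE: "force" also matches "Force"
no_case_flag = re.findall(r"force", text)
with_case_flag = re.findall(r"force", text, flags=re.IGNORECASE)

# re.MULTILINE: ^ matches at the start of every line, not only the string
first_word = re.findall(r"^\w+", text)
first_words = re.findall(r"^\w+", text, flags=re.MULTILINE)

# re.DOTALL: . is allowed to match the newline between the two lines
no_dotall = re.search(r"Force.be", text)
with_dotall = re.search(r"Force.be", text, flags=re.DOTALL).group()

# flags can be combined with |
combined = re.findall(r"^may|^BE", text, flags=re.IGNORECASE | re.MULTILINE)

print(no_case_flag, with_case_flag)  # [] ['Force']
print(first_word, first_words)       # ['May'] ['May', 'be']
print(no_dotall, repr(with_dotall))  # None 'Force\nbe'
print(combined)                      # ['May', 'be']
```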

List of flags | Python Docs

Compiling patterns to improve performance

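This is also based on the RegexOne topic mentioned in the re Flags section.

In Python, building a new regular expression pattern for each of a large number of strings can slow execution down. So when everything you want to test or extract uses the same expression, you can call re.compile(), which returns a compiled pattern object (re.Pattern in Python 3; older material calls it re.RegexObject):

Syntax: regexObject = re.compile(pattern, flags=0)

When you call the methods from Day 19 on this object, you no longer need to pass the pattern argument: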

import re

txt = "So we beat on, boats against the current, borne back ceaselessly into the past."

regex = re.compile(r"(b\w+\b)")

result = regex.search(txt)

print(txt[result.start():result.end()] + "\n---")
"""
beat
---
"""

for result in regex.findall(txt):
    print(result)
"""
beat
boats
borne
back
"""

print(regex.sub(r"--\1--", txt))
# So we --beat-- on, --boats-- against the current, --borne-- --back-- ceaselessly into the past.
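The typical use case is one pattern applied to many strings. Here is a minimal sketch of that idea, assuming a small made-up list of lines; note that flags are given once to re.compile() rather than to each method call.

```python
import re

# hypothetical input: several strings checked with the same expression
lines = [
    "boats against the current",
    "ceaselessly into the past",
    "Borne back",
]

# compile once, with the flag attached to the compiled pattern
word_b = re.compile(r"\bb\w+", flags=re.IGNORECASE)

# reuse the same object for every string
hits = [word_b.findall(line) for line in lines]

print(hits)            # [['boats'], [], ['Borne', 'back']]
print(word_b.pattern)  # \bb\w+
```

The re module also keeps an internal cache of recently compiled patterns, so the gain from compiling by hand is usually modest; the clearer benefit is giving the pattern a name and defining it in one place.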