iT邦幫忙

2023 iThome Ironman Contest

DAY 11
AI & Data

"Learning PySpark in 30 Days by Fumbling Around" series, Part 11

[ Day 11 ] - PySpark | Cleaning - Strings 2.2: regexp_extract(), regexp_replace(), rlike()


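After the long-winded explanation in the previous post, you should have a good idea of what regular expressions are for.
So today let's dive into the beautiful music that PySpark and regular expressions make together (what am I even saying).

Today we'll talk about PySpark's three regular expression brothers:

  • regexp_extract()
  • regexp_replace()
  • rlike()

Their use cases differ a little, but all three process data with regular expressions, so I figure it's fine to cover them together XD

Let's get started!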

1. regexp_extract()

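It takes three parameters:
regexp_extract(col, pattern, index)
col : the column (or column name) you want to process
pattern : a regular expression; a plain string works too, since it is simply a regex with no special characters
index : which capture group to return; 0 returns the whole match, 1 the first parenthesized group, and so on

Use case: reach for extract when you want to pull out the text that matches a specific pattern. Only the first match in each value is returned, and a value with no match yields an empty string.
Why is this tied to regular expressions? Because the pattern argument accepts not only plain strings but any regex pattern, which gives you much more room to work with when processing data!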

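from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_extract

# Reuse the active session if there is one (the pyspark shell already provides `spark` and `sc`)
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(
    [
        ("drinking,play", 2, "Carmen", 23, "Female"),
        ("moving,music", 2, "Juliette", 16, "Female"),
        ("writing,draw", 2, "Don José", 25, "Male"),
        ("sleeping,run", 2, "Escamillo", 30, "Male"),
        ("playing,climb", 2, "Roméo", 18, "Male"),
    ]
)
df = rdd.toDF(["Thing", "Hour", "Name", "Age", "Gender"])
df.show()
# Extract with a plain string
df.select(regexp_extract(col('Thing'), 'ing', 0).alias('extract "ing"')).show()
# Extract with a regex rule (raw strings keep the backslash intact)
df.select(regexp_extract(col('Thing'), r'\w*', 0).alias(r'extract by "\w*"')).show()
r'''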
+---------+---+------------+Original Data+---------+---+------------+
df.show()
+-------------+----+---------+---+------+
|        Thing|Hour|     Name|Age|Gender|
+-------------+----+---------+---+------+
|drinking,play|   2|   Carmen| 23|Female|
| moving,music|   2| Juliette| 16|Female|
| writing,draw|   2| Don José| 25|  Male|
| sleeping,run|   2|Escamillo| 30|  Male|
|playing,climb|   2|    Roméo| 18|  Male|
+-------------+----+---------+---+------+
+---------+---+------------+Original Data+---------+---+------------+

+---------+---+------------+OUTPUT+---------+---+------------+
df.select(regexp_extract(col('Thing'),'ing',0).alias('extract "ing"')).show()
+-------------+
|extract "ing"|
+-------------+
|          ing|
|          ing|
|          ing|
|          ing|
|          ing|
+-------------+
+---------+---+------------+OUTPUT+---------+---+------------+

+---------+---+------------+OUTPUT+---------+---+------------+
df.select(regexp_extract(col('Thing'),'\w*',0).alias('extract by "\w*"')).show()
+----------------+
|extract by "\w*"|
+----------------+
|        drinking|
|          moving|
|         writing|
|        sleeping|
|         playing|
+----------------+
+---------+---+------------+OUTPUT+---------+---+------------+

'''
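
To make the index parameter a bit more concrete, here is a small sketch in plain Python rather than PySpark. The helper extract() is hypothetical: it is built on Python's re module and only assumed to behave like regexp_extract for simple ASCII patterns such as these (Spark itself evaluates patterns with Java's regex engine). It shows that index 0 is the whole match while 1 and 2 pick the parenthesized groups.

```python
import re


def extract(text, pattern, idx):
    """Hypothetical stand-in for regexp_extract: first match only, '' if no match."""
    m = re.search(pattern, text)
    if m is None:
        return ""
    # A group that did not take part in the match comes back as None
    return m.group(idx) or ""


thing = "drinking,play"
pattern = r"(\w+),(\w+)"  # two capture groups separated by a comma

whole = extract(thing, pattern, 0)    # entire match
first = extract(thing, pattern, 1)    # first group
second = extract(thing, pattern, 2)   # second group
missing = extract("no comma here", pattern, 1)  # pattern does not match

print(whole, first, second, repr(missing))
# prints: drinking,play drinking play ''
```

In PySpark the equivalent call would look like regexp_extract(col('Thing'), r'(\w+),(\w+)', 2), which should return the activity after the comma.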

2. regexp_replace()

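It takes three parameters:
regexp_replace(col, pattern, replace_by)
col : the column (or column name) you want to process
pattern : a regular expression; a plain string works too, since it is simply a regex with no special characters
replace_by : what every match of the pattern should be replaced with

Use case: reach for replace when you want to swap out the text that matches a specific pattern. Every match in the value is replaced, not just the first one.
Why is this tied to regular expressions? Because the pattern argument accepts not only plain strings but any regex pattern, which gives you much more room to work with when processing data!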

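from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_replace

# Reuse the active session if there is one (the pyspark shell already provides `spark` and `sc`)
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(
    [
        ("drinking,play", 2, "Carmen", 23, "Female"),
        ("moving,music", 2, "Juliette", 16, "Female"),
        ("writing,draw", 2, "Don José", 25, "Male"),
        ("sleeping,run", 2, "Escamillo", 30, "Male"),
        ("playing,climb", 2, "Roméo", 18, "Male"),
    ]
)
df = rdd.toDF(["Thing", "Hour", "Name", "Age", "Gender"])
df.show()
# Replace using a plain string
df.select(regexp_replace(col('Thing'), 'ing', '').alias('replace by "ing"')).show()
# Replace using a regex rule (raw strings keep the backslash intact)
df.select(regexp_replace(col('Thing'), r'\w*', '').alias(r'replace by "\w*"')).show()
r'''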
+---------+---+------------+Original Data+---------+---+------------+
df.show()
+-------------+----+---------+---+------+
|        Thing|Hour|     Name|Age|Gender|
+-------------+----+---------+---+------+
|drinking,play|   2|   Carmen| 23|Female|
| moving,music|   2| Juliette| 16|Female|
| writing,draw|   2| Don José| 25|  Male|
| sleeping,run|   2|Escamillo| 30|  Male|
|playing,climb|   2|    Roméo| 18|  Male|
+-------------+----+---------+---+------+
+---------+---+------------+Original Data+---------+---+------------+

+---------+---+------------+OUTPUT+---------+---+------------+
df.select(regexp_replace(col('Thing'),'ing','').alias('replace by "ing"')).show()
+----------------+
|replace by "ing"|
+----------------+
|      drink,play|
|       mov,music|
|       writ,draw|
|       sleep,run|
|      play,climb|
+----------------+
+---------+---+------------+OUTPUT+---------+---+------------+

+---------+---+------------+OUTPUT+---------+---+------------+
df.select(regexp_replace(col('Thing'),'\w*','').alias('replace by "\w*"')).show()
+----------------+
|replace by "\w*"|
+----------------+
|               ,|
|               ,|
|               ,|
|               ,|
|               ,|
+----------------+
+---------+---+------------+OUTPUT+---------+---+------------+

'''
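
One thing worth spelling out: a "plain string" pattern is still read as a regex, so characters like "." keep their special meaning unless escaped. The sketch below uses Python's re.sub as a stand-in for regexp_replace; that is an assumption made so it runs without Spark, and it should hold for simple ASCII patterns like these even though Spark uses Java's regex engine. The version string is made-up sample data.

```python
import re

thing = "drinking,play"

# No metacharacters: behaves like an ordinary search-and-replace
no_ing = re.sub(r"ing", "", thing)

# \w* swallows every run of word characters, so only the comma survives
only_comma = re.sub(r"\w*", "", thing)

# "." is a metacharacter: unescaped it matches any character
version = "v1.2"  # hypothetical sample value
unescaped = re.sub(r".", "-", version)
escaped = re.sub(r"\.", "-", version)

print(no_ing, only_comma, unescaped, escaped)
# prints: drink,play , ---- v1-2
```

So when the text to replace contains characters such as ".", "*", "(" or "+", escaping them in the pattern is the safer habit.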

3. rlike()

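It takes one parameter and is called on a column:
col.rlike(pattern)
pattern : a regular expression; a plain string works too, since it is simply a regex with no special characters

Use case: mostly for searching and filtering. It is a bit like like combined with regular expressions: a row passes when the pattern matches anywhere in the value.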

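from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Reuse the active session if there is one (the pyspark shell already provides `spark` and `sc`)
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(
    [
        ("drinking,play", 2, "Carmen", 23, "Female"),
        ("moving,music", 2, "Juliette", 16, "Female"),
        ("writing,draw", 2, "Don José", 25, "Male"),
        ("sleeping,run", 2, "Escamillo", 30, "Male"),
        ("climb 3 mountains", 2, "Roméo", 18, "Male"),
    ]
)
df = rdd.toDF(["Thing", "Hour", "Name", "Age", "Gender"])
df.show()
# Filter with a plain string
df.select('Thing').filter(col('Thing').rlike('ing')).show()
# Filter with a regex rule (raw string keeps the backslash intact)
df.select('Thing').filter(col('Thing').rlike(r'\d+')).show()
r'''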
+---------+---+------------+Original Data+---------+---+------------+
df.show()
+-----------------+----+---------+---+------+
|            Thing|Hour|     Name|Age|Gender|
+-----------------+----+---------+---+------+
|    drinking,play|   2|   Carmen| 23|Female|
|     moving,music|   2| Juliette| 16|Female|
|     writing,draw|   2| Don José| 25|  Male|
|     sleeping,run|   2|Escamillo| 30|  Male|
|climb 3 mountains|   2|    Roméo| 18|  Male|
+-----------------+----+---------+---+------+
+---------+---+------------+Original Data+---------+---+------------+

+---------+---+------------+OUTPUT+---------+---+------------+
df.select('Thing').filter(col('Thing').rlike('ing')).show()
+-------------+
|        Thing|
+-------------+
|drinking,play|
| moving,music|
| writing,draw|
| sleeping,run|
+-------------+
+---------+---+------------+OUTPUT+---------+---+------------+

+---------+---+------------+OUTPUT+---------+---+------------+
df.select('Thing').filter(col('Thing').rlike('\d+')).show()
+-----------------+
|            Thing|
+-----------------+
|climb 3 mountains|
+-----------------+
+---------+---+------------+OUTPUT+---------+---+------------+

'''
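
Because rlike passes a row as soon as the pattern matches anywhere in the value, anchors such as ^ are how you narrow it down. Here is a rough sketch in plain Python: the rlike() helper is hypothetical, built on re.search, and assumed to mirror Column.rlike for simple ASCII patterns like these.

```python
import re

things = [
    "drinking,play",
    "moving,music",
    "writing,draw",
    "sleeping,run",
    "climb 3 mountains",
]


def rlike(text, pattern):
    """Hypothetical stand-in for Column.rlike: True if the pattern matches anywhere."""
    return re.search(pattern, text) is not None


has_ing = [t for t in things if rlike(t, r"ing")]      # "ing" anywhere in the value
has_digit = [t for t in things if rlike(t, r"\d+")]    # at least one digit
starts_with_s = [t for t in things if rlike(t, r"^s")]  # anchored to the start

print(has_ing)
print(has_digit)
print(starts_with_s)
# prints:
# ['drinking,play', 'moving,music', 'writing,draw', 'sleeping,run']
# ['climb 3 mountains']
# ['sleeping,run']
```

The first two lists line up with the filter results shown above; the third is an extra case showing what an anchor changes.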

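■ Conclusion

Make good use of regular expressions, pair them with the right parameters, and data processing becomes easy!

If anything is unclear or wrong, or you have another approach to share, feel free to leave me a comment! If you enjoyed this, likes and subscriptions are welcome too!
I'm Vivi, a data engineer struggling in the cloud! See you in the next post! Bye Bye~
[This article will also be published on my personal Medium. I look forward to meeting you there!]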

