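After the long-winded explanation in the previous post, you should have a pretty good idea of what regular expressions are for.
So today, let's dive into the beautiful music that PySpark and regular expressions make together (what am I even saying).
Today we'll talk about PySpark's three regular expression brothers.
Their use cases differ a little, but all three use regular expressions to process data, so I think it's fine to cover them together XD
Let's get started!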
regexp_extract()
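It takes three arguments: regexp_extract(col, pattern, index)
col: the column you want to process (a Column object or a column-name string)
pattern: a regular expression, or just a plain string (a string with no special characters is simply a regex that matches itself literally)
index: which capture group to return; 0 returns the entire match
When to use it: when you want to pull out the part of a value that matches a specific pattern.
Why is this related to regular expressions? Because besides a plain string, the pattern argument accepts any regular expression, which gives you far more ways to process your data!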
from pyspark.sql.functions import col, regexp_extract

# `sc` is the SparkContext provided by the PySpark shell or notebook
rdd = sc.parallelize(
    [
        ("drinking,play", 2, "Carmen", 23, 'Female'),
        ("moving,music", 2, "Juliette", 16, 'Female'),
        ("writing,draw", 2, "Don José", 25, 'Male'),
        ("sleeping,run", 2, "Escamillo", 30, 'Male'),
        ("playing,climb", 2, "Roméo", 18, 'Male')
    ]
)
df = rdd.toDF(["Thing", "Hour", "Name", "Age", 'Gender'])
df.show()
## Extract with a plain string
df.select(regexp_extract(col('Thing'), 'ing', 0).alias('extract "ing"')).show()
## Extract with a regular expression (the raw string keeps the backslash intact)
df.select(regexp_extract(col('Thing'), r'\w*', 0).alias(r'extract by "\w*"')).show()
'''
+---------+---+------------+Original Data+---------+---+------------+
df.show()
+-------------+----+---------+---+------+
|        Thing|Hour|     Name|Age|Gender|
+-------------+----+---------+---+------+
|drinking,play|   2|   Carmen| 23|Female|
| moving,music|   2| Juliette| 16|Female|
| writing,draw|   2| Don José| 25|  Male|
| sleeping,run|   2|Escamillo| 30|  Male|
|playing,climb|   2|    Roméo| 18|  Male|
+-------------+----+---------+---+------+
+---------+---+------------+Original Data+---------+---+------------+
+---------+---+------------+OUTPUT+---------+---+------------+
df.select(regexp_extract(col('Thing'),'ing',0).alias('extract "ing"')).show()
+-------------+
|extract "ing"|
+-------------+
|          ing|
|          ing|
|          ing|
|          ing|
|          ing|
+-------------+
+---------+---+------------+OUTPUT+---------+---+------------+
+---------+---+------------+OUTPUT+---------+---+------------+
df.select(regexp_extract(col('Thing'),'\w*',0).alias('extract by "\w*"')).show()
+----------------+
|extract by "\w*"|
+----------------+
|        drinking|
|          moving|
|         writing|
|        sleeping|
|         playing|
+----------------+
+---------+---+------------+OUTPUT+---------+---+------------+
'''
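The third argument deserves one more look: it picks a capture group, and 0 means the whole match. Here is a minimal sketch of that idea. It uses Python's built-in re module as a local stand-in for the Java regex engine that Spark uses, so it runs without a cluster. The helper extract() is hypothetical (it is not a PySpark function) and only mimics the documented behaviour of regexp_extract, including returning an empty string when nothing matches. For simple ASCII patterns like these the two engines should agree, but treat it as an illustration, not a guarantee.

```python
import re


def extract(text, pattern, idx):
    """Hypothetical stand-in for regexp_extract(col, pattern, idx)."""
    m = re.search(pattern, text)
    if m is None:
        # Mimic Spark: no match gives an empty string, not None
        return ""
    # A group that did not participate in the match also becomes ""
    return m.group(idx) or ""


# group 1 = the word before the comma, group 2 = the word after it
pattern = r"(\w+),(\w+)"
things = ["drinking,play", "moving,music", "climb 3 mountains"]

for t in things:
    print(t, "->", [extract(t, pattern, i) for i in (0, 1, 2)])
```

In PySpark the equivalent call would look like regexp_extract(col('Thing'), r'(\w+),(\w+)', 2), which should return the hobby after the comma.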
regexp_replace()
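It takes three arguments: regexp_replace(col, pattern, replace_by)
col: the column you want to process (a Column object or a column-name string)
pattern: a regular expression, or just a plain string
replace_by: what every match of the pattern should be replaced with
When to use it: when you want to replace whatever matches a specific pattern.
Why is this related to regular expressions? Same reason as before: besides a plain string, the pattern argument accepts any regular expression, which gives you far more ways to process your data!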
from pyspark.sql.functions import col, regexp_replace

# `sc` is the SparkContext provided by the PySpark shell or notebook
rdd = sc.parallelize(
    [
        ("drinking,play", 2, "Carmen", 23, 'Female'),
        ("moving,music", 2, "Juliette", 16, 'Female'),
        ("writing,draw", 2, "Don José", 25, 'Male'),
        ("sleeping,run", 2, "Escamillo", 30, 'Male'),
        ("playing,climb", 2, "Roméo", 18, 'Male')
    ]
)
df = rdd.toDF(["Thing", "Hour", "Name", "Age", 'Gender'])
df.show()
## Replace using a plain string
df.select(regexp_replace(col('Thing'), 'ing', '').alias('replace by "ing"')).show()
## Replace using a regular expression (the raw string keeps the backslash intact)
df.select(regexp_replace(col('Thing'), r'\w*', '').alias(r'replace by "\w*"')).show()
'''
+---------+---+------------+Original Data+---------+---+------------+
df.show()
+-------------+----+---------+---+------+
|        Thing|Hour|     Name|Age|Gender|
+-------------+----+---------+---+------+
|drinking,play|   2|   Carmen| 23|Female|
| moving,music|   2| Juliette| 16|Female|
| writing,draw|   2| Don José| 25|  Male|
| sleeping,run|   2|Escamillo| 30|  Male|
|playing,climb|   2|    Roméo| 18|  Male|
+-------------+----+---------+---+------+
+---------+---+------------+Original Data+---------+---+------------+
+---------+---+------------+OUTPUT+---------+---+------------+
df.select(regexp_replace(col('Thing'),'ing','').alias('replace by "ing"')).show()
+----------------+
|replace by "ing"|
+----------------+
|      drink,play|
|       mov,music|
|       writ,draw|
|       sleep,run|
|      play,climb|
+----------------+
+---------+---+------------+OUTPUT+---------+---+------------+
+---------+---+------------+OUTPUT+---------+---+------------+
df.select(regexp_replace(col('Thing'),'\w*','').alias('replace by "\w*"')).show()
+----------------+
|replace by "\w*"|
+----------------+
|               ,|
|               ,|
|               ,|
|               ,|
|               ,|
+----------------+
+---------+---+------------+OUTPUT+---------+---+------------+
'''
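A plain 'ing' removes every "ing" it finds, even in the middle of a word, which is where a real regular expression starts to pay off. The sketch below compares three patterns side by side. It again uses Python's re.sub as a local stand-in for Spark's Java regex engine, and "singing,ring" is a made-up value (it is not in the DataFrame above), chosen because "ing" appears mid-word.

```python
import re

# "singing,ring" is a hypothetical value where "ing" also appears mid-word
things = ["singing,ring", "playing,climb"]

results = {}
for t in things:
    plain = re.sub("ing", "", t)       # removes every "ing", wherever it is
    suffix = re.sub(r"ing\b", "", t)   # removes "ing" only at the end of a word
    words = re.sub(r"\w+", "", t)      # removes whole words, leaving the comma
    results[t] = (plain, suffix, words)
    print(f"{t!r}: {plain!r} | {suffix!r} | {words!r}")
```

The PySpark counterpart of the middle pattern would be regexp_replace(col('Thing'), r'ing\b', ''). Also note that \w+ needs at least one character, while \w* can match the empty string too; with an empty replacement the result is the same, but with a non-empty replacement the two can differ.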
rlike()
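It takes one argument and is called on a column: col('Thing').rlike(pattern)
pattern: a regular expression, or just a plain string
When to use it: typically for searching and filtering. It works a bit like like combined with regular expressions: a row is kept when the pattern matches anywhere in the value.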
from pyspark.sql.functions import col

# `sc` is the SparkContext provided by the PySpark shell or notebook
rdd = sc.parallelize(
    [
        ("drinking,play", 2, "Carmen", 23, 'Female'),
        ("moving,music", 2, "Juliette", 16, 'Female'),
        ("writing,draw", 2, "Don José", 25, 'Male'),
        ("sleeping,run", 2, "Escamillo", 30, 'Male'),
        ("climb 3 mountains", 2, "Roméo", 18, 'Male')
    ]
)
df = rdd.toDF(["Thing", "Hour", "Name", "Age", 'Gender'])
df.show()
## Filter with a plain string
df.select('Thing').filter(col('Thing').rlike('ing')).show()
## Filter with a regular expression (the raw string keeps the backslash intact)
df.select('Thing').filter(col('Thing').rlike(r'\d+')).show()
'''
+---------+---+------------+Original Data+---------+---+------------+
df.show()
+-----------------+----+---------+---+------+
|            Thing|Hour|     Name|Age|Gender|
+-----------------+----+---------+---+------+
|    drinking,play|   2|   Carmen| 23|Female|
|     moving,music|   2| Juliette| 16|Female|
|     writing,draw|   2| Don José| 25|  Male|
|     sleeping,run|   2|Escamillo| 30|  Male|
|climb 3 mountains|   2|    Roméo| 18|  Male|
+-----------------+----+---------+---+------+
+---------+---+------------+Original Data+---------+---+------------+
+---------+---+------------+OUTPUT+---------+---+------------+
df.select('Thing').filter(col('Thing').rlike('ing')).show()
+-------------+
|        Thing|
+-------------+
|drinking,play|
| moving,music|
| writing,draw|
| sleeping,run|
+-------------+
+---------+---+------------+OUTPUT+---------+---+------------+
+---------+---+------------+OUTPUT+---------+---+------------+
df.select('Thing').filter(col('Thing').rlike('\d+')).show()
+-----------------+
|            Thing|
+-----------------+
|climb 3 mountains|
+-----------------+
+---------+---+------------+OUTPUT+---------+---+------------+
'''
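One thing that is easy to miss: rlike succeeds when the pattern is found anywhere in the value, so if you want the whole value to match you have to anchor the pattern with ^ and $. Below is a minimal sketch of that difference. The function rlike() here is a hypothetical stand-in built on Python's re.search (not the PySpark method itself), and "2023" is a made-up value added for contrast.

```python
import re


def rlike(text, pattern):
    """Hypothetical stand-in for Column.rlike: True if the pattern is found anywhere."""
    return re.search(pattern, text) is not None


# "2023" is a made-up value that consists of digits only
things = ["drinking,play", "climb 3 mountains", "2023"]

for pattern in (r"\d+", r"^\d+$", r"^climb"):
    matched = [t for t in things if rlike(t, pattern)]
    print(pattern, "->", matched)
```

In PySpark the anchored version would be col('Thing').rlike(r'^\d+$'), which should keep only values made up entirely of digits.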
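Make good use of regular expressions, set the arguments properly, and data processing becomes easy!
If anything is unclear or wrong, or if you have another approach to share, feel free to leave me a comment! And if you enjoyed this, likes and subscriptions are welcome too!
I'm Vivi, a data engineer struggling in the cloud! See you in the next post! Bye Bye~
【This article will also be posted on my personal Medium. I look forward to meeting you there!】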