2023 iThome 鐵人賽, DAY 5
AI & Data / 「30天胡搞瞎搞學會pyspark」 series, part 5

[ Day 5 ] - Pyspark | Introduction - DataFrame - Filter
Filtering out some options gets you better information, and maybe even makes you a better person?
It's like when you're hunting for a truly excellent hair dryer: you set a few specific filter()s on the shopping site so you can quickly narrow down to the product that suits you best. For example, I recently picked up a high-end P-brand hair dryer, and it instantly made washing my hair a joy! Ahem, I digress.
In any case, let's take a look at how to use these functions~

For me, the typical scenario for these functions is data cleaning:
I need to pin down some specific data patterns to confirm that the cleaning I'm doing is working as intended.

That's when I reach for the functions below!
Alright, let's get started!

1. filter()

This one is very intuitive: you basically pick out only what you need, i.e. the plain concept of filtering!
So whenever we want to restrict the range of rows we select, filter() is usually the tool to use.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Assumes a running Spark environment; in a notebook, `spark` and `sc`
# usually exist already.
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(
    [
        ("drink", 2, "Carmen", 23),
        ("movie", 2, "Juliette", 16),
        ("writing", 2, "Don José", 25),
        ("sleep", 2, "Escamillo", 30),
        ("play", 2, "Roméo", 18),
    ]
)
df = rdd.toDF(["Thing", "Hour", "Name", "Age"])
df.show()
df.filter(col('Age') > 20).show()
'''
+---------+---+------------+Original Data+---------+---+------------+
df.show()
+-------+----+---------+---+
|  Thing|Hour|     Name|Age|
+-------+----+---------+---+
|  drink|   2|   Carmen| 23|
|  movie|   2| Juliette| 16|
|writing|   2| Don José| 25|
|  sleep|   2|Escamillo| 30|
|   play|   2|    Roméo| 18|
+-------+----+---------+---+
+---------+---+------------+Original Data+---------+---+------------+

+---------+---+------------+OUTPUT+---------+---+------------+
df.filter(col('Age')>20).show()
+-------+----+---------+---+
|  Thing|Hour|     Name|Age|
+-------+----+---------+---+
|  drink|   2|   Carmen| 23|
|writing|   2| Don José| 25|
|  sleep|   2|Escamillo| 30|
+-------+----+---------+---+
+---------+---+------------+OUTPUT+---------+---+------------+
'''

2. where()

Its usage is practically identical to filter()'s. No, it is identical. I call it the SQL folks' best friend: if you're a Data Engineer who moved over from relational data warehousing, this syntax will feel very familiar, since it's basically SQL's WHERE wearing a different skin.

Use case:
The SQL folks' best friend; quick to pick up; effect = filter().
And of course you can chain different conditions into combos XD

rdd = sc.parallelize(
    [
    ("drink", 2, "Carmen",23,'Female'),
     ("movie", 2, "Juliette",16,'Female'),
     ("writing", 2, "Don José",25,'Male'), 
     ("sleep", 2, "Escamillo",30,'Male'),
     ("play", 2, "Roméo",18,'Male')
    ]
)
df = rdd.toDF(["Thing", "Hour", "Name","Age",'Gender'])
df.show()
df.where(((col('Age')>20)&(col('Gender')=='Male'))
         |(col('Thing')=='sleep')).show()
'''
+---------+---+------------+Original Data+---------+---+------------+
df.show()
+-------+----+---------+---+------+
|  Thing|Hour|     Name|Age|Gender|
+-------+----+---------+---+------+
|  drink|   2|   Carmen| 23|Female|
|  movie|   2| Juliette| 16|Female|
|writing|   2| Don José| 25|  Male|
|  sleep|   2|Escamillo| 30|  Male|
|   play|   2|    Roméo| 18|  Male|
+-------+----+---------+---+------+
+---------+---+------------+Original Data+---------+---+------------+

+---------+---+------------+OUTPUT+---------+---+------------+
df.where(((col('Age')>20)&(col('Gender')=='Male'))
         |(col('Thing')=='sleep')).show()
+-------+----+---------+---+------+
|  Thing|Hour|     Name|Age|Gender|
+-------+----+---------+---+------+
|writing|   2| Don José| 25|  Male|
|  sleep|   2|Escamillo| 30|  Male|
+-------+----+---------+---+------+
+---------+---+------------+OUTPUT+---------+---+------------+
'''

3. limit()

This one is even easier to understand: it simply caps the number of rows you get back. I usually use it when I only need a small slice of data; it cuts execution time while still roughly covering the range I care about.

Use case:
Effectively limits the data volume, lets you verify the range you want to run, and shortens execution time. Especially handy during development and testing.

rdd = sc.parallelize(
    [
    ("drink", 2, "Carmen",23,'Female'),
     ("movie", 2, "Juliette",16,'Female'),
     ("writing", 2, "Don José",25,'Male'), 
     ("sleep", 2, "Escamillo",30,'Male'),
     ("play", 2, "Roméo",18,'Male')
    ]
)
df = rdd.toDF(["Thing", "Hour", "Name","Age",'Gender'])
df.show()
df.limit(2).show()
'''
+---------+---+------------+Original Data+---------+---+------------+
df.show()
+-------+----+---------+---+------+
|  Thing|Hour|     Name|Age|Gender|
+-------+----+---------+---+------+
|  drink|   2|   Carmen| 23|Female|
|  movie|   2| Juliette| 16|Female|
|writing|   2| Don José| 25|  Male|
|  sleep|   2|Escamillo| 30|  Male|
|   play|   2|    Roméo| 18|  Male|
+-------+----+---------+---+------+
+---------+---+------------+Original Data+---------+---+------------+

+---------+---+------------+OUTPUT+---------+---+------------+
df.limit(2).show()
+-----+----+--------+---+------+
|Thing|Hour|    Name|Age|Gender|
+-----+----+--------+---+------+
|drink|   2|  Carmen| 23|Female|
|movie|   2|Juliette| 16|Female|
+-----+----+--------+---+------+
+---------+---+------------+OUTPUT+---------+---+------------+

+---------+---+------------+OUTPUT+---------+---+------------+
df.limit(0).show()
+-----+----+----+---+------+
|Thing|Hour|Name|Age|Gender|
+-----+----+----+---+------+
+-----+----+----+---+------+
+---------+---+------------+OUTPUT+---------+---+------------+
'''

4. first()

Yep, it just grabs the first row.
Yep, you probably won't use it much in day-to-day work, but it's a quick way to sanity-check the shape of your data.

rdd = sc.parallelize(
    [
    ("drink", 2, "Carmen",23,'Female'),
     ("movie", 2, "Juliette",16,'Female'),
     ("writing", 2, "Don José",25,'Male'), 
     ("sleep", 2, "Escamillo",30,'Male'),
     ("play", 2, "Roméo",18,'Male')
    ]
)
df = rdd.toDF(["Thing", "Hour", "Name","Age",'Gender'])
df.show()
df.first()
'''
+---------+---+------------+Original Data+---------+---+------------+
df.show()
+-------+----+---------+---+------+
|  Thing|Hour|     Name|Age|Gender|
+-------+----+---------+---+------+
|  drink|   2|   Carmen| 23|Female|
|  movie|   2| Juliette| 16|Female|
|writing|   2| Don José| 25|  Male|
|  sleep|   2|Escamillo| 30|  Male|
|   play|   2|    Roméo| 18|  Male|
+-------+----+---------+---+------+
+---------+---+------------+Original Data+---------+---+------------+

+---------+---+------------+OUTPUT+---------+---+------------+
df.first()
Row(Thing='drink', Hour=2, Name='Carmen', Age=23, Gender='Female')
+---------+---+------------+OUTPUT+---------+---+------------+
'''

If anything is unclear or wrong, or you have other approaches to share, feel free to leave a comment! And if you like this series, likes and subscriptions are always welcome!

I'm Vivi, a data engineer struggling away in the cloud! See you in the next article! Bye bye~
[This article will also be published on my personal Medium. Hope to see you there!]

