iT邦幫忙

2023 iThome Ironman Contest (鐵人賽)

DAY 4
AI & Data

"Learning PySpark in 30 Days of Messing Around" series, part 4

[ Day 4 ] - Pyspark | Introduction - DataFrame - Select

They say it takes 21 days to build a good habit and only 3 days to give one up, but here I am again, right on schedule. So what's your choice today?
Today we're going to talk about choosing.
No wait, I mean select()!

They're not that different, really. In short, this is a function for picking out the columns you need.
Let's Start!

1. select()

If you've written SQL, this should feel familiar. Yes, it's the same concept: PySpark's select() works just like SQL's SELECT. If you're already very comfortable with databases, feel free to skip straight to the next article XD

The previous article gave a quick introduction to show(), so you should be familiar with its parameters by now!
But when what you want to adjust is not the number of rows displayed but the set of columns, select() is the tool for the job.

Scenario:
Use it to pick out just the columns you need. It helps a lot both when processing data and when checking the details!

# sc is an existing SparkContext (e.g. spark.sparkContext)
rdd = sc.parallelize([("drink", 2, "Carmen", 23), ("movie", 2, "Carmen", 23), ("writing", 2, "Don José", 25), ("sleep", 2, "Escamillo", 30), ("play", 2, "Escamillo", 30)])
df = rdd.toDF(["Thing", "Hour", "Name", "Age"])
df.show()
df.select(df.Name, df.Age).show()
'''
+---------+---+------------+Original Data+---------+---+------------+
df.show()
+-------+----+---------+---+
|  Thing|Hour|     Name|Age|
+-------+----+---------+---+
|  drink|   2|   Carmen| 23|
|  movie|   2|   Carmen| 23|
|writing|   2| Don José| 25|
|  sleep|   2|Escamillo| 30|
|   play|   2|Escamillo| 30|
+-------+----+---------+---+
+---------+---+------------+Original Data+---------+---+------------+

+---------+---+------------+OUTPUT+---------+---+------------+
df.select(df.Name,df.Age).show()
+---------+---+
|     Name|Age|
+---------+---+
|   Carmen| 23|
|   Carmen| 23|
| Don José| 25|
|Escamillo| 30|
|Escamillo| 30|
+---------+---+
+---------+---+------------+OUTPUT+---------+---+------------+
'''

1-1. select('*') - how about more choices?

This one is really easy to understand XD
A quick comparison should make it clear:
pyspark : df.show() = df.select('*').show()
SQL : SELECT * FROM table

Scenario:
When you need every column, you're not going to type them out one by one,
so here's another little trick - select('*')

rdd = sc.parallelize([("drink", 2, "Carmen",23),("movie", 2, "Carmen",23), ("writing", 2, "Don José",25), ("sleep", 2, "Escamillo",30),("play", 2, "Escamillo",30)])
df = rdd.toDF(["Thing", "Hour", "Name","Age"])
df.show()
df.select('*').show()
'''
+---------+---+------------+Original Data+---------+---+------------+
df.show()
+-------+----+---------+---+
|  Thing|Hour|     Name|Age|
+-------+----+---------+---+
|  drink|   2|   Carmen| 23|
|  movie|   2|   Carmen| 23|
|writing|   2| Don José| 25|
|  sleep|   2|Escamillo| 30|
|   play|   2|Escamillo| 30|
+-------+----+---------+---+
+---------+---+------------+Original Data+---------+---+------------+

+---------+---+------------+OUTPUT+---------+---+------------+
df.select('*').show()
+-------+----+---------+---+
|  Thing|Hour|     Name|Age|
+-------+----+---------+---+
|  drink|   2|   Carmen| 23|
|  movie|   2|   Carmen| 23|
|writing|   2| Don José| 25|
|  sleep|   2|Escamillo| 30|
|   play|   2|Escamillo| 30|
+-------+----+---------+---+
+---------+---+------------+OUTPUT+---------+---+------------+
'''

1-2. select() - select + compute?

Scenario:
When you need to do a little math along the way XD

# Note: importing these names shadows Python's built-in sum/max/min
from pyspark.sql.functions import mean, avg, sum, max, min

rdd = sc.parallelize([("drink", 2, "Carmen", 23), ("movie", 2, "Carmen", 23), ("writing", 2, "Don José", 25), ("sleep", 2, "Escamillo", 30), ("play", 2, "Escamillo", 30)])
df = rdd.toDF(["Thing", "Hour", "Name", "Age"])
df.show()
df.select(df.Name, (df.Age + 10).alias('age')).show()
df.select(mean('Age'), avg('Age'), sum('Age'), max('Age'), min('Age')).show()
'''
+---------+---+------------+Original Data+---------+---+------------+
df.show()
+-------+----+---------+---+
|  Thing|Hour|     Name|Age|
+-------+----+---------+---+
|  drink|   2|   Carmen| 23|
|  movie|   2|   Carmen| 23|
|writing|   2| Don José| 25|
|  sleep|   2|Escamillo| 30|
|   play|   2|Escamillo| 30|
+-------+----+---------+---+
+---------+---+------------+Original Data+---------+---+------------+

+---------+---+------------+OUTPUT[1]+---------+---+------------+
df.select(df.Name, (df.Age + 10).alias('age')).show()
+---------+---+
|     Name|age|
+---------+---+
|   Carmen| 33|
|   Carmen| 33|
| Don José| 35|
|Escamillo| 40|
|Escamillo| 40|
+---------+---+
+---------+---+------------+OUTPUT[1]+---------+---+------------+

+---------+---+------------+OUTPUT[2]+---------+---+------------+
df.select(mean('Age'),avg('Age'),sum('Age'),max('Age'),min('Age')).show()
+--------+--------+--------+--------+--------+
|avg(Age)|avg(Age)|sum(Age)|max(Age)|min(Age)|
+--------+--------+--------+--------+--------+
|    26.2|    26.2|     131|      30|      23|
+--------+--------+--------+--------+--------+
+---------+---+------------+OUTPUT[2]+---------+---+------------+
'''

In short, this is the basic way to select data. We'll be pulling off plenty of select() combos in the articles ahead!

2. selectExpr()

Personally I think this is essentially the same as select(); which one you use just comes down to personal habit!
And of course it supports all kinds of combos too!

Scenario:
If you know and love SQL syntax, you'll find this one very easy to pick up!

rdd = sc.parallelize([("drink", 2, "Carmen",23),("movie", 2, "Carmen",23), ("writing", 2, "Don José",25), ("sleep", 2, "Escamillo",30),("play", 2, "Escamillo",30)])
df = rdd.toDF(["Thing", "Hour", "Name","Age"])
df.show()
df.selectExpr("sum(case when Age >= 25 then 1 else 0 end) as ageOver25").show()
df.selectExpr('mean(Age)','avg(Age)','sum(Age)','max(Age)','min(Age)').show()
'''
+---------+---+------------+Original Data+---------+---+------------+
df.show()
+-------+----+---------+---+
|  Thing|Hour|     Name|Age|
+-------+----+---------+---+
|  drink|   2|   Carmen| 23|
|  movie|   2|   Carmen| 23|
|writing|   2| Don José| 25|
|  sleep|   2|Escamillo| 30|
|   play|   2|Escamillo| 30|
+-------+----+---------+---+
+---------+---+------------+Original Data+---------+---+------------+

+---------+---+------------+OUTPUT[1]+---------+---+------------+
df.selectExpr("sum(case when Age >= 25 then 1 else 0 end) as ageOver25").show()
+---------+
|ageOver25|
+---------+
|        3|
+---------+
+---------+---+------------+OUTPUT[1]+---------+---+------------+

+---------+---+------------+OUTPUT[2]+---------+---+------------+
df.selectExpr('mean(Age)','avg(Age)','sum(Age)','max(Age)','min(Age)').show()
+---------+--------+--------+--------+--------+
|mean(Age)|avg(Age)|sum(Age)|max(Age)|min(Age)|
+---------+--------+--------+--------+--------+
|     26.2|    26.2|     131|      30|      23|
+---------+--------+--------+--------+--------+
+---------+---+------------+OUTPUT[2]+---------+---+------------+
'''

Conclusion:

Both methods can get you the result you want; which one you use mostly comes down to personal habit, and you can find the relevant clues in the official documentation too!

If anything is unclear or wrong, or you have other approaches to share, feel free to leave me a comment! If you like this series, please like and subscribe!

I'm Vivi, a data engineer struggling in the cloud! See you in the next article! Bye Bye~
【This article will also be published on my personal Medium. Hope to see you there!】


Previous
[ Day 3 ] - Pyspark | Introduction - DataFrame - Data Overview
Next
[ Day 5 ] - Pyspark | Introduction - DataFrame - Filter
Series
Learning PySpark in 30 Days of Messing Around (30 articles)