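Filtering your choices gets you better information, and maybe even makes you a better person?
It's like shopping for a really good hair dryer: you'd apply a few specific filter() settings on the shopping site
so you can find the product that suits you faster. I recently picked up a high-end hair dryer from brand P, and it instantly turned washing my hair into a joy! Ahem, I digress.
Anyway, let's look at how to use these functions~
For me, the use case for these functions is always Data Cleaning:
I need to pin down certain specific data patterns and confirm that the cleaning I'm doing is successful and effective.
That's when I reach for the functions below!
So, let's get started!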
filter()
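This one is intuitive: you keep only the rows you need, which is exactly the idea of filtering.
So whenever we want to narrow down the rows we select, filter() is the go-to.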
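from pyspark.sql.functions import col  # needed for col() below

# sc is assumed to be an existing SparkContext (e.g. from the pyspark shell)
rdd = sc.parallelize(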
[
("drink", 2, "Carmen",23),
("movie", 2, "Juliette",16),
("writing", 2, "Don José",25),
("sleep", 2, "Escamillo",30),
("play", 2, "Roméo",18)
]
)
df = rdd.toDF(["Thing", "Hour", "Name","Age"])
df.show()
df.filter(col('Age')>20).show()
'''
+---------+---+------------+Original Data+---------+---+------------+
df.show()
+-------+----+---------+---+
|  Thing|Hour|     Name|Age|
+-------+----+---------+---+
|  drink|   2|   Carmen| 23|
|  movie|   2| Juliette| 16|
|writing|   2| Don José| 25|
|  sleep|   2|Escamillo| 30|
|   play|   2|    Roméo| 18|
+-------+----+---------+---+
+---------+---+------------+Original Data+---------+---+------------+
+---------+---+------------+OUTPUT+---------+---+------------+
df.filter(col('Age')>20).show()
+-------+----+---------+---+
|  Thing|Hour|     Name|Age|
+-------+----+---------+---+
|  drink|   2|   Carmen| 23|
|writing|   2| Don José| 25|
|  sleep|   2|Escamillo| 30|
+-------+----+---------+---+
+---------+---+------------+OUTPUT+---------+---+------------+
'''
where()
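Its usage is practically identical to filter(). No wait, it IS identical: where() is an alias for filter(). I call it the SQL buddy. If you moved over from being a Data Engineer working on relational data warehouses, this syntax should feel very familiar,
because it's basically SQL's WHERE in a new skin.
Use case:
SQL buddy, quick to pick up, same effect as filter()
Of course, you can also chain several conditions together into a combo XD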
rdd = sc.parallelize(
[
("drink", 2, "Carmen",23,'Female'),
("movie", 2, "Juliette",16,'Female'),
("writing", 2, "Don José",25,'Male'),
("sleep", 2, "Escamillo",30,'Male'),
("play", 2, "Roméo",18,'Male')
]
)
df = rdd.toDF(["Thing", "Hour", "Name","Age",'Gender'])
df.show()
df.where(((col('Age')>20)&(col('Gender')=='Male'))
|(col('Thing')=='sleep')).show()
'''
+---------+---+------------+Original Data+---------+---+------------+
df.show()
+-------+----+---------+---+------+
|  Thing|Hour|     Name|Age|Gender|
+-------+----+---------+---+------+
|  drink|   2|   Carmen| 23|Female|
|  movie|   2| Juliette| 16|Female|
|writing|   2| Don José| 25|  Male|
|  sleep|   2|Escamillo| 30|  Male|
|   play|   2|    Roméo| 18|  Male|
+-------+----+---------+---+------+
+---------+---+------------+Original Data+---------+---+------------+
+---------+---+------------+OUTPUT+---------+---+------------+
df.where(((col('Age')>20)&(col('Gender')=='Male'))
|(col('Thing')=='sleep')).show()
+-------+----+---------+---+------+
|  Thing|Hour|     Name|Age|Gender|
+-------+----+---------+---+------+
|writing|   2| Don José| 25|  Male|
|  sleep|   2|Escamillo| 30|  Male|
+-------+----+---------+---+------+
+---------+---+------------+OUTPUT+---------+---+------------+
'''
limit()
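This one is even easier to understand: it caps the number of rows you get back. I usually use it when I only want to work on a small slice of the data; it cuts execution time and gives you a rough sample of what you're after. Keep in mind that without an explicit sort, which rows you get is not guaranteed.
Use case:
Effectively caps the data volume and lets you sanity-check the range you want to run while reducing execution time; mostly used during development and testing.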
rdd = sc.parallelize(
[
("drink", 2, "Carmen",23,'Female'),
("movie", 2, "Juliette",16,'Female'),
("writing", 2, "Don José",25,'Male'),
("sleep", 2, "Escamillo",30,'Male'),
("play", 2, "Roméo",18,'Male')
]
)
df = rdd.toDF(["Thing", "Hour", "Name","Age",'Gender'])
df.show()
df.limit(2).show()
df.limit(0).show()  # zero rows: only the header is printed
'''
+---------+---+------------+Original Data+---------+---+------------+
df.show()
+-------+----+---------+---+------+
|  Thing|Hour|     Name|Age|Gender|
+-------+----+---------+---+------+
|  drink|   2|   Carmen| 23|Female|
|  movie|   2| Juliette| 16|Female|
|writing|   2| Don José| 25|  Male|
|  sleep|   2|Escamillo| 30|  Male|
|   play|   2|    Roméo| 18|  Male|
+-------+----+---------+---+------+
+---------+---+------------+Original Data+---------+---+------------+
+---------+---+------------+OUTPUT+---------+---+------------+
df.limit(2).show()
+-----+----+--------+---+------+
|Thing|Hour|    Name|Age|Gender|
+-----+----+--------+---+------+
|drink|   2|  Carmen| 23|Female|
|movie|   2|Juliette| 16|Female|
+-----+----+--------+---+------+
+---------+---+------------+OUTPUT+---------+---+------------+
+---------+---+------------+OUTPUT+---------+---+------------+
df.limit(0).show()
+-----+----+----+---+------+
|Thing|Hour|Name|Age|Gender|
+-----+----+----+---+------+
+-----+----+----+---+------+
+---------+---+------------+OUTPUT+---------+---+------------+
'''
first()
Well, it simply takes the first row.
I don't use it much day to day, but it's a quick way to check what the data looks like.
rdd = sc.parallelize(
[
("drink", 2, "Carmen",23,'Female'),
("movie", 2, "Juliette",16,'Female'),
("writing", 2, "Don José",25,'Male'),
("sleep", 2, "Escamillo",30,'Male'),
("play", 2, "Roméo",18,'Male')
]
)
df = rdd.toDF(["Thing", "Hour", "Name","Age",'Gender'])
df.show()
print(df.first())  # first() returns a single Row, not a DataFrame
'''
+---------+---+------------+Original Data+---------+---+------------+
df.show()
+-------+----+---------+---+------+
|  Thing|Hour|     Name|Age|Gender|
+-------+----+---------+---+------+
|  drink|   2|   Carmen| 23|Female|
|  movie|   2| Juliette| 16|Female|
|writing|   2| Don José| 25|  Male|
|  sleep|   2|Escamillo| 30|  Male|
|   play|   2|    Roméo| 18|  Male|
+-------+----+---------+---+------+
+---------+---+------------+Original Data+---------+---+------------+
+---------+---+------------+OUTPUT+---------+---+------------+
df.first()
Row(Thing='drink', Hour=2, Name='Carmen', Age=23, Gender='Female')
+---------+---+------------+OUTPUT+---------+---+------------+
'''
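If anything is unclear or wrong, or you have another approach to share, feel free to leave me a comment! And if you liked this post, a like and a subscribe are welcome too!
I'm Vivi, a data engineer struggling in the cloud! See you in the next post! Bye Bye~
[This post will also be published on my personal Medium. Looking forward to meeting you there!]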