Day 15 [Python ML, Pandas] Summary Functions and Maps

2021 iThome Ironman Contest

DAY 15

AI & Data

Learning Machine Learning with Python series, part 15

13th Ironman Contest

guancioul

Team: 人工逗點智慧

2021-09-28 13:38:04


import pandas as pd
pd.set_option('display.max_rows', 5)
import numpy as np
reviews = pd.read_csv("./winemag-data-130k-v2.csv", index_col=0)

Summary functions

pandas provides a number of summary functions that condense a column into a few descriptive values.

One example is describe():

reviews.points.describe()

count    129971.000000
mean         88.447138
             ...      
75%          91.000000
max         100.000000
Name: points, Length: 8, dtype: float64

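The output above is the summary produced for a numeric column.

What do we get if we pass in a categorical (string) column instead? For such a column, describe() reports count, unique, top and freq: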

reviews.taster_name.describe()

count         103727
unique            19
top       Roger Voss
freq           25514
Name: taster_name, dtype: object

To get only one particular statistic, call the corresponding method directly:

reviews.points.mean()

88.44713820775404
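
As a minimal sketch of the relationship between describe() and the individual statistic methods, the example below uses a small made-up Series (the values are hypothetical, not taken from the wine dataset). It shows that describe() returns a Series whose entries can be looked up by label.

```python
import pandas as pd

# Hypothetical toy scores standing in for reviews.points.
points = pd.Series([87, 87, 90, 90], name="points")

# describe() returns a Series indexed by statistic name,
# so a single entry can be looked up by its label.
summary = points.describe()
print(summary["mean"])

# Each statistic also has its own method.
print(points.mean(), points.median(), points.min(), points.max())
```

Calling the individual method is usually the simpler choice when only one number is needed.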

To see which distinct values a categorical column contains, use the unique() function:

reviews.taster_name.unique()

array(['Kerin O’Keefe', 'Roger Voss', 'Paul Gregutt',
       'Alexander Peartree', 'Michael Schachner', 'Anna Lee C. Iijima',
       'Virginie Boone', 'Matt Kettmann', nan, 'Sean P. Sullivan',
       'Jim Gordon', 'Joe Czerwinski', 'Anne Krebiehl\xa0MW',
       'Lauren Buzzeo', 'Mike DeSimone', 'Jeff Jenssen',
       'Susan Kostrzewa', 'Carrie Dykes', 'Fiona Adams',
       'Christina Pickard'], dtype=object)

To see how often each distinct value occurs in the data, use the value_counts() function.
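
The sketch below illustrates value_counts() on a small made-up Series. The names and counts are hypothetical toy data, not the real taster_name column.

```python
import pandas as pd

# Toy stand-in for the taster_name column (hypothetical values).
tasters = pd.Series(
    ["Roger Voss", "Paul Gregutt", "Roger Voss", None, "Roger Voss", "Paul Gregutt"],
    name="taster_name",
)

# value_counts() returns a Series indexed by the distinct values,
# sorted from most to least frequent; missing values are skipped by default.
counts = tasters.value_counts()
print(counts)

# Pass dropna=False to count missing entries as well.
counts_with_nan = tasters.value_counts(dropna=False)
print(counts_with_nan)
```

On the real data the equivalent call would be reviews.taster_name.value_counts().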

Maps

A map is a function that takes each value from one set and transforms it into a value in another set.

review_points_mean = reviews.points.mean()
reviews.points.map(lambda p: p - review_points_mean)

0        -1.447138
1        -1.447138
            ...   
129969    1.552862
129970    1.552862
Name: points, Length: 129971, dtype: float64

First, the mean of points is stored in review_points_mean.

Then the lambda receives each points value as p, and the results of p - review_points_mean are collected into a new Series.

map() works on a single Series. To transform the whole DataFrame by calling a custom function on each row, use the apply() function:

def remean_points(row):
    row.points = row.points - review_points_mean
    return row

reviews.apply(remean_points, axis='columns')


apply() defaults to axis=0 (or axis='index'), which passes each column to the function in turn.

To transform each row instead, use axis=1 (or axis='columns').
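
The difference between the two axis values can be sketched on a tiny made-up DataFrame. The column values here are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
import pandas as pd

# Hypothetical two-row frame with two numeric columns.
df = pd.DataFrame({"points": [87, 90], "price": [15.0, 30.0]})

# axis=0 / 'index' (the default): the function receives each column as a Series,
# so the result has one entry per column.
col_max = df.apply(lambda col: col.max())
print(col_max)

# axis=1 / 'columns': the function receives each row as a Series,
# so the result has one entry per row.
row_ratio = df.apply(lambda row: row.points / row.price, axis="columns")
print(row_ratio)
```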

map() returns a new Series, and apply(), used as above, returns a new DataFrame.

Neither of them modifies the original data.

reviews.head(1)
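
The reviews.head(1) call above is meant to confirm that the first row is unchanged. The same check can be sketched on a toy DataFrame (hypothetical values). The helper below copies the row before editing it, which is an assumption of this sketch rather than the article's exact function.

```python
import pandas as pd

# Hypothetical toy frame standing in for reviews.
df = pd.DataFrame({"country": ["Italy", "France"], "points": [87, 90]})
points_mean = df.points.mean()  # 88.5

def remean_points(row):
    # Work on a copy so the function itself never touches df.
    row = row.copy()
    row["points"] = row["points"] - points_mean
    return row

centered_series = df.points.map(lambda p: p - points_mean)  # new Series
centered_frame = df.apply(remean_points, axis="columns")    # new DataFrame

print(df.points.tolist())  # the original column is unchanged

# To keep a result, assign it explicitly.
df["points_centered"] = centered_series
print(df)
```

Assigning to a new column keeps the original points values available for later steps.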


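Some built-in operators are faster than map() and apply().

However, they are less flexible than those two functions, which can run arbitrary logic such as conditionals.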

review_points_mean = reviews.points.mean()
reviews.points - review_points_mean

0        -1.447138
1        -1.447138
            ...   
129969    1.552862
129970    1.552862
Name: points, Length: 129971, dtype: float64

reviews.country + " - " + reviews.region_1

0            Italy - Etna
1                     NaN
               ...       
129969    France - Alsace
129970    France - Alsace
Length: 129971, dtype: object
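
As a rough sketch on made-up data (the rows below are hypothetical), the example checks that the operator form gives the same result as map(), and shows why a row with a missing region_1 ends up as NaN after string concatenation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the second row has no region_1.
df = pd.DataFrame({
    "country": ["Italy", "Portugal", "France"],
    "region_1": ["Etna", np.nan, "Alsace"],
    "points": [87, 87, 90],
})

points_mean = df.points.mean()  # 88.0

# The operator form and the map() form produce the same Series.
via_operator = df.points - points_mean
via_map = df.points.map(lambda p: p - points_mean)
print(via_operator.equals(via_map))

# Element-wise string concatenation; a missing value on either side
# makes the combined entry missing as well.
combined = df.country + " - " + df.region_1
print(combined)
```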