Day 14 [Python ML, Pandas] Indexing, Selecting, and Assigning

2021 iThome Ironman Contest

DAY 14

AI & Data

Learning Machine Learning with Python series, part 14

13th Ironman Contest

guancioul

Team 人工逗點智慧

2021-09-27 18:00:31

Introduction

To make the data easier to work with, this post covers how to select and slice the parts of a DataFrame we need.

import pandas as pd
reviews = pd.read_csv("./winemag-data-130k-v2.csv", index_col=0)
pd.set_option('display.max_rows', 5)

reviews

Native accessors

pandas lets us pull out a specific column as an attribute of the DataFrame.

To get the country data, all we need is reviews.country:

reviews.country

0            Italy
1         Portugal
            ...   
129969      France
129970      France
Name: country, Length: 129971, dtype: object

The same column can also be retrieved with square brackets:

reviews['country']

0            Italy
1         Portugal
            ...   
129969      France
129970      France
Name: country, Length: 129971, dtype: object

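If a column name contains a space, attribute access no longer works. With a column named, say, country providence, reviews.country providence is not valid Python.

Use reviews['country providence'] instead.

Once we have the column, a second [] picks a single value out of it: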

reviews['country'][0]

'Italy'
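The two accessors can be compared side by side. The following is a minimal sketch that assumes a tiny made-up DataFrame (df and its values are illustrative, not the real wine dataset):

```python
import pandas as pd

# Hypothetical miniature stand-in for the reviews table.
df = pd.DataFrame({
    "country": ["Italy", "Portugal", "US"],
    "country providence": ["Sicily", "Douro", "Oregon"],
})

by_attribute = df.country          # attribute access
by_brackets = df["country"]        # bracket access returns the same Series
spaced = df["country providence"]  # only brackets work when the name has a space
first_country = df["country"][0]   # select the column, then the value at label 0

print(by_attribute.equals(by_brackets))
print(first_country)
```

Bracket access is the safer default, since it also works for names that contain spaces or clash with existing DataFrame attributes.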

Indexing in pandas

pandas has its own accessor operators for selecting data: loc and iloc.

Index-based selection

Index-based selection picks data by its numerical position.

To select a single row, use iloc:

reviews.iloc[0]

country                                                    Italy
description    Aromas include tropical fruit, broom, brimston...
                                     ...                        
variety                                              White Blend
winery                                                   Nicosia
Name: 0, Length: 13, dtype: object

Both loc and iloc are row-first, column-second.

To get the first column, use the following:

reviews.iloc[:, 0]

0            Italy
1         Portugal
            ...   
129969      France
129970      France
Name: country, Length: 129971, dtype: object

In Python, the : operator on its own means "everything", so here it selects all rows.

To get only the first three values of that column:

reviews.iloc[:3, 0]

0       Italy
1    Portugal
2          US
Name: country, dtype: object

To select only the rows at positions 1 and 2:

reviews.iloc[1:3, 0]

1    Portugal
2          US
Name: country, dtype: object

A list can also be passed as the row selector:

reviews.iloc[[0, 1, 2], 0]

0       Italy
1    Portugal
2          US
Name: country, dtype: object

Negative numbers count from the end, so the last five rows are:

reviews.iloc[-5:]
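The iloc patterns above can be tried on a small table. This sketch assumes a made-up five-row DataFrame with string index labels, chosen to show that iloc looks only at positions and ignores the labels:

```python
import pandas as pd

# Hypothetical sample data; the index labels are letters on purpose.
df = pd.DataFrame(
    {
        "country": ["Italy", "Portugal", "US", "US", "France"],
        "points": [87, 87, 87, 87, 90],
    },
    index=["a", "b", "c", "d", "e"],
)

first_row = df.iloc[0]          # first row as a Series
first_col = df.iloc[:, 0]       # every row of the first column
head3 = df.iloc[:3, 0]          # positions 0, 1, 2 of the first column
middle = df.iloc[1:3, 0]        # positions 1 and 2 (the end is excluded)
picked = df.iloc[[0, 1, 2], 0]  # explicit list of positions
last2 = df.iloc[-2:]            # last two rows

print(list(middle))
print(list(last2.index))
```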

Label-based selection

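Label-based selection looks data up by its index labels rather than by its position.

loc works like iloc, except that the values passed in are labels: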

reviews.loc[0, 'country']

'Italy'

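This returns the value of the country column for the row whose index label is 0.

A list of column labels selects several columns at once: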

reviews.loc[:, ['taster_name', 'taster_twitter_handle', 'points']]

Specific columns can also be selected with plain brackets, without loc.

Ordinary indexing like this is column-first, row-second.

With iloc and loc it is row-first, column-second.

reviews[['taster_name', 'taster_twitter_handle', 'points']]
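To check that the two spellings select the same columns, here is a small sketch. The DataFrame and its values are made up for illustration:

```python
import pandas as pd

# Hypothetical sample data with the default integer index (labels 0, 1, 2).
df = pd.DataFrame({
    "country": ["Italy", "Portugal", "US"],
    "points": [87, 87, 90],
    "taster_name": ["taster A", "taster B", "taster C"],
})

cell = df.loc[0, "country"]                     # row label 0, column "country"
via_loc = df.loc[:, ["taster_name", "points"]]  # row-first, column-second
via_brackets = df[["taster_name", "points"]]    # plain brackets select columns

print(cell)
print(via_loc.equals(via_brackets))
```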

Choosing between `loc` and `iloc`

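The two accessors differ slightly in how they treat the end of a range.

df.iloc[0:1000] returns 1000 rows (positions 0 to 999), because iloc excludes the end of the range, as ordinary Python slicing does.

df.loc[0:1000] returns 1001 rows (labels 0 to 1000, assuming the default integer index), because loc includes the end label.

Which one to use depends on the situation: iloc when thinking in positions, loc when thinking in labels.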
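The difference in row counts is easy to confirm. This sketch assumes a made-up 2000-row DataFrame with the default integer index, plus a small Series with string labels to show that loc slices include their end label:

```python
import pandas as pd

# Hypothetical data: 2000 rows labeled 0..1999.
df = pd.DataFrame({"value": range(2000)})

by_position = df.iloc[0:1000]  # positions 0..999, end excluded
by_label = df.loc[0:1000]      # labels 0..1000, end included

# With string labels, an inclusive end is usually what we want.
s = pd.Series([1, 2, 3], index=["apples", "bananas", "cherries"])
inclusive = s.loc["apples":"bananas"]

print(len(by_position), len(by_label))
print(list(inclusive.index))
```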

Manipulating the index

set_index replaces the current index with a column that suits the data better, here title:

reviews.set_index("title")
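One detail worth checking is that set_index returns a new DataFrame by default and leaves the original untouched. A minimal sketch, assuming a made-up two-row table (the titles are invented):

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "title": ["Wine A 2013", "Wine B 2011"],
    "points": [87, 90],
})

indexed = df.set_index("title")  # new DataFrame; df itself is unchanged
points_b = indexed.loc["Wine B 2011", "points"]  # look up a row by its title

print(list(indexed.index))
print(points_b)
```

To keep the new index, assign the result back, for example df = df.set_index("title").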

Conditional selection

A comparison tells us, row by row, whether country is Italy:

reviews.country == 'Italy'

0          True
1         False
          ...  
129969    False
129970    False
Name: country, Length: 129971, dtype: bool

The resulting boolean Series can be passed to loc to pull out every row where country is Italy:

reviews.loc[reviews.country == 'Italy']

If that is still too much data, the selection can be narrowed further.

Conditions can be combined inside loc with &:

reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)]

To keep the rows where country is Italy or points is at least 90, use | instead:

reviews.loc[(reviews.country == 'Italy') | (reviews.points >= 90)]
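The effect of & versus | is easier to see on a few rows. This sketch assumes a made-up four-row DataFrame. The parentheses around each comparison matter, because & and | bind more tightly than == and >= in Python:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "country": ["Italy", "Italy", "France", "US"],
    "points": [92, 85, 91, 80],
})

is_italy = df.country == "Italy"  # boolean Series, one value per row
is_high = df.points >= 90

both = df.loc[is_italy & is_high]    # rows matching both conditions
either = df.loc[is_italy | is_high]  # rows matching at least one condition

print(list(both.index))
print(list(either.index))
```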

The isin method selects the rows whose value appears in a given list:

reviews.loc[reviews.country.isin(['Italy', 'France'])]

The notnull method selects the rows whose value is not NaN:

reviews.loc[reviews.price.notnull()]
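Both helpers return boolean masks, just like the comparisons earlier. A minimal sketch, assuming a made-up four-row DataFrame with one missing price:

```python
import pandas as pd

# Hypothetical sample data; None becomes NaN in the float column.
df = pd.DataFrame({
    "country": ["Italy", "France", "US", "France"],
    "price": [15.0, None, 14.0, 30.0],
})

italy_or_france = df.loc[df.country.isin(["Italy", "France"])]
priced = df.loc[df.price.notnull()]  # drops the row with the missing price

print(list(italy_or_france.index))
print(list(priced.index))
```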

Assigning data

Assigning data in pandas is straightforward. A constant value is broadcast to every row:

reviews['critic'] = 'everyone'
reviews['critic']

0         everyone
1         everyone
            ...   
129969    everyone
129970    everyone
Name: critic, Length: 129971, dtype: object

An iterable of values, one per row, works as well:

reviews['index_backwards'] = range(len(reviews), 0, -1)
reviews['index_backwards']

0         129971
1         129970
           ...  
129969         2
129970         1
Name: index_backwards, Length: 129971, dtype: int64
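Both kinds of assignment can be verified on a small table. This sketch assumes a made-up three-row DataFrame:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"country": ["Italy", "Portugal", "US"]})

df["critic"] = "everyone"                      # one constant for every row
df["index_backwards"] = range(len(df), 0, -1)  # one value per row: 3, 2, 1

print(df)
```

An iterable must have exactly as many values as the DataFrame has rows. Otherwise pandas raises a ValueError.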
