iT邦幫忙

1

dataFrame技巧九式- 讀、寫

dataFrame技巧總介紹-
python之pandas模組dataFrame九大技巧總覽- 增刪查改讀寫切併排

csv, excel是常見的檔案類型,
這邊學習怎麼讀寫用python的pandas模組讀寫csv, excel檔

定義: 從檔案讀取資料轉為dataFrame的格式,例如讀取excel, csv檔

定義: 將dataFrame存為常見資料檔案格式,例如excel, csv檔

資料準備- 隨機創建資料表

我們用隨機生成的資料表供練習,
比如說:

import random
import pandas as pd
import numpy as np

def random_str(length=3, miss_rate = 0):
    assert 0 <= miss_rate <= 1, 'probability must in [0,1]'
    miss = random.choices(range(0,2), weights= [1-miss_rate, miss_rate])[0]
    return None if miss else ''.join([chr(ord('a')+random.randrange(26)) for _ in range(length)])

def random_num(low, high, miss_rate = 0):
    # miss_rate between 0,1
    assert 0 <= miss_rate <= 1, 'probability must in [0,1]'
    miss = random.choices(range(0,2), weights= [1-miss_rate, miss_rate])[0]
    return np.nan if miss else random.randint(low,high)
        
    print(miss)

def random_df(num):
    department = ['A','B','C', None]
    data = {"student_id":[random_num(10000,20000) for _ in range(num)],
            "student_name":[random_str() for _ in range(num)],
             "department":[random.choice(department) for _ in range(num)]}
    return pd.DataFrame.from_dict(data)

random.seed(30) #為了確保每次執行程式生成的資料相同,此設固定的random seed
df1 = random_df(5) 
print("原始資料:")
print(df1)

結果:

原始資料:
   student_id student_name department
0       14738          qha          B
1       13440          ftq          C
2       10794          ulr          A
3       12196          mah          C
4       17561          viy          B

寫csv檔、excel檔範例

df1.to_excel('table.xls')
df1.to_csv('data.csv', encoding = 'big5')
df1.to_csv('data_no_index.csv', encoding = 'utf-8', index = False)

這邊最重要的兩個參數為encodingindex
encoding為編碼方式,
如果資料裡面含中文字元的話,
可以存big5編碼,用excel打開才不會是亂碼。

index預設是True
如果設成False就不會把index存起來,如圖示:

https://ithelp.ithome.com.tw/upload/images/20200906/20117114NcynSGId4M.png

個人建議是index設成False即可

讀csv檔、excel檔範例

df2 = pd.read_excel('table.xls')
print(df2)
df3 = pd.read_csv('data.csv', encoding='big5')
print(df3)
df4 = pd.read_csv('data_no_index.csv', encoding='utf-8')
print(df4)

這邊直接拿剛剛寫好的檔案為範例,
注意read_csv的encoding需要選擇對應的編碼才有辦法讀csv檔

結果:

   student_id student_name department
0       14738          qha          B
1       13440          ftq          C
2       10794          ulr          A
3       12196          mah          C
4       17561          viy          B

   Unnamed: 0  student_id student_name department
0           0       14738          qha          B
1           1       13440          ftq          C
2           2       10794          ulr          A
3           3       12196          mah          C
4           4       17561          viy          B

   student_id student_name department
0       14738          qha          B
1       13440          ftq          C
2       10794          ulr          A
3       12196          mah          C
4       17561          viy          B

從結果可以理解為什麼說存csv檔建議index設成False即可,
如果index為True的話,
讀檔時會多讀進一行「0,1,2,3,4」,
導致重複的資料

參考資料

這份資料是pandas的文檔說明,有不懂的函數可以在這邊查

  1. pandas.DataFrame.to_csv
  2. pandas.read_csv

尚未有邦友留言

立即登入留言