iT邦幫忙

11th iThome Ironman Contest

DAY 5
AI & Data

Hands on Data Cleaning and Scraping series, part 5

Day05 Pandas skills: reading files in different formats

This first part covers how to read in (.csv) files and explains some basic, frequently used Pandas functions.

The file read in by the demo code was downloaded from this site.

Read in CSV Files

import pandas as pd # import the package under its usual alias
df = pd.read_csv('example.csv') # read in the file and name it df

https://ithelp.ithome.com.tw/upload/images/20190906/20119709fmxjm9PThK.jpg
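
Real CSV files often need a few extra arguments. The sketch below is only an illustration: the semicolon-separated sample text and its column names are made up, and `io.StringIO` stands in for a file on disk. It shows three commonly used `read_csv` parameters: `sep`, `usecols`, and `nrows`.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for a CSV file on disk
csv_text = "id;name;score\n1;Ann;90\n2;Bob;85\n3;Cy;70\n"

# sep sets the delimiter, usecols keeps only the listed columns,
# nrows limits how many data rows are read
df_small = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",
    usecols=["id", "score"],
    nrows=2,
)
print(df_small)
```

Reading only the columns and rows you need is a handy way to peek at a large file before loading all of it.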

After reading in the file, we can use a few basic functions to look at the data and get a feel for it.

# .head() returns the first n rows of the data; pass an integer n, or leave it blank to get 5
df.head()

https://ithelp.ithome.com.tw/upload/images/20190906/20119709se8bpYfzVN.jpg

# .tail() returns the last n rows of the data; pass an integer n, or leave it blank to get 5
df.tail()

https://ithelp.ithome.com.tw/upload/images/20190906/20119709NR1k1vrD5F.jpg
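
To make the optional argument concrete, here is a small sketch on a made-up ten-row DataFrame (the name `df_demo` and its `value` column are assumptions for illustration, not part of the article's data).

```python
import pandas as pd

# Hypothetical ten-row frame used only for illustration
df_demo = pd.DataFrame({"value": range(10)})

first_three = df_demo.head(3)   # rows at positions 0, 1, 2
last_two = df_demo.tail(2)      # rows at positions 8, 9
default_head = df_demo.head()   # no argument: 5 rows

print(first_three)
print(last_two)
```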

.iloc[] usage

# .iloc[a:b] selects rows by position: a is included, b is not
df.iloc[2:4]

https://ithelp.ithome.com.tw/upload/images/20190906/20119709Fj3StFG9T3.jpg

# .iloc[a:b, c:d] selects [rows, columns]
df.iloc[:, :5] # select all rows and the first five columns

https://ithelp.ithome.com.tw/upload/images/20190906/20119709ePGAOcElOy.jpg

# .iloc[[], []] selects [[rows], [columns]] from lists of positions
df.iloc[[0], [0, 1, 2]]

https://ithelp.ithome.com.tw/upload/images/20190906/201197091D9PrmN8cm.jpg
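
The three `.iloc[]` forms above can be compared side by side on a tiny made-up DataFrame. The frame `df_demo`, its column names, and its string index are assumptions chosen to show that `.iloc[]` works by position, not by label.

```python
import pandas as pd

# Hypothetical frame with a non-numeric index to stress that iloc ignores labels
df_demo = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": [5, 6, 7, 8], "c": [9, 10, 11, 12]},
    index=["w", "x", "y", "z"],
)

rows = df_demo.iloc[1:3]            # positions 1 and 2, i.e. labels 'x' and 'y'
block = df_demo.iloc[:, :2]         # all rows, first two columns
picked = df_demo.iloc[[0], [0, 2]]  # lists pick exact positions; result is a DataFrame
cell = df_demo.iloc[0, 0]           # two plain integers return a single value

print(rows)
print(picked)
```

Note the difference between `df_demo.iloc[[0], [0, 2]]`, which keeps the DataFrame structure, and `df_demo.iloc[0, 0]`, which returns a scalar.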

# .index returns the row labels
df.index

https://ithelp.ithome.com.tw/upload/images/20190906/20119709L47oQtsoah.jpg

# .columns returns the column labels
df.columns

https://ithelp.ithome.com.tw/upload/images/20190906/20119709pLnoubwL3P.jpg

# .shape returns the dimensions of the DataFrame as (rows, columns)
df.shape

https://ithelp.ithome.com.tw/upload/images/20190906/20119709SwpEVoevWG.jpg

# .info() prints a summary of the DataFrame (column names, non-null counts, dtypes)
df.info()

https://ithelp.ithome.com.tw/upload/images/20190906/201197098qjL2JJ5kI.jpg
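
As a quick recap, these attributes can be unpacked into ordinary Python values. This is a minimal sketch on a made-up two-row frame; `df_demo` and its columns are hypothetical.

```python
import pandas as pd

# Hypothetical frame used only for illustration
df_demo = pd.DataFrame({"name": ["Ann", "Bob"], "score": [90, 85]})

n_rows, n_cols = df_demo.shape      # shape is a (rows, columns) tuple
row_labels = list(df_demo.index)    # default RangeIndex gives [0, 1]
col_labels = list(df_demo.columns)  # ['name', 'score']
df_demo.info()                      # prints the summary; the return value is None
```

`.index`, `.columns`, and `.shape` are attributes (no parentheses), while `.info()` is a method call.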

The following briefly shows how to read a few other file formats:

Text (.txt)

with open('example.txt', 'r') as ex: # 'r' opens the file in read mode
    data = ex.readlines() # read every line and save the list as data
print(data)

https://ithelp.ithome.com.tw/upload/images/20190906/20119709YjD9HNOoRy.jpg
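
One detail worth knowing is that `readlines()` keeps the trailing newline on every line. The sketch below writes a small temporary file first (its path and contents are made up so the example is self-contained) and then shows one common way to strip the newlines.

```python
import os
import tempfile

# Write a small hypothetical file so the example is self-contained
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("first line\nsecond line\n")

with open(path, "r", encoding="utf-8") as ex:
    raw = ex.readlines()  # each item keeps its trailing '\n'

with open(path, "r", encoding="utf-8") as ex:
    # iterating over the file yields one line at a time
    clean = [line.rstrip("\n") for line in ex]

print(raw)
print(clean)
```

Passing `encoding` explicitly is optional, but it avoids surprises when the platform default encoding differs from the file's encoding.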

Image files (.png/ .jpg ...) can be read with packages such as PIL, Skimage, or CV2.

Using CV2

# CV2 tends to read faster, but it loads images in BGR channel order
import cv2
import numpy as np
import matplotlib.pyplot as plt
image = cv2.imread('example.jpg') # returns None if the file cannot be read
# remember to convert to RGB and save the result back to the variable
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
image = np.array(image) # cv2 already returns a NumPy array; this only makes a copy
plt.imshow(image)
plt.show()

https://ithelp.ithome.com.tw/upload/images/20190906/20119709U7Sx5Q8wG7.png
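
To see what the BGR to RGB conversion does to the pixel data, here is a NumPy-only sketch. The 1x2 "image" is made up, and reversing the last axis is used as a stand-in for `cv2.cvtColor(image, cv2.COLOR_BGR2RGB)` on a 3-channel image; it is not the author's code.

```python
import numpy as np

# A hypothetical 1x2 image in BGR order: one pure-blue and one pure-red pixel
bgr = np.array([[[255, 0, 0], [0, 0, 255]]], dtype=np.uint8)

# Reversing the last axis swaps the B and R channels
rgb = bgr[..., ::-1]

print(bgr[0, 0], "->", rgb[0, 0])  # the blue pixel, before and after
```

If the conversion is skipped, matplotlib interprets the array as RGB and blue and red appear swapped in the plot.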

Using PIL

import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
image = Image.open('example.jpg')
image = np.array(image) # convert the PIL image to a NumPy array
plt.imshow(image)
plt.show()

https://ithelp.ithome.com.tw/upload/images/20190906/20119709U7Sx5Q8wG7.png
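
A small point that is easy to trip over: PIL reports an image's size as (width, height), while the NumPy array made from it has shape (height, width, channels). The sketch below builds a tiny in-memory image instead of reading a file; the 4x2 size and the red fill color are arbitrary choices for illustration.

```python
import io
import numpy as np
from PIL import Image

# Build a tiny hypothetical image in memory instead of reading a file from disk
buffer = io.BytesIO()
Image.new("RGB", (4, 2), color=(255, 0, 0)).save(buffer, format="PNG")
buffer.seek(0)

image = Image.open(buffer)
size = image.size      # PIL reports (width, height)
arr = np.array(image)  # NumPy reports (height, width, channels)

print(size, arr.shape)
```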

Using Skimage

import numpy as np
import matplotlib.pyplot as plt
import skimage.io as skio
image = skio.imread('example.jpg') # already returns a NumPy array in RGB order
image = np.array(image) # convert the image to a NumPy array
plt.imshow(image)
plt.show()

https://ithelp.ithome.com.tw/upload/images/20190906/20119709U7Sx5Q8wG7.png

Read in some other formats.

MATLAB file (.mat): a binary data container format used by MATLAB.

import scipy.io as sio # load SciPy's io module
data = sio.loadmat('example.mat') # returns a dict keyed by variable name
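
Since `loadmat` returns a dictionary, it helps to know what is inside it. The round trip below is a sketch under assumptions: the file is written to a temporary folder with `savemat`, and the variable name `scores` is made up.

```python
import os
import tempfile
import numpy as np
import scipy.io as sio

# Save a hypothetical variable called 'scores' to a temporary .mat file
path = os.path.join(tempfile.mkdtemp(), "example.mat")
sio.savemat(path, {"scores": np.array([1, 2, 3])})

data = sio.loadmat(path)
# Skip metadata entries such as __header__, __version__, and __globals__
keys = sorted(k for k in data if not k.startswith("__"))
scores = data["scores"]  # the 1-D input comes back as a 2-D array

print(keys, scores.shape)
```

MATLAB has no 1-D arrays, so a vector comes back with two dimensions; `scores.ravel()` flattens it again if needed.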

NumPy (.npy): NumPy's binary array format, which can store data after it has been processed.

import numpy as np
arr = np.load('example.npy')
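
For completeness, the saving side is `np.save`. The round trip below is a minimal sketch that writes to a temporary folder (the path and array contents are made up) and shows that shape and dtype survive.

```python
import os
import tempfile
import numpy as np

# Save a hypothetical processed array to a temporary .npy file
path = os.path.join(tempfile.mkdtemp(), "example.npy")
original = np.arange(6, dtype=np.float32).reshape(2, 3)
np.save(path, original)

arr = np.load(path)  # shape and dtype are restored from the file
print(arr.shape, arr.dtype)
```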

Pickle (.pkl): serialized Python objects, which can also store data after it has been processed.

import pickle
with open('example.pkl', 'rb') as ex:
    arr = pickle.load(ex)
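
The matching write step uses `pickle.dump` with the file opened in binary mode (`'wb'`). The sketch below round-trips a made-up dictionary through a temporary file; unlike JSON, pickle preserves Python-specific types such as tuples.

```python
import os
import pickle
import tempfile

# Hypothetical object to store; any picklable Python object works
record = {"name": "example", "values": [1, 2, 3], "pair": (4, 5)}

path = os.path.join(tempfile.mkdtemp(), "example.pkl")
with open(path, "wb") as f:   # 'wb' = write in binary mode
    pickle.dump(record, f)

with open(path, "rb") as ex:  # 'rb' = read in binary mode
    restored = pickle.load(ex)

print(restored)
```

Only unpickle files from sources you trust, because loading a pickle can execute arbitrary code.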

JSON (.json)

import json # load the package first
with open('example.json','r') as ex:
    data = json.load(ex)
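
Parsed JSON is plain Python data (dicts and lists), so a list of records can be handed straight to Pandas. In this sketch the JSON text and its field names are made up, and `json.loads` parses a string where `json.load` would parse an open file.

```python
import json
import pandas as pd

# Hypothetical JSON text: a list of records with the same keys
json_text = '[{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]'

records = json.loads(json_text)  # a list of dicts
df_json = pd.DataFrame(records)  # one row per record, one column per key

print(df_json)
```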

The code for this article is available on GitHub.

Please let me know if there's any mistake in this article. Thanks for reading.

Reference:

[1] Content from the 2nd 100-Day Machine Learning Marathon (第二屆機器學習百日馬拉松)

[2] DataFrame

[3] Berlin Open Data

