2024 iThome Ironman Contest (鐵人賽)

DAY 23

AI/ ML & Data

Introducing Statistics and Machine Learning with Python, Part 23

Day 23: [ML-3] iris dataset -- Explore data

16th Ironman Contest

Meteor

2024-10-07 23:41:39


Q-Q Plot

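This section uses the pingouin package to draw a Q-Q plot for each variable. pingouin is an open-source statistics package with many statistical functions: besides drawing Q-Q plots, it can run t-tests, Pearson's correlation, normality tests, ANOVA, and more. If you are not comfortable writing statistical code yourself, pingouin is a good choice.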

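The guidelines page on the pingouin website also walks through the steps for running various statistical methods. It uses tree-style flowcharts to show how to decide which method applies and which code goes with it. The flowcharts are reproduced below.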

1. ANOVA

from IPython.display import Image  # displays an image from a URL inside the notebook

url = 'https://pingouin-stats.org/build/html/_images/flowchart_one_way_ANOVA.svg'
Image(url=url)

2. Correlation

url = 'https://pingouin-stats.org/build/html/_images/flowchart_correlations.svg'
Image(url=url)

3. Non-Parametric

url = 'https://pingouin-stats.org/build/html/_images/flowchart_nonparametric.svg'
Image(url=url)

Install the package first, then draw the Q-Q plots.

!pip install pingouin

From top to bottom, the figures are the Q-Q plots of

Sepal Length
Sepal Width
Petal Length
Petal Width

The red dashed lines are the upper and lower confidence bounds at the $95\%$ confidence level.

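Looking at the Q-Q plots, the distributions of two variables, Sepal Length and Sepal Width, are fairly close to normal. The other two variables look bimodal (although the histograms do not necessarily show two peaks), so they are not normally distributed.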

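import matplotlib.pyplot as plt
import pingouin as pg

# df and figures (the feature column names) come from the earlier steps of this notebook.
for col in figures:
  pg.qqplot(df[col])  # one Q-Q plot per feature column
  plt.show()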
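To make the Q-Q plot less of a black box, here is a minimal sketch of how the plotted points can be computed using only the Python standard library. It assumes the plotting positions (i - 0.5) / n and a z-score standardization of the sample; pingouin's exact formula and scaling may differ. The function name `qq_points` and the toy data are hypothetical, not part of the original notebook.

```python
from statistics import NormalDist, mean, stdev

def qq_points(sample):
    """Return (theoretical, observed) quantile pairs for a normal Q-Q plot."""
    n = len(sample)
    mu, sigma = mean(sample), stdev(sample)
    # Observed quantiles: the sorted sample, standardized to z-scores.
    observed = [(x - mu) / sigma for x in sorted(sample)]
    # Theoretical quantiles: standard normal quantiles at the plotting
    # positions (i - 0.5) / n. This choice is an assumption; other
    # plotting-position formulas exist.
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, observed))

# Hypothetical toy measurements, not the real iris data.
toy = [4.9, 5.1, 5.0, 5.4, 4.6, 5.8, 5.2, 4.7, 5.5, 5.0]
points = qq_points(toy)
for t, o in points:
    print(f"{t:+.3f}  {o:+.3f}")
```

If the sample is roughly normal, the pairs fall close to the line y = x. Systematic bends away from that line, such as an S-shape, suggest a departure from normality.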

6. Split the iris dataset into training and testing sets

n, p = df.shape

(n, p)

from sklearn.model_selection import train_test_split

# Divide dataset into predictive variables and response variable.
X, y = df[figures], df['class']

# Hold out test_size (30%) of the 150 samples as testing data; the remaining 70% is training data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=20241006)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
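The shapes printed above follow from simple arithmetic. The sketch below reproduces the split sizes with the standard library only, assuming (as I understand scikit-learn's behavior) that the test count is rounded up and the rest goes to training. The helper `shuffle_split` is hypothetical and uses Python's own random generator, so the selected indices will not match those chosen by `train_test_split`; only the sizes are comparable.

```python
import math
import random

def shuffle_split(n_samples, test_size, seed):
    """Shuffle row indices and split them into train and test index lists."""
    # Assumption: the test count is ceil(test_size * n_samples).
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    indices = list(range(n_samples))
    # A seeded generator makes the shuffle reproducible, like random_state.
    random.Random(seed).shuffle(indices)
    return indices[:n_train], indices[n_train:]

train_idx, test_idx = shuffle_split(150, 0.3, seed=20241006)
print(len(train_idx), len(test_idx))  # 105 45
```

With 4 feature columns, this corresponds to X_train having 105 rows and X_test having 45 rows.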

Inspect X_train and y_train to check whether the data has been shuffled. The output confirms that the rows are indeed shuffled.

X_train.head()

X_train.describe()

y_train.value_counts()
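The `value_counts()` check matters because a purely random split does not guarantee equal class counts in the training set. The standard-library sketch below illustrates this with 50 labels per class, matching the iris dataset; the label strings and the seed are placeholders, and the actual values in `df['class']` may be spelled differently.

```python
import random
from collections import Counter

# Placeholder labels: iris has 50 samples for each of its three species.
labels = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50

rng = random.Random(0)  # arbitrary seed, chosen only for reproducibility
shuffled = labels[:]
rng.shuffle(shuffled)

# Keep the first 105 shuffled labels as a stand-in for y_train.
train_labels = shuffled[:105]
counts = Counter(train_labels)
print(counts)
```

The three counts sum to 105 but are usually not exactly 35 each. If balanced classes matter, `train_test_split` accepts a `stratify` argument (for example `stratify=y`) that keeps the class proportions in both subsets.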

