[Day04] - pl.Series and pl.DataFrame

2025 iThome 鐵人賽 (Ironman Contest)

DAY 4

Software Development

Polars熊霸天下 series, part 4

17th鐵人賽 python polars

Jerry Wu

2025-09-10 00:00:53

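Today we get to know pl.Series and pl.DataFrame.

Today's outline:

0. Imports and setup
1. pl.Series
2. pl.DataFrame
3. Converting between pl.Series and pl.DataFrame
4. Combining a pl.DataFrame with a pl.Series
5. Combining multiple pl.Series or pl.DataFrame objects
6. codepanda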

0. Imports and setup

import polars as pl
from polars.testing import (
    assert_frame_equal,
    assert_series_equal,
    assert_series_not_equal,
)

1. `pl.Series`

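pl.Series is the Polars data structure for a single column. Its first parameter, name=, is the name of the Series, and its second parameter, values=, is the data it holds. The most common way to build one is to pass an iterable or a numpy.array to values=. For example: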

s1 = pl.Series("s", [1, 2, 3])
# s1 = pl.Series("s", np.array([1, 2, 3]))

shape: (3,)
Series: 's' [i64]
[
	1
	2
	3
]

Two more commonly used parameters are dtype= and strict=.

`dtype=`

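If dtype= is not specified, Polars infers the type automatically. The i64 shown above is the result, because pl.Int64 is the default integer type.

We can build another Series, s2, with dtype= set explicitly to pl.Int64, and then use the Polars testing function assert_series_equal() to confirm that s1 and s2 are equal. If they were not equal, assert_series_equal() would raise an AssertionError.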

s2 = pl.Series("s", [1, 2, 3], dtype=pl.Int64)
assert_series_equal(s1, s2)

Besides assert_series_equal(), we can use assert_series_not_equal() to check that two pl.Series differ. If they are equal, assert_series_not_equal() raises an AssertionError.

s3 = pl.Series("s", [1, 2, 3], dtype=pl.Float64)
s4 = pl.Series([1, 2, 3], dtype=pl.Float64)
assert_series_not_equal(s3, s4)  # name mismatch ("s" vs. "")

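Here, building s4 by passing the values as the first positional argument (where name= would normally go) is a common and tolerated anti-pattern.

The API documentation says: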

It is possible to construct a Series with values as the first positional argument. This syntax considered an anti-pattern, but it can be useful in certain scenarios. You must specify any other arguments through keywords.

As a result, the name of s4 is set to an empty string, so s4 is not equal to s3.

`strict=`

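strict= defaults to True, which means a TypeError is raised when the elements are not all of the same type. If it is set to False, Polars instead tries to convert the column to a type that is compatible with every element. For example: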

s5 = pl.Series([1, None, "3"], strict=False)

shape: (3,)
Series: '' [str]
[
	"1"
	null
	"3"
]

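The elements of s5 are of types pl.Int64, pl.Null, and pl.String respectively, so Polars converts s5 to pl.String, which can hold all three. Note that the first element changes from the number 1 to the string "1".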

Attributes and methods of `pl.Series`

The dimensions of a pl.Series are available through pl.Series.shape:

s1.shape

(3,)

To get a single element of a pl.Series, use pl.Series.item() with an index. For example, index 0 returns the first element:

s1.item(0)

Index -1 returns the last element:

s1.item(-1)

To get the largest few elements, use pl.Series.top_k(); to get the smallest few, use pl.Series.bottom_k(). For example, pl.Series.top_k() can return the two largest elements. Note that the return type is still pl.Series:

s1.top_k(2)

shape: (2,)
Series: 's' [i64]
[
	3
	2
]

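The documentation specifically notes that the returned pl.Series is not guaranteed to be sorted. To get it in ascending or descending order, call pl.Series.sort() afterwards. Its descending= parameter controls the direction and defaults to False, i.e. ascending. For example, to get the two largest elements in ascending order: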

s1.top_k(2).sort()

shape: (2,)
Series: 's' [i64]
[
	2
	3
]

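Finally, a handy but lesser-known method: pl.Series.zip_with(). It takes two parameters, mask= and other=, both of which must be pl.Series. mask= must be a Boolean Series: where a row is True the value comes from the original Series, and where it is False the value comes from other=. In the code below, the comparison s1 < s6 is True for the first and third rows, so those values come from s1, and False for the second row, so that value comes from s6.

s6 = pl.Series([5, 0, 6])
s1.zip_with(s1 < s6, s6)  # s1 < s6 is [True, False, True]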

shape: (3,)
Series: 's' [i64]
[
	1
	0
	3
]

If you already have the Boolean result for each row, you can pass a Boolean Series directly:

s1.zip_with(pl.Series([True, False, True]), s6)

Note that a plain list [True, False, True] cannot be passed here; it must be a pl.Series, i.e. pl.Series([True, False, True]).

2. `pl.DataFrame`

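pl.DataFrame is the Polars data structure for multiple columns. It can be thought of as a container holding several pl.Series.

The most common way to build one is to pass the data, such as a dict that maps column names to values, to the first parameter, data=. For example: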

df1 = pl.DataFrame({"col1": [1, 2, 3], "col2": ["x", "y", "z"]})

shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ i64  ┆ str  │
╞══════╪══════╡
│ 1    ┆ x    │
│ 2    ┆ y    │
│ 3    ┆ z    │
└──────┴──────┘

pl.DataFrame also has a strict= parameter, which works the same way and is not covered again. Two more commonly used parameters are schema= and schema_overrides=.

`schema=`

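schema= plays the same role as dtype= in pl.Series, except that schema= must specify the type of every column; otherwise a ValueError is raised. A common approach is to pass a dict to schema=: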

df2 = pl.DataFrame(
    {"col1": [1, 2, 3], "col2": ["x", "y", "z"]},
    schema={"col1": pl.Int64, "col2": pl.String},
)
assert_frame_equal(df1, df2)

shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ i64  ┆ str  │
╞══════╪══════╡
│ 1    ┆ x    │
│ 2    ┆ y    │
│ 3    ┆ z    │
└──────┴──────┘

For pl.DataFrame, Polars likewise provides the handy testing functions assert_frame_equal() and assert_frame_not_equal().

`schema_overrides=`

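Sometimes we want Polars to infer the types of most columns while setting a few ourselves. That is what schema_overrides= is for. For example, the code below uses schema_overrides= to set the type of column "col1" to pl.Int64: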

df3 = pl.DataFrame(
    {"col1": [1, 2, 3], "col2": ["x", "y", "z"]},
    schema_overrides={"col1": pl.Int64},
)
assert_frame_equal(df1, df3)

shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ i64  ┆ str  │
╞══════╪══════╡
│ 1    ┆ x    │
│ 2    ┆ y    │
│ 3    ┆ z    │
└──────┴──────┘

Because this matches the type Polars infers on its own, assert_frame_equal() finds df1 and df3 equal.

Other construction methods

Another common approach is to pass the data through data= and the column names through schema=:

df4 = pl.DataFrame([[1, 2, 3], ["x", "y", "z"]], schema=["col1", "col2"])

shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ i64  ┆ str  │
╞══════╪══════╡
│ 1    ┆ x    │
│ 2    ┆ y    │
│ 3    ┆ z    │
└──────┴──────┘

Besides the pl.DataFrame constructor, Polars provides many functions of the form pl.from_*() for building a pl.DataFrame, such as pl.from_dict():

df5 = pl.from_dict({"col1": [1, 2, 3], "col2": ["x", "y", "z"]})
assert_frame_equal(df1, df5)

shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ i64  ┆ str  │
╞══════╪══════╡
│ 1    ┆ x    │
│ 2    ┆ y    │
│ 3    ┆ z    │
└──────┴──────┘

Finally, pl.from_repr() lets us build a pl.DataFrame from its text representation:

df6 = pl.from_repr(
    """
    shape: (3, 2)
    ┌──────┬──────┐
    │ col1 ┆ col2 │
    │ ---  ┆ ---  │
    │ i64  ┆ str  │
    ╞══════╪══════╡
    │ 1    ┆ x    │
    │ 2    ┆ y    │
    │ 3    ┆ z    │
    └──────┴──────┘
    """
)
assert_frame_equal(df1, df6)

shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ i64  ┆ str  │
╞══════╪══════╡
│ 1    ┆ x    │
│ 2    ┆ y    │
│ 3    ┆ z    │
└──────┴──────┘

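This approach is best suited to quickly building a small DataFrame. When the dimensions are too large, Polars abbreviates some rows or columns with an ellipsis, and pl.from_repr() ignores those rows or columns.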

Attributes and methods of `pl.DataFrame`

The dimensions of a pl.DataFrame are available through pl.DataFrame.shape:

df1.shape

(3, 2)

The number of rows or columns can also be read separately through pl.DataFrame.height or pl.DataFrame.width:

print(f"{df1.height=}\n{df1.width=}")

df1.height=3
df1.width=2

To get the column names, use:

df1.columns

['col1', 'col2']

To add a column of consecutive numbers to serve as an index, pl.DataFrame.with_row_index() is the tool for the job:

df1.with_row_index()

shape: (3, 3)
┌───────┬──────┬──────┐
│ index ┆ col1 ┆ col2 │
│ ---   ┆ ---  ┆ ---  │
│ u32   ┆ i64  ┆ str  │
╞═══════╪══════╪══════╡
│ 0     ┆ 1    ┆ x    │
│ 1     ┆ 2    ┆ y    │
│ 2     ┆ 3    ┆ z    │
└───────┴──────┴──────┘

To look at the first few rows of a DataFrame, use pl.DataFrame.head():

df1.head()

shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ i64  ┆ str  │
╞══════╪══════╡
│ 1    ┆ x    │
│ 2    ┆ y    │
│ 3    ┆ z    │
└──────┴──────┘

When there are many columns, though, I recommend trying pl.DataFrame.glimpse(), whose layout may suit your needs better:

df1.glimpse()

Rows: 3
Columns: 2
$ col1 <i64> 1, 2, 3
$ col2 <str> 'x', 'y', 'z'

Iterating over columns

Iterating over columns is an anti-pattern in most cases. If you really need to do it, use pl.DataFrame.iter_columns(). For example:

for ser in df1.iter_columns():
    print(ser, end="\n\n")

shape: (3,)
Series: 'col1' [i64]
[
	1
	2
	3
]

shape: (3,)
Series: 'col2' [str]
[
	"x"
	"y"
	"z"
]

To turn the whole DataFrame into a list of Series, use pl.DataFrame.get_columns():

df1.get_columns()

[shape: (3,)
 Series: 'col1' [i64]
 [
 	1
 	2
 	3
 ],
 shape: (3,)
 Series: 'col2' [str]
 [
 	"x"
 	"y"
 	"z"
 ]]

Iterating over rows

Iterating over rows is an anti-pattern in most cases. If you really need to do it, use pl.DataFrame.iter_rows(). For example:

for row in df1.iter_rows():
    print(row)

(1, 'x')
(2, 'y')
(3, 'z')

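Note that the named= parameter defaults to False, which returns tuples. To get dicts instead, set named= to True. This adds overhead but makes it convenient to look up values by column name.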

for row in df1.iter_rows(named=True):
    print(row["col1"], row["col2"])

1 x
2 y
3 z

3. Converting between `pl.Series` and `pl.DataFrame`

A pl.Series can be converted to a pl.DataFrame with pl.Series.to_frame(). For example:

s1.to_frame()

shape: (3, 1)
┌─────┐
│ s   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
│ 3   │
└─────┘

A pl.DataFrame can be converted to a pl.Series with pl.DataFrame.to_series(), which returns the first column by default. For example:

df1.to_series()

shape: (3,)
Series: 'col1' [i64]
[
	1
	2
	3
]

4. Combining a `pl.DataFrame` with a `pl.Series`

pl.DataFrame.with_columns() combines a pl.DataFrame and a pl.Series into a new DataFrame. For example, combining df1 and s1:

df1.with_columns(s1)

shape: (3, 3)
┌──────┬──────┬─────┐
│ col1 ┆ col2 ┆ s   │
│ ---  ┆ ---  ┆ --- │
│ i64  ┆ str  ┆ i64 │
╞══════╪══════╪═════╡
│ 1    ┆ x    ┆ 1   │
│ 2    ┆ y    ┆ 2   │
│ 3    ┆ z    ┆ 3   │
└──────┴──────┴─────┘

5. Combining multiple `pl.Series` or `pl.DataFrame` objects

pl.concat() lets us quickly combine pl.Series and pl.DataFrame objects.

Combining multiple `pl.Series`

The following uses pl.concat() to combine s_v1 and s_v2:

s_v1 = pl.Series("s_v1", [1, 2, 3])
s_v2 = pl.Series("s_v2", [4, 5, 6])
pl.concat([s_v1, s_v2])

shape: (6,)
Series: 's_v1' [i64]
[
	1
	2
	3
	4
	5
	6
]

Note that the new pl.Series takes the name of the first pl.Series.

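Combining multiple `pl.DataFrame` objects

pl.concat() takes a how= parameter that controls whether DataFrames are combined vertically (how="vertical") or horizontally (how="horizontal"). The default is vertical.

The following combines df_v1 and df_v2 vertically (how="vertical"):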

df_v1 = pl.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]})
df_v2 = pl.DataFrame({"col1": [7, 8, 9], "col2": [10, 11, 12]})
pl.concat([df_v1, df_v2], how="vertical")

shape: (6, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 1    ┆ 4    │
│ 2    ┆ 5    │
│ 3    ┆ 6    │
│ 7    ┆ 10   │
│ 8    ┆ 11   │
│ 9    ┆ 12   │
└──────┴──────┘

The following combines df_h1 and df_h2 horizontally (how="horizontal"):

df_h1 = pl.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]})
df_h2 = pl.DataFrame({"col3": [7, 8, 9], "col4": [10, 11, 12]})
pl.concat([df_h1, df_h2], how="horizontal")

shape: (3, 4)
┌──────┬──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 ┆ col4 │
│ ---  ┆ ---  ┆ ---  ┆ ---  │
│ i64  ┆ i64  ┆ i64  ┆ i64  │
╞══════╪══════╪══════╪══════╡
│ 1    ┆ 4    ┆ 7    ┆ 10   │
│ 2    ┆ 5    ┆ 8    ┆ 11   │
│ 3    ┆ 6    ┆ 9    ┆ 12   │
└──────┴──────┴──────┴──────┘

More complex combinations require pl.DataFrame.join(), which is covered in [Day17].

6. `codepanda`

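One of the biggest differences between Pandas and Polars is that Pandas relies heavily on its index to align operations. Polars has no concept of an index, so operations at the pl.Series and pl.DataFrame level are less common; it is far more usual to combine a context with an expr.

In addition, Pandas method names mostly run words together, whereas Polars generally separates them with _. For example, to check whether a string starts with a given prefix, Pandas uses pd.Series.str.startswith() and Polars uses pl.Expr.str.starts_with().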

Code

Link to today's code.
