
2025 iThome Ironman Contest

DAY 18

Software Development

Part 18 of the series "Polars熊霸天下"

[Day18] - Advanced Operations

17th Ironman Contest python polars

Jerry Wu

2025-09-24 00:02:49


Today we share some advanced Polars operations.

Today's outline:

Imports and setup
Conditional logic: pl.when().then().otherwise()
Element replacement: pl.Expr.replace() and pl.Expr.replace_strict()
Concatenating into pl.String: pl.concat_str()
Concatenating into pl.List: pl.concat_list()
Function chaining: pl.DataFrame.pipe()
codepanda

0. Imports and setup

from typing import Callable, Literal
import polars as pl

df = pl.DataFrame({"col1": [1, 2, 3], "col2": ["x", "y", "z"]})

shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ i64  ┆ str  │
╞══════╪══════╡
│ 1    ┆ x    │
│ 2    ┆ y    │
│ 3    ┆ z    │
└──────┴──────┘

df2 = pl.DataFrame({"col 1": [1, 2, 3], "col 2": ["x", "y", "z"]})

shape: (3, 2)
┌───────┬───────┐
│ col 1 ┆ col 2 │
│ ---   ┆ ---   │
│ i64   ┆ str   │
╞═══════╪═══════╡
│ 1     ┆ x     │
│ 2     ┆ y     │
│ 3     ┆ z     │
└───────┴───────┘

1. Conditional logic: `pl.when().then().otherwise()` (*1)

pl.when().then().otherwise() is the conditional expression in Polars; it can be thought of as Polars' if-elif-else.

For example, the following applies a three-branch condition:

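When the value in column "col1" is less than or equal to 1, add 100.
When the value in column "col1" is greater than or equal to 3, add 300.
When the value in column "col1" meets neither condition, add 200.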

(
    df.with_columns(
        pl.when(pl.col("col1").le(1))
        .then(pl.col("col1").add(100))
        .when(pl.col("col1").ge(3))
        .then(pl.col("col1").add(300))
        .otherwise(pl.col("col1").add(200))
        .alias("col3")
    )
)

shape: (3, 3)
┌──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 │
│ ---  ┆ ---  ┆ ---  │
│ i64  ┆ str  ┆ i64  │
╞══════╪══════╪══════╡
│ 1    ┆ x    ┆ 101  │
│ 2    ┆ y    ┆ 202  │
│ 3    ┆ z    ┆ 303  │
└──────┴──────┴──────┘

Because conditional expressions tend to be long, they are often assigned to a variable or wrapped in a function, for example:

cond = (
    pl.when(pl.col("col1").le(1))
    .then(pl.col("col1").add(100))
    .when(pl.col("col1").ge(3))
    .then(pl.col("col1").add(300))
    .otherwise(pl.col("col1").add(200))
    .alias("col3")
)

df.with_columns(cond)

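Because pl.when().then().otherwise() also supports equality comparisons, replacing the values in col1 with "a", "b", and "c" can be written as follows. No .otherwise() is given here, so any value matching none of the conditions would become null: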

(
    df.with_columns(
        pl.when(pl.col("col1").eq(1))
        .then(pl.lit("a"))
        .when(pl.col("col1").eq(2))
        .then(pl.lit("b"))
        .when(pl.col("col1").eq(3))
        .then(pl.lit("c"))
        .alias("col3")
    )
)

shape: (3, 3)
┌──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 │
│ ---  ┆ ---  ┆ ---  │
│ i64  ┆ str  ┆ str  │
╞══════╪══════╪══════╡
│ 1    ┆ x    ┆ a    │
│ 2    ┆ y    ┆ b    │
│ 3    ┆ z    ┆ c    │
└──────┴──────┴──────┘

Note that a fixed value must be wrapped in pl.lit(); otherwise Polars interprets the string as a column name.

Replacing elements with pl.when().then().otherwise() usually produces a long chain. pl.Expr.replace() and pl.Expr.replace_strict() are more convenient.

2. Element replacement: `pl.Expr.replace()` and `pl.Expr.replace_strict()`

pl.Expr.replace() accepts a dictionary that maps old values to new ones, which makes element replacement quick, for example:

(
    df.with_columns(
        pl.col("col2")
        .replace({"x": "a", "y": "b", "z": "c"})
        .alias("col3")
    )
)

shape: (3, 3)
┌──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 │
│ ---  ┆ ---  ┆ ---  │
│ i64  ┆ str  ┆ str  │
╞══════╪══════╪══════╡
│ 1    ┆ x    ┆ a    │
│ 2    ┆ y    ┆ b    │
│ 3    ┆ z    ┆ c    │
└──────┴──────┴──────┘

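pl.Expr.replace_strict() is used much like pl.Expr.replace(), but it additionally accepts the default= and return_dtype= parameters. It is also stricter: if a value is missing from the mapping and no default= is given, it raises an error.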

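default= sets the fallback value, and return_dtype= sets the type of the result. For example, default=6 replaces every value other than "x" and "y" with 6, and return_dtype=pl.Int64 makes the result pl.Int64 rather than the original pl.String.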

(
    df.with_columns(
        pl.col("col2")
        .replace_strict({"x": 4, "y": 5}, default=6, return_dtype=pl.Int64)
        .alias("col3")
    )
)

shape: (3, 3)
┌──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 │
│ ---  ┆ ---  ┆ ---  │
│ i64  ┆ str  ┆ i64  │
╞══════╪══════╪══════╡
│ 1    ┆ x    ┆ 4    │
│ 2    ┆ y    ┆ 5    │
│ 3    ┆ z    ┆ 6    │
└──────┴──────┴──────┘

3. Concatenating into `pl.String`: `pl.concat_str()`

pl.concat_str() concatenates several columns into a new column of type pl.String, for example:

df.with_columns(pl.concat_str(pl.all()).alias("col3"))

shape: (3, 3)
┌──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 │
│ ---  ┆ ---  ┆ ---  │
│ i64  ┆ str  ┆ str  │
╞══════╪══════╪══════╡
│ 1    ┆ x    ┆ 1x   │
│ 2    ┆ y    ┆ 2y   │
│ 3    ┆ z    ┆ 3z   │
└──────┴──────┴──────┘

Note that pl.concat_str() automatically casts columns that are not pl.String to pl.String before concatenating. Here column "col1" is pl.Int64, yet it still combines with column "col2" to produce column "col3".

4. Concatenating into `pl.List`: `pl.concat_list()`

pl.concat_list() combines several columns into a single pl.List column, for example:

df.with_columns(pl.concat_list(pl.all()).alias("col3"))

shape: (3, 3)
┌──────┬──────┬────────────┐
│ col1 ┆ col2 ┆ col3       │
│ ---  ┆ ---  ┆ ---        │
│ i64  ┆ str  ┆ list[str]  │
╞══════╪══════╪════════════╡
│ 1    ┆ x    ┆ ["1", "x"] │
│ 2    ┆ y    ┆ ["2", "y"] │
│ 3    ┆ z    ┆ ["3", "z"] │
└──────┴──────┴────────────┘

Because all elements of a pl.List must share one type, pl.concat_list() picks the most suitable common type for us, which is why column "col3" holds pl.String elements here.

Incidentally, to split a pl.String column into a pl.List, use pl.Expr.str.split(), for example:

(
    df.select(pl.concat_str(pl.all()).alias("col3")).with_columns(
        pl.col("col3").str.split("").alias("col4")
    )
)

shape: (3, 2)
┌──────┬────────────┐
│ col3 ┆ col4       │
│ ---  ┆ ---        │
│ str  ┆ list[str]  │
╞══════╪════════════╡
│ 1x   ┆ ["1", "x"] │
│ 2y   ┆ ["2", "y"] │
│ 3z   ┆ ["3", "z"] │
└──────┴────────────┘

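Note that the empty string "" must be passed as the first parameter, by=, of pl.Expr.str.split() here so that the string is split into individual characters.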

5. Function chaining: `pl.DataFrame.pipe()` (*2)

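Finally, let's look at pl.DataFrame.pipe(), the secret weapon for chaining operations on a pl.DataFrame. Its first parameter is a function, followed by the positional and keyword arguments that function needs. Because the function must take the current pl.DataFrame as its first argument, pipe() effectively gives users a hook for any custom operation. For example, to remove all whitespace from the column names and capitalize the first letter, we can write (Note 1):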

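def fmt_col(df_: pl.DataFrame) -> pl.DataFrame:
    # Return a renamed DataFrame instead of assigning to df_.columns,
    # which would also rename the caller's DataFrame in place.
    return df_.rename(lambda c: "".join(c.split()).capitalize())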


df2.pipe(fmt_col)

shape: (3, 2)
┌──────┬──────┐
│ Col1 ┆ Col2 │
│ ---  ┆ ---  │
│ i64  ┆ str  │
╞══════╪══════╡
│ 1    ┆ x    │
│ 2    ┆ y    │
│ 3    ┆ z    │
└──────┴──────┘

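Next, we extend fmt_col() into fmt_col2(), which accepts a fmt_type= parameter to choose the string conversion. It still removes all whitespace from the column names, and fmt_type= lets the caller select all uppercase, all lowercase, or first letter capitalized: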

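def fmt_col2(
    df_: pl.DataFrame,
    fmt_type: Literal["upper", "lower", "capitalize"] | None = None,
) -> pl.DataFrame:
    # Default: leave the case of the name unchanged.
    fmt_func: Callable[[str], str] = lambda x: x
    if fmt_type in {"upper", "lower", "capitalize"}:
        # Look up str.upper, str.lower, or str.capitalize by name.
        fmt_func = getattr(str, fmt_type)

    # Remove whitespace first, then apply the chosen conversion.
    # rename() returns a new DataFrame, so the caller's frame is not mutated.
    return df_.rename(lambda c: fmt_func("".join(c.split())))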


df2.pipe(fmt_col2, fmt_type="capitalize")

shape: (3, 2)
┌──────┬──────┐
│ Col1 ┆ Col2 │
│ ---  ┆ ---  │
│ i64  ┆ str  │
╞══════╪══════╡
│ 1    ┆ x    │
│ 2    ┆ y    │
│ 3    ┆ z    │
└──────┴──────┘

Calling pl.DataFrame.pipe() repeatedly like this lets us chain all kinds of custom functions, which gives the code a clear structure.

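Readers who want to learn more about this style can watch the hands-on talk by Vincent D. Warmerdam, who contributed the PR for this function, on marimo's official YouTube channel.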

6. `codepanda`

*1. Conditional logic: `pd.Series.case_when()`

The Pandas counterpart of pl.when().then().otherwise() is pd.Series.case_when(), available since pandas 2.2.

For example, the following applies a three-branch condition:

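When the value of "col1" is less than or equal to 1, add 100.
When the value of "col1" is greater than or equal to 3, add 300.
When the value of "col1" meets neither condition, add 200.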

import pandas as pd

df_pd = pd.DataFrame({"col1": [1, 2, 3], "col2": ["x", "y", "z"]})

(
    df_pd.assign(
        col3=lambda df_: df_.col1.case_when(
            [
                (df_.col1.le(1), df_.col1.add(100)),
                (df_.col1.ge(3), df_.col1.add(300)),
                ((~df_.col1.le(1)) & (~df_.col1.ge(3)), df_.col1.add(200)),
            ]
        )
    )
)

   col1 col2  col3
0     1    x   101
1     2    y   202
2     3    z   303

*2. Function chaining: `pd.DataFrame.pipe()`

Pandas also provides pd.DataFrame.pipe() for chaining functions.
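A rough Pandas equivalent of the column-formatting example might look like this sketch. The names `fmt_col_pd` and `df_pd2` are hypothetical and used only for illustration.

```python
import pandas as pd


def fmt_col_pd(df_: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: strip whitespace from column names and capitalize."""
    return df_.rename(columns=lambda c: "".join(c.split()).capitalize())


df_pd2 = pd.DataFrame({"col 1": [1, 2, 3], "col 2": ["x", "y", "z"]})

# pipe() passes df_pd2 as the first argument of fmt_col_pd.
result = df_pd2.pipe(fmt_col_pd)
print(result)
```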

Notes

Note 1: An approach more in keeping with Polars' design principles is to rename columns through the pl.Expr.name namespace, for example:

(
    df2.select(
        pl.all().name.map(lambda c: "".join(c.split()).capitalize())
    )
)

shape: (3, 2)
┌──────┬──────┐
│ Col1 ┆ Col2 │
│ ---  ┆ ---  │
│ i64  ┆ str  │
╞══════╪══════╡
│ 1    ┆ x    │
│ 2    ┆ y    │
│ 3    ┆ z    │
└──────┴──────┘