[Day06] - pl.Expr and selectors

2025 iThome Ironman Contest

DAY 6

Software Development

Polars熊霸天下 series, part 6

17th Ironman Contest, python, polars

Jerry Wu

2025-09-12 00:06:57

Today we get to know pl.Expr and introduce selectors, a handy tool for selecting columns quickly.

Today's outline:

Imports and setup
pl.Expr
selectors
codepanda

0. Imports and setup

from datetime import date

import pandas as pd
import polars as pl
import polars.selectors as cs

data = {
    "col1": [1, 2, 3],
    "col2": [4.1, 5.2, 6.3],
    "col3": ["x", "y", "z"],
    "col4": pl.date_range(
        date(2022, 1, 1), date(2022, 3, 1), "1mo", eager=True
    ).alias("date"),
    "col5": [True, True, True],
}

# Build the Polars DataFrame used throughout this article.
df = pl.DataFrame(data)
df

shape: (3, 5)
┌──────┬──────┬──────┬────────────┬──────┐
│ col1 ┆ col2 ┆ col3 ┆ col4       ┆ col5 │
│ ---  ┆ ---  ┆ ---  ┆ ---        ┆ ---  │
│ i64  ┆ f64  ┆ str  ┆ date       ┆ bool │
╞══════╪══════╪══════╪════════════╪══════╡
│ 1    ┆ 4.1  ┆ x    ┆ 2022-01-01 ┆ true │
│ 2    ┆ 5.2  ┆ y    ┆ 2022-02-01 ┆ true │
│ 3    ┆ 6.3  ┆ z    ┆ 2022-03-01 ┆ true │
└──────┴──────┴──────┴────────────┴──────┘

1. `pl.Expr`

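pl.Expr is the expression type Polars uses to describe a computation. It is commonly used in three ways:

Define the expr inside a context.
Assign the expr to a variable.
Wrap the expr in a function.

For example, to add 1 to the "col1" column, we can define the expr inside a context: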

df.select(pl.col("col1").add(1))

We can also assign the expr to a variable, col1_add_one:

col1_add_one = pl.col("col1").add(1)

df.select(col1_add_one)

Or we can wrap the expr in a function, add_one:

def add_one(col: str, alias: str | None = None) -> pl.Expr:
    expr = pl.col(col).add(1)
    if alias is not None:
        expr = expr.alias(alias)
    return expr


df.select(add_one("col1"))

All three snippets produce the same result:

shape: (3, 1)
┌──────┐
│ col1 │
│ ---  │
│ i64  │
╞══════╡
│ 2    │
│ 3    │
│ 4    │
└──────┘

In addition, the same expr can produce different results in different contexts. For example, let's assign pl.col("col3").eq("y") to a variable, col3_eq_y, and look at the result in each context:

col3_eq_y = pl.col("col3").eq("y")

In pl.DataFrame.select(), we get one boolean column indicating whether "col3" equals "y"; the resulting DataFrame has the same number of rows:

df.select(col3_eq_y)

shape: (3, 1)
┌───────┐
│ col3  │
│ ---   │
│ bool  │
╞═══════╡
│ false │
│ true  │
│ false │
└───────┘

In pl.DataFrame.filter(), we keep only the rows where "col3" equals "y", which yields a shorter DataFrame:

df.filter(col3_eq_y)

shape: (1, 5)
┌──────┬──────┬──────┬────────────┬──────┐
│ col1 ┆ col2 ┆ col3 ┆ col4       ┆ col5 │
│ ---  ┆ ---  ┆ ---  ┆ ---        ┆ ---  │
│ i64  ┆ f64  ┆ str  ┆ date       ┆ bool │
╞══════╪══════╪══════╪════════════╪══════╡
│ 2    ┆ 5.2  ┆ y    ┆ 2022-02-01 ┆ true │
└──────┴──────┴──────┴────────────┴──────┘

In pl.DataFrame.group_by().agg(), we group by whether "col3" equals "y" and aggregate each of the other columns into a pl.List, which also yields a shorter DataFrame:

df.group_by(col3_eq_y).agg(pl.all())

shape: (2, 5)
┌───────┬───────────┬────────────┬──────────────────────────┬──────────────┐
│ col3  ┆ col1      ┆ col2       ┆ col4                     ┆ col5         │
│ ---   ┆ ---       ┆ ---        ┆ ---                      ┆ ---          │
│ bool  ┆ list[i64] ┆ list[f64]  ┆ list[date]               ┆ list[bool]   │
╞═══════╪═══════════╪════════════╪══════════════════════════╪══════════════╡
│ true  ┆ [2]       ┆ [5.2]      ┆ [2022-02-01]             ┆ [true]       │
│ false ┆ [1, 3]    ┆ [4.1, 6.3] ┆ [2022-01-01, 2022-03-01] ┆ [true, true] │
└───────┴───────────┴────────────┴──────────────────────────┴──────────────┘

2. `selectors`

As the source code shows, _selector_proxy_, the core of selectors, inherits from Expr (Note 2):

# polars/py-polars/polars/selectors.py

class _selector_proxy_(Expr):
    """Base column selector expression/proxy."""

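In selectors.py, Polars provides many shortcut functions that call _selector_proxy_ under the hood to help us select columns quickly. For example, to select every integer-typed column at once, we can write: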

df.select(cs.integer())

shape: (3, 1)
┌──────┐
│ col1 │
│ ---  │
│ i64  │
╞══════╡
│ 1    │
│ 2    │
│ 3    │
└──────┘

The source of the integer() function is as follows:

from polars import functions as F
from polars.datatypes.group import INTEGER_DTYPES

def integer() -> SelectorType:
    return _selector_proxy_(F.col(INTEGER_DTYPES), name="integer")

_selector_proxy_ selects every column whose dtype is in INTEGER_DTYPES. With pl.col(), the equivalent is:

from polars.datatypes.group import INTEGER_DTYPES

df.select(pl.col(INTEGER_DTYPES))

shape: (3, 1)
┌──────┐
│ col1 │
│ ---  │
│ i64  │
╞══════╡
│ 1    │
│ 2    │
│ 3    │
└──────┘

Various `selectors`

There are many kinds of selectors. For example, to select every temporal column, use cs.temporal():

df.select(cs.temporal())

shape: (3, 1)
┌────────────┐
│ col4       │
│ ---        │
│ date       │
╞════════════╡
│ 2022-01-01 │
│ 2022-02-01 │
│ 2022-03-01 │
└────────────┘

To select every numeric column, use cs.numeric():

df.select(cs.numeric())

shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ i64  ┆ f64  │
╞══════╪══════╡
│ 1    ┆ 4.1  │
│ 2    ┆ 5.2  │
│ 3    ┆ 6.3  │
└──────┴──────┘

Set operations

shape: (5, 2)
┌──────────┬──────────────────────┐
│ Operator ┆ Operation            │
│ ---      ┆ ---                  │
│ str      ┆ str                  │
╞══════════╪══════════════════════╡
│ A | B    ┆ Union                │
│ A & B    ┆ Intersection         │
│ A - B    ┆ Difference           │
│ A ^ B    ┆ Symmetric difference │
│ ~A       ┆ Complement           │
└──────────┴──────────────────────┘

The most powerful feature of selectors is their five set operations, which make it easy to pick columns in all kinds of situations (Note 1).

Union

To select all numeric and string columns together, we can write:

df.select(cs.numeric() | cs.string())

shape: (3, 3)
┌──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 │
│ ---  ┆ ---  ┆ ---  │
│ i64  ┆ f64  ┆ str  │
╞══════╪══════╪══════╡
│ 1    ┆ 4.1  ┆ x    │
│ 2    ┆ 5.2  ┆ y    │
│ 3    ┆ 6.3  ┆ z    │
└──────┴──────┴──────┘

Intersection

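To select the columns that belong to both sets, for example the columns that are in set A (the numeric columns) and also in set B (the "col2" and "col5" columns), we can write: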

df.select(cs.numeric() & cs.by_name("col2", "col5"))

shape: (3, 1)
┌──────┐
│ col2 │
│ ---  │
│ f64  │
╞══════╡
│ 4.1  │
│ 5.2  │
│ 6.3  │
└──────┘

Difference

To select all numeric columns except the integer ones, we can write:

df.select(cs.numeric() - cs.integer())

shape: (3, 1)
┌──────┐
│ col2 │
│ ---  │
│ f64  │
╞══════╡
│ 4.1  │
│ 5.2  │
│ 6.3  │
└──────┘

Symmetric difference

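Sometimes we want only the columns that do not belong to both sets at once. For example, to select the columns that are only in set A (the numeric columns) or only in set B (the "col2" and "col5" columns), we can write: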

df.select(cs.numeric() ^ (cs.by_name("col2", "col5")))

shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col5 │
│ ---  ┆ ---  │
│ i64  ┆ bool │
╞══════╪══════╡
│ 1    ┆ true │
│ 2    ┆ true │
│ 3    ┆ true │
└──────┴──────┘

Complement

To select all columns that are not temporal, we can write:

df.select(~cs.temporal())

shape: (3, 4)
┌──────┬──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 ┆ col5 │
│ ---  ┆ ---  ┆ ---  ┆ ---  │
│ i64  ┆ f64  ┆ str  ┆ bool │
╞══════╪══════╪══════╪══════╡
│ 1    ┆ 4.1  ┆ x    ┆ true │
│ 2    ┆ 5.2  ┆ y    ┆ true │
│ 3    ┆ 6.3  ┆ z    ┆ true │
└──────┴──────┴──────┴──────┘

`as_expr()`

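as_expr() is an interesting feature that converts a selector into an expr. Operators then act on the data itself rather than on the set of selected columns.

For example, to apply ~ to all boolean columns, that is, to turn every True into False and every False into True, we cannot write: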

df.select(~cs.boolean())

shape: (3, 4)
┌──────┬──────┬──────┬────────────┐
│ col1 ┆ col2 ┆ col3 ┆ col4       │
│ ---  ┆ ---  ┆ ---  ┆ ---        │
│ i64  ┆ f64  ┆ str  ┆ date       │
╞══════╪══════╪══════╪════════════╡
│ 1    ┆ 4.1  ┆ x    ┆ 2022-01-01 │
│ 2    ┆ 5.2  ┆ y    ┆ 2022-02-01 │
│ 3    ┆ 6.3  ┆ z    ┆ 2022-03-01 │
└──────┴──────┴──────┴────────────┘

This is because ~cs.boolean() means all non-boolean columns.

The correct approach is to apply ~ to (cs.boolean().as_expr()):

df.select(~(cs.boolean().as_expr()))

shape: (3, 1)
┌───────┐
│ col5  │
│ ---   │
│ bool  │
╞═══════╡
│ false │
│ false │
│ false │
└───────┘

Here cs.boolean() represents all boolean columns and as_expr() converts it into an expr, so ~ acts on the expr rather than on the set of selected columns.

3. `codepanda`

In Pandas, the counterparts of Polars selectors are pd.DataFrame.select_dtypes() and pd.DataFrame.filter().

For example, to select all numeric columns, we can write:

df_pd = pd.DataFrame(data)

df_pd.select_dtypes("number")

   col1  col2
0     1   4.1
1     2   5.2
2     3   6.3

To select all columns whose names start with col while excluding the "col4" and "col5" columns, we can write:

df_pd.filter(regex="^col(?!4$|5$)", axis=1)

   col1  col2 col3
0     1   4.1    x
1     2   5.2    y
2     3   6.3    z

Notes

Note 1: Be aware that selectors do not provide + as a set operation. Attempting it raises the following error:

❌
df.select(cs.numeric() + cs.string())
# TypeError: unsupported operand type(s) for op: ('Selector' + 'Selector')

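Note 2: Polars recently refactored its internals heavily (roughly from v1.32.0 onward). The original _selector_proxy_ has been removed, and most of its functionality has moved to the Selector class.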

Code

Link to today's code.
