[Day05] - pl.col - iT 邦幫忙

2025 iThome Ironman Contest (鐵人賽)

DAY 5

Software Development

Part 5 of the series "Polars熊霸天下"


17th Ironman Contest (鐵人賽) python polars

Jerry Wu

2025-09-11 00:00:40


Today we look at pl.col, as groundwork before learning about expressions (expr).

Today's outline:

Imports and setup
Classic usage
Convenient usage
codepanda

0. Imports and setup

import pandas as pd
import polars as pl

data = {"col1": [1, 2, 3], "col2": ["x", "y", "z"]}
df = pl.DataFrame(data)

shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ i64  ┆ str  │
╞══════╪══════╡
│ 1    ┆ x    │
│ 2    ┆ y    │
│ 3    ┆ z    │
└──────┴──────┘

How it works

pl.col is not a function but an instance of the Col class, as the source code shows:

# polars/py-polars/polars/functions/col.py

class Col:
    def __call__(...) -> Expr:
        return _create_col(name, *more_names)


col: Col = Col()

Calling pl.col actually invokes Col.__call__(), which in turn calls the internal function _create_col() to build an expr. Because pl.col is a callable, in practice we can treat it as a function.

1. Classic usage

To refer to a single column, pass its name to pl.col. For example, to refer to the "col1" column:

pl.col("col1")

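This produces an expr that represents the "col1" column. An expr only takes effect inside a context, so the examples below use the pl.DataFrame.select() context.

Single column

Placing pl.col("col1") inside pl.DataFrame.select() actually selects that column: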

df.select(pl.col("col1"))

shape: (3, 1)
┌──────┐
│ col1 │
│ ---  │
│ i64  │
╞══════╡
│ 1    │
│ 2    │
│ 3    │
└──────┘

Multiple columns

To select multiple columns, pass several column names at once. For example, to select both "col1" and "col2":

df.select(pl.col("col1", "col2"))

Or pass a single list of column names:

df.select(pl.col(["col1", "col2"]))

Both forms produce the same result.

shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ i64  ┆ str  │
╞══════╪══════╡
│ 1    ┆ x    │
│ 2    ┆ y    │
│ 3    ┆ z    │
└──────┴──────┘

Mixing the two forms, however, is not allowed:

❌
df.select(pl.col("col1", ["col2"]))
# TypeError: argument 'names': 'list' object cannot be converted to 'PyString'

All columns

To select every column in the context at once, use the wildcard "*":

df.select(pl.col("*"))

Or use pl.all(), a shortcut function provided by Polars (Note 1):

df.select(pl.all())

shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ i64  ┆ str  │
╞══════╪══════╡
│ 1    ┆ x    │
│ 2    ┆ y    │
│ 3    ┆ z    │
└──────┴──────┘

Both select all columns.

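Regular expressions

Polars regular expressions differ slightly from pure Python's; refer to the documentation of Rust's regex crate for the syntax.

To select every column whose name starts with "col" (pl.col treats a string as a regular expression only when it starts with ^ and ends with $):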

df.select(pl.col("^col.*$"))

shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ i64  ┆ str  │
╞══════╪══════╡
│ 1    ┆ x    │
│ 2    ┆ y    │
│ 3    ┆ z    │
└──────┴──────┘

Selecting by data type

To select all columns of a given data type, for example every pl.String column:

df.select(pl.col(pl.String))

shape: (3, 1)
┌──────┐
│ col2 │
│ ---  │
│ str  │
╞══════╡
│ x    │
│ y    │
│ z    │
└──────┘

Selecting by several data types at once is also supported, for example all pl.String and pl.Int64 columns:

df.select(pl.col(pl.String, pl.Int64))

shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ i64  ┆ str  │
╞══════╪══════╡
│ 1    ┆ x    │
│ 2    ┆ y    │
│ 3    ┆ z    │
└──────┴──────┘

2. Convenient usage

Polars also lets you select a column with . attribute access. The source code shows that Col.__getattr__ also calls _create_col() under the hood:

# polars/py-polars/polars/functions/col.py

class Col:
    def __getattr__(self, name: str) -> Expr:
        ...
        return _create_col(name)
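The pattern of one object that supports both call syntax and attribute syntax can be reproduced in a few lines of plain Python. The sketch below is a toy model, not the Polars implementation: ToyExpr, ToyCol, toy_col and _toy_create_col are hypothetical names, and the guard against double-underscore names is an assumption added for safety rather than something taken from the Polars source.

```python
class ToyExpr:
    """Hypothetical stand-in for an expression: it only remembers column names."""

    def __init__(self, names):
        self.names = list(names)


def _toy_create_col(name, *more_names):
    # Shared helper that both entry points delegate to.
    return ToyExpr([name, *more_names])


class ToyCol:
    def __call__(self, name, *more_names):
        # Call syntax: toy_col("a") or toy_col("a", "b")
        return _toy_create_col(name, *more_names)

    def __getattr__(self, name):
        # Attribute syntax: toy_col.a
        # __getattr__ runs only when normal attribute lookup fails.
        if name.startswith("__"):
            # Assumption: leave dunder lookups (copy, pickle, ...) alone.
            raise AttributeError(name)
        return _toy_create_col(name)


# A single module-level instance plays the role of the "function".
toy_col = ToyCol()

print(toy_col("a").names, toy_col.a.names, toy_col("a", "b").names)
```

Note that the attribute form can only ever name one column, which is one way the call form is more expressive.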

To select the "col1" column:

df.select(pl.col.col1)

shape: (3, 1)
┌──────┐
│ col1 │
│ ---  │
│ i64  │
╞══════╡
│ 1    │
│ 2    │
│ 3    │
└──────┘

To select both the "col1" and "col2" columns:

df.select(pl.col.col1, pl.col.col2)

# or
df.select([pl.col.col1, pl.col.col2])

shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ i64  ┆ str  │
╞══════╪══════╡
│ 1    ┆ x    │
│ 2    ┆ y    │
│ 3    ┆ z    │
└──────┴──────┘

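However, the API documentation states explicitly that the pl.col(...) function call is the idiomatic form, and that attribute access is best kept to quick prototyping: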

The function call syntax is considered the idiomatic way of constructing a column expression. The alternative attribute syntax can be useful for quick prototyping as it can save some keystrokes, but has drawbacks in both expressiveness and readability.

3. `codepanda`

The pandas counterpart of pl.col is a callable, usually written as a lambda. For example, to add 1 to the "col1" column, use pd.DataFrame.assign() with a lambda:

df_pd = pd.DataFrame(data)

df_pd.assign(col1=lambda df_: df_.col1.add(1))

   col1 col2
0     2    x
1     3    y
2     4    z

Notes

Note 1: Polars provides many other shortcut functions. For example, to sum the "col1" column, besides pl.col("col1").sum() you can use the shortcut pl.sum("col1"):

(
    df.select(
        pl.col("col1").sum().alias("pl_col_sum"),
        pl.sum("col1").alias("pl_sum")
    )
)

shape: (1, 2)
┌────────────┬────────┐
│ pl_col_sum ┆ pl_sum │
│ ---        ┆ ---    │
│ i64        ┆ i64    │
╞════════════╪════════╡
│ 6          ┆ 6      │
└────────────┴────────┘