2025 iThome Ironman Contest

DAY 15

Software Development

Polars熊霸天下 series, part 15

[Day15] - Sorting

17th Ironman Contest python polars

Jerry Wu

2025-09-21 00:10:17

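Sorting is an important concept in Polars: once a data structure is known to be sorted (whether it is a pl.Series, a pl.DataFrame, or a pl.Expr), many operations on it can take faster paths.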

Today's outline:

Imports and setup
set_sorted()
pl.Series
pl.DataFrame
pl.Expr
codepanda

0. Imports and setup

import polars as pl


s = pl.Series("s", [2, 3, 1, None], dtype=pl.Int64)

shape: (4,)
Series: 's' [i64]
[
	2
	3
	1
	null
]

df = pl.DataFrame(
    {
        "col1": [3, 2, 2, 1],
        "col2": [6.0, 5.0, 7.0, 4.0],
        "col3": ["a", "c", "c", "b"],
    }
)

shape: (4, 3)
┌──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 │
│ ---  ┆ ---  ┆ ---  │
│ i64  ┆ f64  ┆ str  │
╞══════╪══════╪══════╡
│ 3    ┆ 6.0  ┆ a    │
│ 2    ┆ 5.0  ┆ c    │
│ 2    ┆ 7.0  ┆ c    │
│ 1    ┆ 4.0  ┆ b    │
└──────┴──────┴──────┘

df2 = pl.from_repr(
    """
    shape: (4, 3)
    ┌──────┬──────┬──────┐
    │ col1 ┆ col2 ┆ col3 │
    │ ---  ┆ ---  ┆ ---  │
    │ i64  ┆ f64  ┆ str  │
    ╞══════╪══════╪══════╡
    │ 12   ┆ 15.0 ┆ a    │
    │ 13   ┆ 14.0 ┆ c    │
    │ 12   ┆ 17.0 ┆ c    │
    │ 11   ┆ 16.0 ┆ b    │
    └──────┴──────┴──────┘
    """
)

shape: (4, 3)
┌──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 │
│ ---  ┆ ---  ┆ ---  │
│ i64  ┆ f64  ┆ str  │
╞══════╪══════╪══════╡
│ 12   ┆ 15.0 ┆ a    │
│ 13   ┆ 14.0 ┆ c    │
│ 12   ┆ 17.0 ┆ c    │
│ 11   ┆ 16.0 ┆ b    │
└──────┴──────┴──────┘

1. `set_sorted()`

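First, let's look at how Polars decides whether a data structure is sorted.

Both pl.Series and pl.DataFrame have a flags attribute that records whether the data is sorted. We use a pl.Series as the example: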

s.flags

{'SORTED_ASC': False, 'SORTED_DESC': False}

pl.Series.flags returns a dictionary with two keys, recording whether the series is sorted in ascending or in descending order.

We can call pl.Series.set_sorted() to tell Polars that a series is already sorted (the default is descending=False, i.e. ascending order):

❗
s1 = s.set_sorted()
s1.flags

{'SORTED_ASC': True, 'SORTED_DESC': False}

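Two points deserve special attention here:

After s1 = s.set_sorted(), SORTED_ASC is True for s1, but the flags of s itself are unchanged. This is consistent with the Polars principle that operations do not mutate the original data structure.
A closer look at s shows that it is not actually sorted. Because the flag is wrong, later computations can return incorrect results. For example, s1.max() returns 1: in an ascending series the maximum must come last, so Polars skips the actual computation and reports the last non-null element.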

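Besides inspecting the dictionary returned by flags, Polars provides pl.Series.is_sorted() for checking whether a pl.Series is sorted. For example, the following confirms again that s was not affected by s1 = s.set_sorted() and is still unsorted: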

s.is_sorted()

False

2. `pl.Series`

pl.Series.sort() sorts a series (the default is descending=False, i.e. ascending order):

s.sort()

shape: (4,)
Series: 's' [i64]
[
	null
	1
	2
	3
]

pl.Series.sort() has a nulls_last= parameter that defaults to False; setting it to True places missing values last:

s.sort(nulls_last=True)

shape: (4,)
Series: 's' [i64]
[
	1
	2
	3
	null
]

`pl.Series.search_sorted()`

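pl.Series.search_sorted() returns the index at which an element should be inserted to keep a sorted series in order. Its side= parameter controls where the element goes relative to equal values. For example, to find the insertion index for the value 2 with side="right":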

s.sort().search_sorted(2, "right")

Because side= is "right", the value 2 is inserted after the existing 2 in s.sort(), that is, at index 3 (zero-based): null, 1, and 2 come before it and 3 comes after it.

3. `pl.DataFrame`

pl.DataFrame.sort() sorts a dataframe (the default is descending=False, i.e. ascending order). For example, sorting by the "col1" column:

df.sort("col1")

shape: (4, 3)
┌──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 │
│ ---  ┆ ---  ┆ ---  │
│ i64  ┆ f64  ┆ str  │
╞══════╪══════╪══════╡
│ 1    ┆ 4.0  ┆ b    │
│ 2    ┆ 5.0  ┆ c    │
│ 2    ┆ 7.0  ┆ c    │
│ 3    ┆ 6.0  ┆ a    │
└──────┴──────┴──────┘

Besides columns, pl.DataFrame.sort() can also sort by an expr, for example:

df.sort(pl.col("col1").add(pl.col("col2").mul(2)))

shape: (4, 3)
┌──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 │
│ ---  ┆ ---  ┆ ---  │
│ i64  ┆ f64  ┆ str  │
╞══════╪══════╪══════╡
│ 1    ┆ 4.0  ┆ b    │
│ 2    ┆ 5.0  ┆ c    │
│ 3    ┆ 6.0  ┆ a    │
│ 2    ┆ 7.0  ┆ c    │
└──────┴──────┴──────┘

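When sorting by an expr, the sort key is not a column of the dataframe, so the user has to work out the result of the expr in their head, which is not easy for beginners.

pl.DataFrame.sort() can sort by several columns at once, and a list passed to descending= controls whether each column is sorted in ascending or descending order, for example: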

df.sort("col3", "col2", descending=[True, False])

shape: (4, 3)
┌──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 │
│ ---  ┆ ---  ┆ ---  │
│ i64  ┆ f64  ┆ str  │
╞══════╪══════╪══════╡
│ 2    ┆ 5.0  ┆ c    │
│ 2    ┆ 7.0  ┆ c    │
│ 1    ┆ 4.0  ┆ b    │
│ 3    ┆ 6.0  ┆ a    │
└──────┴──────┴──────┘

`pl.DataFrame.merge_sorted()`

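pl.DataFrame.merge_sorted() merges two dataframes that are each sorted in ascending order, and the merged dataframe is still sorted in ascending order; the two dataframes must have the same schema. For example, df.sort("col3") and df2.sort("col3") are both sorted in ascending order by the "col3" column and are merged on "col3":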

df.sort("col3").merge_sorted(df2.sort("col3"), key="col3")

shape: (8, 3)
┌──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 │
│ ---  ┆ ---  ┆ ---  │
│ i64  ┆ f64  ┆ str  │
╞══════╪══════╪══════╡
│ 3    ┆ 6.0  ┆ a    │
│ 12   ┆ 15.0 ┆ a    │
│ 1    ┆ 4.0  ┆ b    │
│ 11   ┆ 16.0 ┆ b    │
│ 2    ┆ 5.0  ┆ c    │
│ 2    ┆ 7.0  ┆ c    │
│ 13   ┆ 14.0 ┆ c    │
│ 12   ┆ 17.0 ┆ c    │
└──────┴──────┴──────┘

If the dataframes are not both sorted in ascending order by the key, the result is not meaningful:

❗
df.merge_sorted(df2.sort("col3"), key="col3")

shape: (8, 3)
┌──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 │
│ ---  ┆ ---  ┆ ---  │
│ i64  ┆ f64  ┆ str  │
╞══════╪══════╪══════╡
│ 3    ┆ 6.0  ┆ a    │
│ 12   ┆ 15.0 ┆ a    │
│ 11   ┆ 16.0 ┆ b    │
│ 2    ┆ 5.0  ┆ c    │
│ 2    ┆ 7.0  ┆ c    │
│ 1    ┆ 4.0  ┆ b    │
│ 13   ┆ 14.0 ┆ c    │
│ 12   ┆ 17.0 ┆ c    │
└──────┴──────┴──────┘

4. `pl.Expr`

pl.Expr.sort() sorts a single expr on its own. For example, sorting only the "col2" column:

df.with_columns(pl.col("col2").sort().alias("sorted_col2"))

shape: (4, 4)
┌──────┬──────┬──────┬─────────────┐
│ col1 ┆ col2 ┆ col3 ┆ sorted_col2 │
│ ---  ┆ ---  ┆ ---  ┆ ---         │
│ i64  ┆ f64  ┆ str  ┆ f64         │
╞══════╪══════╪══════╪═════════════╡
│ 3    ┆ 6.0  ┆ a    ┆ 4.0         │
│ 2    ┆ 5.0  ┆ c    ┆ 5.0         │
│ 2    ┆ 7.0  ┆ c    ┆ 6.0         │
│ 1    ┆ 4.0  ┆ b    ┆ 7.0         │
└──────┴──────┴──────┴─────────────┘

pl.Expr.sort_by() sorts a single expr according to the order of other columns. For example, sorting only the "col2" column by the order of the "col3" column:

(
    df.with_columns(
        pl.col("col2").sort_by("col3").alias("sorted_col2_by_col3")
    )
)

shape: (4, 4)
┌──────┬──────┬──────┬─────────────────────┐
│ col1 ┆ col2 ┆ col3 ┆ sorted_col2_by_col3 │
│ ---  ┆ ---  ┆ ---  ┆ ---                 │
│ i64  ┆ f64  ┆ str  ┆ f64                 │
╞══════╪══════╪══════╪═════════════════════╡
│ 3    ┆ 6.0  ┆ a    ┆ 6.0                 │
│ 2    ┆ 5.0  ┆ c    ┆ 4.0                 │
│ 2    ┆ 7.0  ┆ c    ┆ 5.0                 │
│ 1    ┆ 4.0  ┆ b    ┆ 7.0                 │
└──────┴──────┴──────┴─────────────────────┘

5. `codepanda`

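In Pandas, a pd.Series or pd.DataFrame is sorted by index with .sort_index() or by column with .sort_values(). Polars instead provides sorting operations for each of pl.Series, pl.DataFrame, and pl.Expr. pl.Expr.sort() is the unusual one: it sorts only that expr and leaves the other columns untouched, so users coming from Pandas may need some time to get used to it.

In addition, the two libraries use different parameter names for the sort order: Pandas uses ascending=True, whereas Polars uses descending=False.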