Day-15 | Postgres Query plan - join strategies - iT 邦幫忙::一起幫忙解決難題，拯救 IT 人的一天

2024 iThome 鐵人賽

DAY 15

Software Development

埋藏在後端工程下的地雷與寶藏系列第 15 篇

Day-15 | Postgres Query plan - join strategies

16th鐵人賽 database backend

Yusinz

團隊菜很長團

2024-09-29 21:16:17

373 瀏覽

分享至

今天我們延續昨天，接著來看看 join strategies 的部分

Nested Loop Join

Nested Loop Join 是所有策略中最簡單且最通用的一種。
在這種策略中，PostgreSQL 會依照順序掃描 outer relation，對於 outer relation 中的每一列結果，它會掃描 inner relation 來尋找對應的資料列。時間複雜度是 O(N*M)。

Nested Loop Join 中索引的作用
由於 outer relation 是依照順序掃描，因此 outer relation 上的索引不會對查詢有幫助。但如果在inner relation 的連接鍵上有索引，則可以大大加速 Nested Loop Join 的效率。
Nested Loop Join 的使用場景
當 outer relation 資料量較小時，速度會比較快，因為 inner relation 的循環次數不會太多。但當 outer relation 資料量較大時，即使 inner relation 上有索引，Nested Loop Join 的效率也不高，因此他較為適用資料量較小的時候。
一定會使用的情況
如果連接條件中不使用 = 運算符，Nested Loop Join 是唯一可以使用的策略。
因為其他兩個策略都無法使用。

我用兩張簡單的 table 來做示範
table 1

ithome=# select * from users
;
 id |  name   | age
----+---------+-----
  1 | Alice   |  30
  2 | Bob     |  25
  3 | Charlie |  35
  4 | David   |  28

table 2

ithome=# select * from orders;
 id | user_id | product_name |         order_date
----+---------+--------------+----------------------------
  1 |       1 | Laptop       | 2024-09-28 13:41:15.811864
  2 |       2 | Smartphone   | 2024-09-28 13:41:15.811864

發現就算是只有兩張簡單的 table ，Postgres 自己還是會使用 Hash join。可見 Nested Loop Join 真的基本到不行。

ithome=# EXPLAIN ANALYZE
SELECT u.name, o.product_name, o.order_date
FROM users u
LEFT JOIN orders o ON u.id = o.user_id;
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Hash Right Join  (cost=1.09..2.20 rows=8 width=232) (actual time=0.414..0.425 rows=4 loops=1)
   Hash Cond: (o.user_id = u.id)
   ->  Seq Scan on orders o  (cost=0.00..1.08 rows=8 width=230) (actual time=0.114..0.115 rows=2 loops=1)
   ->  Hash  (cost=1.04..1.04 rows=4 width=10) (actual time=0.211..0.212 rows=4 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  Seq Scan on users u  (cost=0.00..1.04 rows=4 width=10) (actual time=0.184..0.188 rows=4 loops=1)
 Planning Time: 1.370 ms
 Execution Time: 0.766 ms

但若是我們沒有將 key 透過 = 連接起來，就無法使用另外兩種策略，就會如以下的結果

ithome=# EXPLAIN ANALYZE
SELECT u.name, o.product_name, o.order_date
FROM users u
LEFT JOIN orders o ON u.name = 'Bob';
                                                   QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Nested Loop Left Join  (cost=0.00..2.62 rows=8 width=232) (actual time=0.258..0.262 rows=5 loops=1)
   Join Filter: ((u.name)::text = 'Bob'::text)
   Rows Removed by Join Filter: 6
   ->  Seq Scan on users u  (cost=0.00..1.04 rows=4 width=6) (actual time=0.127..0.128 rows=4 loops=1)
   ->  Materialize  (cost=0.00..1.12 rows=8 width=226) (actual time=0.028..0.029 rows=2 loops=4)
         ->  Seq Scan on orders o  (cost=0.00..1.08 rows=8 width=226) (actual time=0.083..0.085 rows=2 loops=1)
 Planning Time: 0.981 ms
 Execution Time: 0.474 ms
(8 rows)

我們也可以設定禁用另外兩種來達到使用 Nested join

SET enable_hashjoin = off;
SET enable_mergejoin = off;

Hash Join

PostgreSQL 會順序掃描 inner relation，並基於所有使用 = 運算符的連接鍵建立一個雜湊表。接著，它會照順序掃描 outer relation，並對每一列的結果在雜湊表中進行配對，以尋找符合的 join key。

這種策略有點類似於 Nested Loop Join。只是多了一個建立雜湊表的步驟，但在雜湊表中查找資料的速度比遍歷 inner join 快得多。

Hash Join 中的索引作用
由於在 Hash Join 中，兩張 table 都是照順序掃描的方式進行，因此連接條件上的索引並不會對雜湊連接有幫助。
Hash Join 的使用場景
Hash Join 適用於參與 join 的 table 都不算小的情況。說是這樣說但我們可以看到剛剛的例子，就算一張 table 只有 4筆資料，另一張只有 2筆，Planner 還是選擇了 Hash join，可以看出他的效能之高。如果雜湊表過大，PostgreSQL 會將雜湊表分成多個批次，並儲存在磁碟中，就有可能會影響效能，這種情況下，Planner 可能會選擇其他的連接策略，比如 Merge Join。

需要注意的是，雜湊表的查找只適用於連接條件中包含 = 運算符的情況。

Merge join

PostgreSQL 會選擇所有使用 = 運算符的連接條件，並根據這些 join key 對兩個 table 進行排序，這代表這些 join key 需要可排序，接著它會同時遍歷兩個已排序的 table 已找出對應的欄位。

Merge join 索引的作用
如果 join key 上有索引，可以加速排序的過程。因此，兩個表的 join key 上都有索引的話，速度會更快。
Merge join 的使用場景
當 join 的表都非常大時，Hash join 需要多個雜湊表，進而可能影響效能，這時候 Merge join 就有機會出場了。因此，Merge join 是 join 大資料表時較好的選擇。

與 Hash join 類似，Merge join 只適用於至少有一個使用 = 運算符的連接條件。

以上就是 join strategies 的介紹!