iT邦幫忙

2025 iThome 鐵人賽

DAY 27
2

https://ithelp.ithome.com.tw/upload/images/20250902/20177885tHabwAAkv4.png

最後一個關於效能的主題,我們要來看的是在 PostgreSQL 中不同的 JOIN 方式。JOIN 是將多個表的資料關聯起來的方式,而 PostgreSQL 針對不同的查詢場景,提供了多種 JOIN 策略,包括 Nested Loop Join、Hash Join 和 Merge Join。不同的 JOIN 策略會影響查詢的效能。

之前在使用 EXPLAIN ANALAZE 的時候,可能就有觀察到不同 JOIN 出現在 query planner 中。今天的文章就會透過 EXPLAIN ANALYZE 來觀察不同 JOIN 的執行方式,並了解它們的適用情境。

Nested Loop Join

第一個要介紹的是 Nested Loop Join,如果是採用這個方式,PostgreSQL 會對 left attribute 逐行掃描,並使用 JOIN 條件去 right attribute 查找對應的值。這種方式對於小表 JOIN 大表且 right attribute 有 Index 的情境效率較好。

https://ithelp.ithome.com.tw/upload/images/20250820/201778858xupHFgQL0.png

觀察 Nested Loop Join

情境:小表 JOIN 大表且 right attribute 有 Index

  1. 建立兩個 table,departments 可以被當作小表,employees 是大表
CREATE TABLE departments (
	department_id SERIAL PRIMARY KEY,
	department_name VARCHAR(100),
	location VARCHAR(100)
);

CREATE TABLE employees (
	employee_id SERIAL PRIMARY KEY,
	department_id INT,
	first_name VARCHAR(50),
	last_name VARCHAR(50),
	email VARCHAR(100)
);
  1. 匯入資料 - departments (小表 5 筆資料)
departments = [
    ("Human Resources", "New York"),
    ("Engineering", "San Francisco"),
    ("Marketing", "Chicago"),
    ("Sales", "Los Angeles"),
    ("Finance", "Boston"),
]

cursor.executemany(
    """
    INSERT INTO departments (department_name, location)
    VALUES (%s, %s)
""",
    departments,
)
  1. 匯入資料 - employees(大表 10000 筆資料)
employee_data = []
for _ in range(10000):
    first_name = fake.first_name()
    last_name = fake.last_name()
    email = f"{first_name.lower()}.{last_name.lower()}@company.com"

    employee_data.append(
        (
            random.randint(1, 5),  # 隨機分配一個 department
            first_name,
            last_name,
            email,
        )
    )

cursor.executemany(
    """
    INSERT INTO employees (department_id, first_name, last_name, email)
    VALUES (%s, %s, %s, %s)
""",
    employee_data,
)
  1. 觀察 SQL
EXPLAIN ANALYZE
SELECT * FROM departments d
JOIN employees e ON d.department_id = e.department_id
WHERE d.department_id = 1;

https://ithelp.ithome.com.tw/upload/images/20250820/20177885K9Gp67VkI3.png

因為 departments 資料較少,employees 資料較多,employees 可以當成 left attribute,逐行掃描後用 department_id 去找對應的資料。

而且 departmentsdepartment_id 有 Index (Primary key),所以還可以看到是使用 Index Scan 讓查詢速度更快。也對應到文件上所說的,想象中逐一掃描 employees 的每一行對應 departments 的哪一個位置好像會很費時,但如果對應的欄位能夠 Index Scan,那這就是一個適合的策略。

The right relation is scanned once for every row found in the left relation. This strategy is easy to implement but can be very time consuming. (However, if the right relation can be scanned with an index scan, this can be a good strategy. It is possible to use values from the current row of the left relation as keys for the index scan of the right.)

Hash Join

第二種方法是 Hash Join ,Hash Join 會先掃描 right attribute,將 key-value 組成 Hash Table 儲存在記憶體中。接著掃描左表並使用 Hash Table 來查找對應值。通常發生在大表 JOIN 大表,但沒有 Index 的情況。

The right relation is first scanned and loaded into a hash table, using its join attributes as hash keys. Next the left relation is scanned and the appropriate values of every row found are used as hash keys to locate the matching rows in the table.

沒有 Index 所以將右表變成 Hash Table,左表在比對的時候才能找得比較快:

Key Value
1 {"department_id": 1, "department_name": "Human Resources", "location": "New York"}
2 {"department_id": 2, "department_name": "Engineering", "location": "San Francisco"}
3 {"department_id": 3, "department_name": "Marketing", "location": "Chicage"}
4 {"department_id": 4, "department_name": "Sales", "location": "Los Angeles"}
5 {"department_id": 5, "department_name": "Finance", "location": "Boston"}

觀察 Hash Join

情境:大表 JOIN 大表,但沒有 Index

  1. 建立兩個 table,並且都沒有設定任何 Index
CREATE TABLE orders(
	id INT,
	final_status TEXT
);

CREATE TABLE order_history(
	id INT,
	status TEXT
);
  1. 匯入資料,在 orders & order_history 都各塞入 10000 筆資料
# orders 塞入 10000 筆資料
customers = [
    (i, fake.random_element(elements=("pending", "completed", "cancelled")))
    for i in range(1, 10001)
]

cursor.executemany(
    """
    INSERT INTO orders (id, final_status) VALUES (%s, %s)
""",
    customers,
)

# order history 塞入 10000 筆資料
orders = [
    (
        i,
        fake.random_element(elements=("pending", "completed", "cancelled")),
    ) 
    for i in range(1, 10001)
]
cursor.executemany(
    """
    INSERT INTO order_history (id, status) VALUES (%s, %s)
""",
    orders,
)
  1. 觀察 SQL
EXPLAIN ANALYZE
SELECT *
FROM public.orders
JOIN public.order_history
ON orders.id = order_history.id;

https://ithelp.ithome.com.tw/upload/images/20250820/20177885epbbYa57od.png

因為 ordersorder_history 沒有 Index,因此 PostgreSQL 在這裡選擇 Hash Join。

Merge Join

第三個方法是 Merge Join ,Merge Join 會先對兩張 table JOIN 條件的欄位進行排序,然後比對及合併兩張表。適用於兩張表都已排序或有 Index、大表 JOIN 大表且有 Index。

Each relation is sorted on the join attributes before the join starts. Then the two relations are scanned in parallel, and matching rows are combined to form join rows. This kind of join is attractive because each relation has to be scanned only once. The required sorting might be achieved either by an explicit sort step, or by scanning the relation in the proper order using an index on the join key.

https://ithelp.ithome.com.tw/upload/images/20250820/20177885g6bLwdeosq.png

觀察 Merge Join

情境:兩張表都已排序或有 Index、大表 JOIN 大表且有 Index

  1. 延續 Hash Join 的情境,我們試著將 Join 的欄位加上 Index
CREATE INDEX idx_orders_id ON orders(id);
CREATE INDEX idx_order_history_id ON order_history(id);
  1. 觀察 SQL
EXPLAIN ANALYZE
SELECT *
FROM public.orders
JOIN public.order_history
ON orders.id = order_history.id
ORDER BY orders.id, order_history.id; -- 加上排序的選項

https://ithelp.ithome.com.tw/upload/images/20250820/20177885ncu30IeGIL.png

發現這裡的 orders.idorder_history.id 都有 Index 後,資料庫能直接利用它來取得已排序的結果,所以接著只要再使用 Merge Join 來得到最終的資料。

重點回顧

  • Nested Loop Join 適合小表 JOIN 大表,且大表有 Index。
  • Hash Join 適合沒有 Index 的大表,利用 Hash Table 來加快比對速度。
  • Merge Join 適合有排序或 Index 的情境,可透過有順序地掃描提高效率。
  • 透過 EXPLAIN ANALYZE 來分析 query plan,可以了解 PostgreSQL 選擇的 JOIN 策略。

參考資料

https://www.postgresql.org/docs/current/planner-optimizer.html#PLANNER-OPTIMIZER-GENERATING-POSSIBLE-PLANS
https://gelovolro.medium.com/what-are-merge-join-hash-join-and-nested-loop-example-in-postgresql-29123ca18fd1


上一篇
Day 26 - Slow Query:找出拖慢系統的 SQL 查詢
系列文
PostgreSQL 效能優化 30 天挑戰:從 Index 到 Transaction 的深入探索27
圖片
  熱門推薦
圖片
{{ item.channelVendor }} | {{ item.webinarstarted }} |
{{ formatDate(item.duration) }}
直播中

尚未有邦友留言

立即登入留言