Day 27 - PostgreSQL 的 JOIN 機制 - iT 邦幫忙::一起幫忙解決難題，拯救 IT 人的一天

2025 iThome 鐵人賽

DAY 27

Software Development

PostgreSQL 效能優化 30 天挑戰：從 Index 到 Transaction 的深入探索系列第 27 篇

Day 27 - PostgreSQL 的 JOIN 機制

17th鐵人賽

claudiateng

團隊籠中鳥

2025-09-27 08:57:37

124 瀏覽

分享至

最後一個關於效能的主題，我們要來看的是在 PostgreSQL 中不同的 JOIN 方式。JOIN 是將多個表的資料關聯起來的方式，而 PostgreSQL 針對不同的查詢場景，提供了多種 JOIN 策略，包括 Nested Loop Join、Hash Join 和 Merge Join。不同的 JOIN 策略會影響查詢的效能。

之前在使用 EXPLAIN ANALAZE 的時候，可能就有觀察到不同 JOIN 出現在 query planner 中。今天的文章就會透過 EXPLAIN ANALYZE 來觀察不同 JOIN 的執行方式，並了解它們的適用情境。

Nested Loop Join

第一個要介紹的是 Nested Loop Join，如果是採用這個方式，PostgreSQL 會對 left attribute 逐行掃描，並使用 JOIN 條件去 right attribute 查找對應的值。這種方式對於小表 JOIN 大表且 right attribute 有 Index 的情境效率較好。

觀察 Nested Loop Join

情境：小表 JOIN 大表且 right attribute 有 Index

建立兩個 table，departments 可以被當作小表，employees 是大表

CREATE TABLE departments (
	department_id SERIAL PRIMARY KEY,
	department_name VARCHAR(100),
	location VARCHAR(100)
);

CREATE TABLE employees (
	employee_id SERIAL PRIMARY KEY,
	department_id INT,
	first_name VARCHAR(50),
	last_name VARCHAR(50),
	email VARCHAR(100)
);

匯入資料 - departments （小表 5 筆資料）

departments = [
    ("Human Resources", "New York"),
    ("Engineering", "San Francisco"),
    ("Marketing", "Chicago"),
    ("Sales", "Los Angeles"),
    ("Finance", "Boston"),
]

cursor.executemany(
    """
    INSERT INTO departments (department_name, location)
    VALUES (%s, %s)
""",
    departments,
)

匯入資料 - employees（大表 10000 筆資料）

employee_data = []
for _ in range(10000):
    first_name = fake.first_name()
    last_name = fake.last_name()
    email = f"{first_name.lower()}.{last_name.lower()}@company.com"

    employee_data.append(
        (
            random.randint(1, 5),  # 隨機分配一個 department
            first_name,
            last_name,
            email,
        )
    )

cursor.executemany(
    """
    INSERT INTO employees (department_id, first_name, last_name, email)
    VALUES (%s, %s, %s, %s)
""",
    employee_data,
)

觀察 SQL

EXPLAIN ANALYZE
SELECT * FROM departments d
JOIN employees e ON d.department_id = e.department_id
WHERE d.department_id = 1;

因為 departments 資料較少，employees 資料較多，employees 可以當成 left attribute，逐行掃描後用 department_id 去找對應的資料。

而且 departments 的 department_id 有 Index (Primary key)，所以還可以看到是使用 Index Scan 讓查詢速度更快。也對應到文件上所說的，想象中逐一掃描 employees 的每一行對應 departments 的哪一個位置好像會很費時，但如果對應的欄位能夠 Index Scan，那這就是一個適合的策略。

The right relation is scanned once for every row found in the left relation. This strategy is easy to implement but can be very time consuming. (However, if the right relation can be scanned with an index scan, this can be a good strategy. It is possible to use values from the current row of the left relation as keys for the index scan of the right.)

Hash Join

第二種方法是 Hash Join ，Hash Join 會先掃描 right attribute，將 key-value 組成 Hash Table 儲存在記憶體中。接著掃描左表並使用 Hash Table 來查找對應值。通常發生在大表 JOIN 大表，但沒有 Index 的情況。

The right relation is first scanned and loaded into a hash table, using its join attributes as hash keys. Next the left relation is scanned and the appropriate values of every row found are used as hash keys to locate the matching rows in the table.

沒有 Index 所以將右表變成 Hash Table，左表在比對的時候才能找得比較快：

Key	Value
1	`{"department_id": 1, "department_name": "Human Resources", "location": "New York"}`
2	`{"department_id": 2, "department_name": "Engineering", "location": "San Francisco"}`
3	`{"department_id": 3, "department_name": "Marketing", "location": "Chicage"}`
4	`{"department_id": 4, "department_name": "Sales", "location": "Los Angeles"}`
5	`{"department_id": 5, "department_name": "Finance", "location": "Boston"}`

觀察 Hash Join

情境：大表 JOIN 大表，但沒有 Index

建立兩個 table，並且都沒有設定任何 Index

CREATE TABLE orders(
	id INT,
	final_status TEXT
);

CREATE TABLE order_history(
	id INT,
	status TEXT
);

匯入資料，在 orders & order_history 都各塞入 10000 筆資料

# orders 塞入 10000 筆資料
customers = [
    (i, fake.random_element(elements=("pending", "completed", "cancelled")))
    for i in range(1, 10001)
]

cursor.executemany(
    """
    INSERT INTO orders (id, final_status) VALUES (%s, %s)
""",
    customers,
)

# order history 塞入 10000 筆資料
orders = [
    (
        i,
        fake.random_element(elements=("pending", "completed", "cancelled")),
    ) 
    for i in range(1, 10001)
]
cursor.executemany(
    """
    INSERT INTO order_history (id, status) VALUES (%s, %s)
""",
    orders,
)

觀察 SQL

EXPLAIN ANALYZE
SELECT *
FROM public.orders
JOIN public.order_history
ON orders.id = order_history.id;

因為 orders 和 order_history 沒有 Index，因此 PostgreSQL 在這裡選擇 Hash Join。

Merge Join

第三個方法是 Merge Join ，Merge Join 會先對兩張 table JOIN 條件的欄位進行排序，然後比對及合併兩張表。適用於兩張表都已排序或有 Index、大表 JOIN 大表且有 Index。

Each relation is sorted on the join attributes before the join starts. Then the two relations are scanned in parallel, and matching rows are combined to form join rows. This kind of join is attractive because each relation has to be scanned only once. The required sorting might be achieved either by an explicit sort step, or by scanning the relation in the proper order using an index on the join key.

觀察 Merge Join

情境：兩張表都已排序或有 Index、大表 JOIN 大表且有 Index

延續 Hash Join 的情境，我們試著將 Join 的欄位加上 Index

CREATE INDEX idx_orders_id ON orders(id);
CREATE INDEX idx_order_history_id ON order_history(id);

觀察 SQL

EXPLAIN ANALYZE
SELECT *
FROM public.orders
JOIN public.order_history
ON orders.id = order_history.id
ORDER BY orders.id, order_history.id; -- 加上排序的選項