[Day 14] Data Chunking and Embedding Cost Estimation

2025 iThome Ironman Contest

DAY 14

AI & Data

When Notion Meets LLM: Building My AI Knowledge Management System in 30 Days, series part 14

17th Ironman Contest embedding chunking openai text-embedding-3-small

Nikki Chen

Team 三陳牛肉吉事堡

2025-09-28 21:51:54


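In Day 13 we covered chunking strategy: splitting Notion notes into text segments of a suitable size so they can be sent to an embedding model and converted into vectors.
Today we will implement two things:

Fetch data from SQLite and chunk it
Introduce the embedding model and estimate its cost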

1. Fetching Data from SQLite

From Day 11 to Day 13 we wrote the Notion notes into SQLite (notion.db) and split them into three tables according to the ERD:

notion_databases
notion_pages
notion_blocks

Of these, notion_blocks.block_text is our main target, because that text is the core content of the notes.
Next, we pull out these block_text values and prepare them for chunking.

1.1 Code: `src/fetch_notion_blocks.py`

import sqlite3

def fetch_blocks(db_path="data/notion.db", limit=1000):
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()

    cur.execute("""
        SELECT block_id, page_id, block_text
        FROM notion_blocks
        WHERE block_text IS NOT NULL AND TRIM(block_text) <> ''
        LIMIT ?
    """, (limit,))

    rows = cur.fetchall()
    conn.close()

    return [{"block_id": r[0], "page_id": r[1], "text": r[2]} for r in rows]

if __name__ == "__main__":
    blocks = fetch_blocks(limit=10)
    for b in blocks:
        print(b)

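1.2 Technical Notes

Connection and query
- Uses Python's built-in sqlite3 module to connect to data/notion.db.
- Queries the notion_blocks table for block_id, page_id, and block_text.
- Filters out NULL and empty (or whitespace-only) strings so that only blocks with content are processed.
Data structure conversion
- The query result (fetchall) is a list of tuples.
- Each row is converted to a Python dict so the later chunking and embedding steps can use it easily:
```
{
  "block_id": "xxx",
  "page_id": "yyy",
  "text": "Block text content"
}
```
Modular design
- Wrapped in a fetch_blocks() function so other modules can call it (for example, chunk_block_text.py).
- db_path and limit are parameters, which adds flexibility.
Testing and verification
- The if __name__ == "__main__": block provides a quick test.
- It prints the first 10 rows so the data can be checked.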
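
As an alternative to converting tuples by index, the standard-library `sqlite3.Row` row factory lets columns be read by name. The sketch below is a variant of `fetch_blocks()` under that assumption; the function name `fetch_blocks_by_name` and the `AS text` alias are illustrative and not part of the original project.

```python
import sqlite3
from contextlib import closing


def fetch_blocks_by_name(db_path="data/notion.db", limit=1000):
    """Variant of fetch_blocks() that reads columns by name (illustrative)."""
    # closing() guarantees the connection is closed even if the query raises.
    with closing(sqlite3.connect(db_path)) as conn:
        conn.row_factory = sqlite3.Row  # rows now support row["column_name"]
        rows = conn.execute(
            """
            SELECT block_id, page_id, block_text AS text
            FROM notion_blocks
            WHERE block_text IS NOT NULL AND TRIM(block_text) <> ''
            LIMIT ?
            """,
            (limit,),
        ).fetchall()
    # dict(row) maps column names (including the alias) to their values.
    return [dict(row) for row in rows]
```

The returned dicts have the same keys as the original function, so downstream code would not need to change.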

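2. Chunking Strategy

In Day 13 we introduced the chunking principles:

Use the Notion block as the base unit: this avoids cutting sentences apart.
Split long text further: if a block exceeds 500 characters, cut it into segments by length.
Add overlap: each segment keeps a 10-20% overlap to avoid breaking the meaning.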
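
The three rules above can be sketched as a single function. This is a minimal illustration, not the code used later in this article: the name `split_block` and the defaults `max_chars=500` and `overlap_ratio=0.15` are assumptions chosen to match the numbers in the list.

```python
def split_block(text, max_chars=500, overlap_ratio=0.15):
    """Split one block's text following the three rules (illustrative sketch)."""
    # Rule 1: a block that fits stays whole, so short notes are never cut.
    if len(text) <= max_chars:
        return [text]

    # Rule 3: derive the overlap from the chunk size (15% of 500 = 75 chars).
    overlap = round(max_chars * overlap_ratio)
    step = max_chars - overlap  # how far the window advances each time

    # Rule 2: slide a fixed-size window over the long text.
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the window already reached the end of the text
    return chunks
```

With these defaults, a 1,000-character block produces three chunks starting at offsets 0, 425, and 850, and each chunk repeats the last 75 characters of the previous one.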

2.1 Code: `src/chunk_block_text.py`

from fetch_notion_blocks import fetch_blocks

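def chunk_text(text, chunk_size=800, overlap=100):
    """
    Split text into chunks so each stays under the embedding token limit.
    - chunk_size: maximum length of each chunk (in characters)
    - overlap: characters shared by adjacent chunks, to avoid breaking the meaning
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # reached the end; avoids a redundant tail chunk
        start = end - overlap  # step back to keep the overlap
    return chunks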


def fetch_and_chunk(db_path="data/notion.db", limit=100):
    """
    Fetch blocks from SQLite, then chunk them.
    Output format:
    [
      {
        "block_id": "...",
        "page_id": "...",
        "chunk_id": "<block_id>-<chunk index>",
        "text": "chunk content"
      }
    ]
    """
    blocks = fetch_blocks(db_path=db_path, limit=limit)
    all_chunks = []

    for b in blocks:
        pieces = chunk_text(b["text"])
        for idx, p in enumerate(pieces):
            all_chunks.append({
                "block_id": b["block_id"],
                "page_id": b["page_id"],
                "chunk_id": f"{b['block_id']}-{idx}",
                "text": p
            })

    return all_chunks


if __name__ == "__main__":
    chunks = fetch_and_chunk(limit=10)
    for c in chunks:
        print(c["chunk_id"], c["block_id"], c["text"][:80], "...")

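2.2 Technical Notes

chunk_text(): splitting long text
- Splits long text into chunks of at most chunk_size=800 characters.
- Keeps overlap=100 characters (12.5% of the chunk size) shared between adjacent chunks, so a sentence cut at a chunk boundary does not lose its context.
fetch_and_chunk(): fetching from the DB and splitting
- Flow:
  - Calls fetch_blocks() to pull the notion_blocks text from SQLite.
  - Calls chunk_text() on each block's text.
  - Builds chunk_id as block_id-index, which makes chunks easy to trace.
- Output format:
```
{
  "block_id": "block123",
  "page_id": "page123",
  "chunk_id": "block123-0",
  "text": "chunk content..."
}
```
  - Each block may map to multiple chunks.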
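
Because one block can map to several chunks, it is often useful to go from a `chunk_id` back to its block. Notion block IDs are typically UUIDs that contain hyphens themselves, so splitting on the first hyphen would give the wrong result. A minimal sketch, assuming the `block_id-index` format above; `parse_chunk_id` and `group_by_block` are hypothetical helper names, not part of the original code:

```python
from collections import defaultdict


def parse_chunk_id(chunk_id):
    """Split '<block_id>-<index>' on the LAST hyphen only."""
    block_id, _, index = chunk_id.rpartition("-")
    return block_id, int(index)


def group_by_block(chunks):
    """Group chunk dicts by block_id, ordered by their position in the block."""
    grouped = defaultdict(list)
    for chunk in chunks:
        grouped[chunk["block_id"]].append(chunk)
    for block_chunks in grouped.values():
        block_chunks.sort(key=lambda c: parse_chunk_id(c["chunk_id"])[1])
    return dict(grouped)
```

Grouping this way makes it straightforward to show the whole source block when a single chunk is returned by a search.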

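3. Embedding Model and Cost

3.1 Model Overview: `text-embedding-3-small`

OpenAI released the text-embedding-3 series in early 2024, in two sizes:

text-embedding-3-small: 1,536 dimensions; fast and cheap, a good fit for personal projects and knowledge retrieval.
text-embedding-3-large: 3,072 dimensions; captures finer semantic detail, suited to large applications that need very high accuracy (at a higher cost).

We chose text-embedding-3-small for the following reasons: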

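Balance between dimensions and performance
- 1,536 dimensions are enough to capture meaning for common tasks such as search, question answering, and classification.
- A knowledge-management scenario like Notion notes does not need very high dimensionality; larger vectors would only increase storage and slow down retrieval.
Very low cost
- Pricing: $0.02 / 1M tokens.
- Even with tens of thousands of chunks of a few hundred characters each, the total cost usually stays under 1 USD, which suits a long-running personal knowledge base.
Suitable for semantic retrieval and RAG
- The semantic quality of text-embedding-3-small is sufficient for:
  - Similarity search: finding the note passages closest to a query.
  - Semantic classification: sorting chunks into categories (study / travel / projects).
  - RAG (Retrieval-Augmented Generation): serving as an external knowledge base for an LLM to help it generate more accurate answers.
Broad community and example support
- Most popular open-source frameworks and vector stores (LangChain, LlamaIndex, Chroma, Weaviate, and others) provide integrations for OpenAI embedding models, including text-embedding-3-small.
- This means we can plug it into the pipeline without extra format handling.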
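
To make the storage side of this trade-off concrete, the arithmetic can be sketched as follows. It assumes each dimension is stored as a 4-byte float32 and ignores any index or metadata overhead added by a vector store; the chunk count of 10,000 is an illustrative figure, not a measurement from this project.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as float32

MODEL_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}


def raw_vector_storage_mb(num_chunks, dimensions):
    """Raw size of the vectors alone, in megabytes (1 MB = 1,000,000 bytes)."""
    return num_chunks * dimensions * BYTES_PER_FLOAT32 / 1_000_000


if __name__ == "__main__":
    for model, dims in MODEL_DIMENSIONS.items():
        size = raw_vector_storage_mb(10_000, dims)
        print(f"{model}: {size:.2f} MB for 10,000 chunks")
```

Under these assumptions, 10,000 chunks take about 61 MB with the small model and about 123 MB with the large one, so the larger model doubles the raw vector storage.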

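3.2 Cost Calculation

We chose text-embedding-3-small, currently the best-value embedding model from OpenAI; it is priced at **$0.02** per million tokens.
How much is a million tokens? Roughly 750,000 English words, or 300,000 to 500,000 Chinese characters. That sounds like a lot, but what does it actually cost? Let's write a small tool to find out.

3.2.1 Code: `src/calc_embedding_cost.py`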

import sqlite3

def fetch_block_lengths(db_path="data/notion.db", limit=1000):
    """Fetch block_text and measure its length (character count as a rough token estimate)."""
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    cur.execute("""
        SELECT block_text
        FROM notion_blocks
        WHERE block_text IS NOT NULL AND TRIM(block_text) <> ''
        LIMIT ?
    """, (limit,))
    rows = cur.fetchall()
    conn.close()

    return [len(r[0]) for r in rows if r[0]]


def calc_embedding_cost(num_chunks: int, avg_tokens_per_chunk: int, rate_per_million_tokens: float = 0.02):
    """Estimate the total cost of the OpenAI Embedding API."""
    total_tokens = num_chunks * avg_tokens_per_chunk
    cost_usd = (total_tokens / 1_000_000) * rate_per_million_tokens
    return total_tokens, cost_usd


if __name__ == "__main__":
    lengths = fetch_block_lengths(limit=5000)

    # Rough assumption: 1 character ≈ 1 token. This overestimates English text
    # but can underestimate Chinese, where one character may take several tokens.
    # Each block is treated as one chunk here; chunking overlap is not counted.
    total_tokens = sum(lengths)
    avg_tokens_per_chunk = int(total_tokens / len(lengths)) if lengths else 0

    total_tokens, cost_usd = calc_embedding_cost(len(lengths), avg_tokens_per_chunk)
    cost_twd = cost_usd * 31  # assumed exchange rate: 1 USD = 31 TWD
    print(f"Total chunks: {len(lengths)}")
    print(f"Average tokens/chunk: {avg_tokens_per_chunk}")
    print(f"Total tokens: {total_tokens:,}")
    print(f"Estimated cost: ${cost_usd:.4f} USD (about {cost_twd:.2f} TWD)")

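3.2.2 Technical Notes

fetch_block_lengths
- Reads directly from notion_blocks.block_text.
- Measures the length of each text in characters and uses it as a simplified approximation of the token count.
calc_embedding_cost
- Formula: total_tokens / 1,000,000 * unit price
- The default unit price is the text-embedding-3-small rate of $0.02 per million tokens.
Running it
- Quickly reports the total chunks, average tokens, total tokens, and the corresponding cost.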
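
The 1 character ≈ 1 token shortcut matters most for Chinese text, where a single character can take more than one token. A small sensitivity check makes the effect visible. This is a sketch under stated assumptions: `estimate_cost_usd` is a hypothetical helper, the 1,000,000-character corpus is illustrative, and the ratios 2.0 and 3.3 tokens per character are derived from the "300,000 to 500,000 Chinese characters per million tokens" figure quoted earlier.

```python
def estimate_cost_usd(total_chars, tokens_per_char, rate_per_million_tokens=0.02):
    """Estimate embedding cost for a corpus measured in characters."""
    total_tokens = total_chars * tokens_per_char
    return total_tokens / 1_000_000 * rate_per_million_tokens


if __name__ == "__main__":
    total_chars = 1_000_000  # illustrative corpus size, not a measured value
    # 1.0 is the shortcut used above; 2.0 and 3.3 are assumed ratios for Chinese text.
    for ratio in (1.0, 2.0, 3.3):
        cost = estimate_cost_usd(total_chars, ratio)
        print(f"{ratio} tokens/char -> ${cost:.3f} USD")
```

Even at the highest assumed ratio, one million characters stays well under 10 US cents, so the overall conclusion about cost does not depend on the exact token ratio.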