[Day 16] From Chunks to Vectors: Writing Notion Notes into Chroma DB

2025 iThome Ironman Contest

DAY 16

AI & Data

Notion Meets LLM: Building My AI Knowledge Management System in 30 Days, series part 16

17th Ironman Contest notion api sqlite chromadb openai

Nikki Chen

Team 三陳牛肉吉事堡

2025-09-30 22:59:42


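On Day 15 we set up the OpenAI API key and initialized Chroma DB. Today we reach an important milestone:
we send the chunks of our Notion notes to OpenAI's embedding model (text-embedding-3-small), convert them into vectors, and write them into Chroma DB.
Once this step is done, the knowledge base supports semantic retrieval: instead of relying on rigid keyword matching, we can search the notes by meaning.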
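To make "searching by meaning" concrete, here is a minimal sketch of how closeness between two vectors can be scored with cosine similarity. The three-dimensional vectors and their names are invented for illustration only; real text-embedding-3-small vectors have 1,536 dimensions, and the metric Chroma actually applies depends on how the collection is configured.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors, made up for this example (not real embeddings).
query = [0.9, 0.1, 0.0]
note_about_classes = [0.8, 0.2, 0.1]
note_about_cooking = [0.0, 0.1, 0.9]

# The note pointing in a similar direction to the query scores higher.
print(cosine_similarity(query, note_about_classes) > cosine_similarity(query, note_about_cooking))
```

A semantic search ranks stored chunks by a score of this kind, so a note can match a question even when the two share no exact words.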

1. Implementing Vectorization

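Today's task breaks down into four steps:

Fetch the Notion blocks from SQLite and chunk them.
Call OpenAI's Embedding API to convert the text into vectors.
Initialize Chroma DB and create a notion_notes Collection.
Write the chunks, vectors, and metadata into the Collection.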

1.1 Code: `src/embed_notion_chunks.py`

For fetch_notion_blocks.py and chunk_block_text.py, see [Day 14] Data Chunking and Embedding Cost Estimation.

import os
from dotenv import load_dotenv
from openai import OpenAI
import chromadb
from fetch_notion_blocks import fetch_blocks
from chunk_block_text import chunk_text

# --- 1. Load environment variables and initialize OpenAI ---
load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# --- 2. Initialize Chroma DB ---
db_path = "db/chroma_db"
chroma_client = chromadb.PersistentClient(path=db_path)
collection = chroma_client.get_or_create_collection("notion_notes")

# --- 3. Fetch the data and chunk it ---
def fetch_and_chunk(db_path="data/notion.db", limit=50):
    blocks = fetch_blocks(db_path=db_path, limit=limit)
    all_chunks = []

    for b in blocks:
        pieces = chunk_text(b["text"])
        for idx, p in enumerate(pieces):
            all_chunks.append({
                "block_id": b["block_id"],
                "page_id": b["page_id"],
                "chunk_id": f"{b['block_id']}-{idx}",
                "text": p
            })
    return all_chunks

# --- 4. Call the OpenAI Embedding API ---
def embed_texts(texts, model="text-embedding-3-small"):
    response = client.embeddings.create(
        model=model,
        input=texts
    )
    return [d.embedding for d in response.data]

# --- 5. Main program: chunks -> embeddings -> Chroma DB ---
if __name__ == "__main__":
    chunks = fetch_and_chunk(limit=20)
    texts = [c["text"] for c in chunks]
    ids = [c["chunk_id"] for c in chunks]
    metadatas = [{"block_id": c["block_id"], "page_id": c["page_id"]} for c in chunks]

    print(f"Preparing to write {len(chunks)} chunks...")

    # Generate the embeddings (one API call for the whole list)
    embeddings = embed_texts(texts)

    # Write to Chroma DB
    collection.add(
        ids=ids,
        documents=texts,
        metadatas=metadatas,
        embeddings=embeddings
    )

    print(f"✅ Wrote {len(chunks)} chunks to Collection 'notion_notes'")
    print(f"Current Collection total: {collection.count()} records")

Output:

Preparing to write 20 chunks...
✅ Wrote 20 chunks to Collection 'notion_notes'
Current Collection total: 20 records

1.2 Key Technical Points

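Embedding API
- Uses text-embedding-3-small, which returns 1,536-dimensional vectors by default.
- client.embeddings.create(model=model, input=texts) accepts a list of texts, so several chunks can be embedded in a single call.
Chroma DB Collection
- documents: the original text content.
- metadatas: source information (block_id, page_id).
- ids: unique IDs (here, block_id plus the chunk index).
- embeddings: the vectors produced by OpenAI.
Performance tips
- Calling the Embedding API in batches (for example, 20 chunks per request) is faster than sending chunks one at a time.
- When there are many chunks, embed and write them in batches.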
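The batching advice above could be sketched roughly as follows. The helper names `batched` and `embed_and_store` and the default `batch_size=100` are assumptions made for this example, not part of the original script. The sketch also swaps `collection.add` for Chroma's `collection.upsert`, so that rerunning it with the same chunk IDs overwrites existing records instead of trying to add them again.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_and_store(chunks, embed_fn, collection, batch_size=100):
    """Embed and write chunks one batch at a time; return how many were written.

    `embed_fn` takes a list of texts and returns a list of vectors
    (for example, embed_texts from the script above).
    `collection` is any object exposing a Chroma-style upsert() method.
    """
    written = 0
    for batch in batched(chunks, batch_size):
        texts = [c["text"] for c in batch]
        collection.upsert(
            ids=[c["chunk_id"] for c in batch],
            documents=texts,
            metadatas=[{"block_id": c["block_id"], "page_id": c["page_id"]} for c in batch],
            embeddings=embed_fn(texts),
        )
        written += len(batch)
    return written
```

In the script above this would be called as `embed_and_store(chunks, embed_texts, collection)`. Passing the embedding function and the collection in as arguments keeps the batching logic easy to test without network access.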

2. Verifying ChromaDB

At this point the Notion note chunks have been written to the vector database. Next we verify that the data actually landed in ChromaDB correctly.

2.1 Checking That the Collection Exists

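In ChromaDB, a Collection is roughly equivalent to a "table" that holds a set of related vectors. With list_collections() we can confirm that our Collection exists and see how many records it contains.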

Code: src/verify_chroma.py

import chromadb

def verify_collections(db_path="db/chroma_db"):
    client = chromadb.PersistentClient(path=db_path)

    print("Collections currently in ChromaDB:")
    for col in client.list_collections():
        print(f"- Name: {col.name}, ID: {col.id}, Records: {col.count()}")

if __name__ == "__main__":
    verify_collections()

Output:

Collections currently in ChromaDB:
- Name: notion_notes, ID: c7080edaXXXXXXX, Records: 20

If notion_notes shows up, our Notion chunks have been written to ChromaDB.
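Beyond the record count, it can also be worth confirming that the stored vectors have the expected length. The following is a minimal sketch under the assumption that the collection was filled as in section 1; `embedding_dimension` is a hypothetical helper name, and it relies only on Chroma's `collection.get()` with `limit` and `include`.

```python
def embedding_dimension(collection):
    """Return the vector length of one stored record, or None if the collection is empty."""
    sample = collection.get(limit=1, include=["embeddings"])
    embeddings = sample["embeddings"]
    # Depending on the chromadb version this may be a list or an array,
    # so check emptiness with len() rather than truthiness.
    if embeddings is None or len(embeddings) == 0:
        return None
    return len(embeddings[0])
```

For a collection filled with text-embedding-3-small vectors at the default size, `embedding_dimension(collection)` would be expected to return 1536.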

2.2 Semantic Query

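With the Collection in place, we can run semantic queries. For example, we can ask "什麼是物件導向？" ("What is object-oriented programming?") and let Chroma find the most relevant chunks.
The query vector must come from the same model as the stored vectors, so before querying we generate the embedding with OpenAI and then pass it to Chroma.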

import os
import chromadb
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()
client_oa = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def query_chroma(db_path="db/chroma_db", collection_name="notion_notes"):
    # Initialize the Chroma client
    client = chromadb.PersistentClient(path=db_path)
    collection = client.get_or_create_collection(name=collection_name)

    # 1. Generate the query embedding first (1,536 dimensions)
    query_text = "什麼是物件導向？"  # "What is object-oriented programming?"
    embedding = client_oa.embeddings.create(
        model="text-embedding-3-small",
        input=query_text
    ).data[0].embedding

    # 2. Query with the embedding
    results = collection.query(
        query_embeddings=[embedding],
        n_results=3
    )

    print("\n--- Query results ---")
    for i, doc in enumerate(results["documents"][0]):
        metadata = results["metadatas"][0][i]
        print(f"Result {i+1}:")
        print(f"  - Content: {doc[:100]}...")
        print(f"  - Metadata: {metadata}")

if __name__ == "__main__":
    query_chroma()

Output:

--- Query results ---
Result 1:
  - Content: self.屬性可以用來存取物件的屬性。...
  - Metadata: {'block_id': 'b01', 'page_id': 'p01'}
Result 2:
  - Content: 在用類別建立物件時，若希望能同時指定物件的初始值，可將要定義的attribute集中在__int__()這個初始化的method內，那麼在建立物件時，系統就會自動呼叫__int__()，並用傳入的參數...
  - Metadata: {'page_id': 'p02', 'block_id': 'b01'}
Result 3:
  - Content: 類別就像是物件的設計藍圖，可以產出具有相似特性的物件。由Attribute(屬性)與Method(方法)繪製而成。...
  - Metadata: {'page_id': 'p03', 'block_id': 'b01'}
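The result dictionary returned by collection.query() normally also carries a "distances" list alongside "documents" and "metadatas", which helps judge how close each hit really is. Below is a minimal sketch of pulling the three together; `summarize_hits` is a hypothetical helper, and it assumes the query was made with a single query embedding and with distances included (Chroma's default).

```python
def summarize_hits(results, preview_chars=100):
    """Flatten a single-query Chroma result into a list of hits, closest first."""
    docs = results["documents"][0]
    metas = results["metadatas"][0]
    dists = results["distances"][0]
    hits = [
        {"distance": dist, "preview": doc[:preview_chars], "metadata": meta}
        for doc, meta, dist in zip(docs, metas, dists)
    ]
    # A smaller distance means the chunk is closer to the query vector.
    return sorted(hits, key=lambda h: h["distance"])
```

Inside query_chroma() this could be used as `for hit in summarize_hits(results): print(hit)`. Unusually large distances on the top hits can be a sign that nothing in the notes really matches the question.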