Day 18: Storing embeddings in a vector database and implementing simple retrieval

2025 iThome Ironman Contest

DAY 18

Generous Sharing: Self-Study Skills for IT People

Learning LLM series, part 18

17th Ironman Contest

yu_ting

2025-10-02 15:17:12


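Workflow:

Start with embeddings (an N × d numpy.float32 array) and the matching ids (a list of str)
Add the embeddings to a vector index (FAISS / Chroma) → the index is built
Implement query_to_topk(query, k): encode the query into a vector → retrieve the top-k → return ids, scores, and the original text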
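
Before bringing in a vector database, the three steps above can be sketched with plain NumPy. This is only a brute-force stand-in, not the FAISS or Chroma code used later: the function name `topk_by_cosine`, the toy vectors, and the toy texts are all made up for illustration, and the query is passed in as an already-encoded vector so that no model is needed.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length (all-zero rows are left unchanged)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.where(norms == 0, 1.0, norms)

def topk_by_cosine(query_vec, embeddings, ids, texts, k=3):
    """Hypothetical helper: cosine top-k by scanning every row."""
    q = l2_normalize(np.asarray(query_vec, dtype=np.float32).reshape(1, -1))
    e = l2_normalize(np.asarray(embeddings, dtype=np.float32))
    scores = (e @ q.T).ravel()          # cosine similarity of each row to the query
    order = np.argsort(-scores)[:k]     # positions of the k highest scores
    return [
        {"id": ids[i], "score": float(scores[i]), "text": texts[i]}
        for i in order
    ]

# Toy data (2-dimensional so the numbers are easy to follow)
toy_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], dtype=np.float32)
toy_ids = ["a", "b", "c"]
toy_texts = ["doc a", "doc b", "doc c"]

hits = topk_by_cosine([1.0, 0.1], toy_embeddings, toy_ids, toy_texts, k=2)
print(hits)
```

A vector index does the same job, but is built to stay fast when N grows far beyond what a full scan can handle.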

1. Setup

# Shared setup: load and check embeddings and ids
import os, json, numpy as np, pandas as pd


# Load faqs.csv (the file must exist in the working directory)
df = pd.read_csv("faqs.csv", encoding="utf-8-sig")
print("df shape:", df.shape)


# Load embeddings and ids
if os.path.exists("faq_question_embeddings.npy"):
    embeddings = np.load("faq_question_embeddings.npy")
    print("Loaded embeddings shape:", embeddings.shape)
else:
    # Stop here: every later step needs `embeddings`
    raise FileNotFoundError("faq_question_embeddings.npy not found; run the embedding generation step first.")


if os.path.exists("faq_ids.json"):
    with open("faq_ids.json", "r", encoding="utf-8") as f:
        ids = json.load(f)
    print("Loaded ids:", len(ids))
else:
    # Fall back to the id column of df, cast to str
    ids = df["id"].astype(str).tolist()
    print("Using df ids:", len(ids))

Output:
df shape: (10, 3)
Loaded embeddings shape: (10, 384)
Loaded ids: 10
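
Since retrieval maps a row number in the index back to `ids`, it can be worth checking that the two really line up before building anything. The sketch below is one possible check, not part of the original setup: `check_alignment` is a hypothetical helper name, and the demo arrays are zero-filled placeholders that only mimic the (10, 384) shape printed above.

```python
import numpy as np

def check_alignment(embeddings: np.ndarray, ids: list) -> None:
    """Raise ValueError if embeddings and ids cannot be paired row by row."""
    if embeddings.ndim != 2:
        raise ValueError(f"expected a 2-D array, got shape {embeddings.shape}")
    if embeddings.shape[0] != len(ids):
        raise ValueError(f"{embeddings.shape[0]} vectors but {len(ids)} ids")
    if len(set(ids)) != len(ids):
        raise ValueError("ids contain duplicates")
    if not np.isfinite(embeddings).all():
        raise ValueError("embeddings contain NaN or inf")

# Placeholder data with the same shape as the real arrays
demo_embeddings = np.zeros((10, 384), dtype=np.float32)
demo_ids = [f"q{i}" for i in range(1, 11)]

check_alignment(demo_embeddings, demo_ids)
print("alignment ok")
```

With the real data, the call would simply be `check_alignment(embeddings, ids)` right after loading.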

2. The FAISS route

import faiss  # pip install faiss-cpu

# Assumes embeddings is already defined; FAISS needs a C-contiguous float32 array
embeddings = np.ascontiguousarray(embeddings, dtype="float32")
N, d = embeddings.shape


# --- Cosine similarity: L2-normalize first, then use IndexFlatIP (inner product) ---
faiss.normalize_L2(embeddings)                 # normalize each row in place, so inner product equals cosine


index = faiss.IndexFlatIP(d)                   # exact inner-product index
index.add(embeddings)                          # add all vectors
print("Index ntotal:", index.ntotal)           # should equal N


# Save and load
faiss.write_index(index, "faiss_index.bin")
# index = faiss.read_index("faiss_index.bin")  # how to load it back


# Search function
def faiss_search(query_emb, k=3, return_scores=True):
    """
    query_emb: numpy array of shape (1, d), float32, already L2-normalized
               (required because the index is IndexFlatIP over normalized vectors)
    returns: list of dicts: [{id, index, score}, ...]
    """
    # If query_emb is not normalized yet, call faiss.normalize_L2(query_emb) first
    D, I = index.search(query_emb, k)   # D: scores, I: row indices
    results = []
    for score, idx in zip(D[0], I[0]):
        if idx < 0:                     # FAISS pads with -1 when fewer than k hits exist
            continue
        results.append({
            "id": ids[idx],                   # map the row index back to its id
            "index": int(idx),
            "score": float(score)
        })
    return results


# Usage example: encode the query with the same embedder that produced the embeddings
from sentence_transformers import SentenceTransformer
embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
query = "我要退貨怎麼做？"  # "How do I return an item?"
q_emb = embedder.encode([query], convert_to_numpy=True).astype("float32")
faiss.normalize_L2(q_emb)
res = faiss_search(q_emb, k=3)
print(res)
# Print the original text for each hit
for r in res:
    row = df[df["id"].astype(str) == r["id"]].iloc[0]
    print(r["score"], row["question"], row["answer"])

Output:
Index ntotal: 10
[{'id': 'q1', 'index': 0, 'score': 0.8171406984329224}, {'id': 'q4', 'index': 3, 'score': 0.5553216934204102}, {'id': 'q10', 'index': 9, 'score': 0.3988548517227173}]
0.8171406984329224 如何申請退貨？請於訂單頁點選退貨申請並上傳商品照片，客服將於 3 個工作天內處理。
0.5553216934204102 付款方式有哪些？我們支援信用卡、LINE Pay 與貨到付款。
0.3988548517227173 如何使用優惠券？在結帳頁面輸入優惠碼，系統會自動折抵。
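
The scores above are cosine similarities only because both the stored vectors and the query were L2-normalized before going through an inner-product index. A small NumPy check of that identity, using random vectors rather than the FAQ embeddings (the array sizes and the seed are arbitrary choices for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(5, 8)).astype(np.float32)    # 5 stand-in "documents"
query = rng.normal(size=(1, 8)).astype(np.float32)   # 1 stand-in "query"

# Cosine similarity straight from its definition
cos_direct = (docs @ query.T).ravel() / (
    np.linalg.norm(docs, axis=1) * np.linalg.norm(query)
)

# Inner product after L2-normalizing both sides
docs_unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
query_unit = query / np.linalg.norm(query, axis=1, keepdims=True)
cos_via_ip = (docs_unit @ query_unit.T).ravel()

same = bool(np.allclose(cos_direct, cos_via_ip, atol=1e-5))
print(same)
```

If the normalization step is skipped on either side, IndexFlatIP still returns numbers, but they are raw inner products and no longer bounded by 1.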

3. Threshold / top-k handling

def retrieve_topk(query, k=3, score_threshold=0.35):
    q_emb = embedder.encode([query], convert_to_numpy=True).astype("float32")
    faiss.normalize_L2(q_emb)
    res = faiss_search(q_emb, k=k)
    # Drop hits below the threshold (cosine score)
    filtered = [r for r in res if r['score'] >= score_threshold]
    return filtered


print(retrieve_topk("寄件速度多久？", k=3, score_threshold=0.3))  # "How long does shipping take?"

Output:
[{'id': 'q5', 'index': 4, 'score': 0.5406196117401123}, {'id': 'q3', 'index': 2, 'score': 0.3819848895072937}, {'id': 'q7', 'index': 6, 'score': 0.37685006856918335}]
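
The workflow at the top also asks for the original text, which `retrieve_topk` does not return yet. One way this could be added is a small post-processing step over the result dicts. The sketch below is an assumption-laden illustration: `attach_text` and `faq_lookup` are hypothetical names, the scores and texts are toy values, and the lookup is a plain dict so the example runs without pandas or FAISS.

```python
def attach_text(results, faq_lookup, score_threshold=0.35):
    """Keep hits at or above the threshold and add their question/answer text."""
    enriched = []
    for r in results:
        if r["score"] < score_threshold:
            continue                      # below the cosine threshold
        row = faq_lookup.get(r["id"])
        if row is None:
            continue                      # id not present in the lookup table
        enriched.append({**r, "question": row["question"], "answer": row["answer"]})
    return enriched

# Toy lookup. With the DataFrame from the setup section it could be built as:
# faq_lookup = df.astype({"id": str}).set_index("id")[["question", "answer"]].to_dict("index")
faq_lookup = {
    "q5": {"question": "toy question 5", "answer": "toy answer 5"},
    "q3": {"question": "toy question 3", "answer": "toy answer 3"},
}

# Toy results in the same shape that faiss_search returns
sample_results = [
    {"id": "q5", "index": 4, "score": 0.54},
    {"id": "q3", "index": 2, "score": 0.38},
    {"id": "q7", "index": 6, "score": 0.20},
]

enriched = attach_text(sample_results, faq_lookup, score_threshold=0.35)
for hit in enriched:
    print(hit["score"], hit["question"], hit["answer"])
```

An empty list coming back is a useful signal in its own right: it means no FAQ cleared the threshold, so the caller can fall back to a "no match found" reply instead of showing a weak hit.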

4. The Chroma route: create a collection, add data, query
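
The listing in this section shows creating the collection and querying it, but not the step that adds data. A rough sketch of how the records might be prepared is given here, under several assumptions: `build_chroma_payload` is a hypothetical helper, the rows and vectors are toy values, and the metadata layout (question plus answer) is only inferred from the query output at the end of this section.

```python
def build_chroma_payload(rows, vectors):
    """Turn FAQ rows and their embeddings into keyword arguments for an upsert."""
    if len(rows) != len(vectors):
        raise ValueError("rows and vectors must have the same length")
    return {
        "ids": [str(r["id"]) for r in rows],                 # Chroma ids are strings
        "documents": [r["question"] for r in rows],          # text stored per record
        "metadatas": [
            {"question": r["question"], "answer": r["answer"]} for r in rows
        ],
        "embeddings": [[float(x) for x in v] for v in vectors],
    }

# Toy rows and 2-dimensional toy vectors
toy_rows = [
    {"id": 1, "question": "toy question 1", "answer": "toy answer 1"},
    {"id": 2, "question": "toy question 2", "answer": "toy answer 2"},
]
toy_vectors = [[0.1, 0.2], [0.3, 0.4]]

payload = build_chroma_payload(toy_rows, toy_vectors)
print(payload["ids"])
```

With the real data, the rows could come from `df.to_dict("records")` and the vectors from `embeddings`, and once `collection` exists the payload could be written with `collection.upsert(**payload)`. Using upsert rather than add means re-running the notebook overwrites records with the same id instead of failing on duplicates.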

# =========================
# Install required packages
# =========================
!pip install -q chromadb sentence-transformers pandas


# =========================
# Import packages
# =========================
import chromadb
from sentence_transformers import SentenceTransformer
import pandas as pd




# =========================
# Initialize Chroma
# =========================
client = chromadb.PersistentClient(path="./chroma_db")


collection_name = "faq_collection"
# Reuse the collection if it already exists, otherwise create it
collection = client.get_or_create_collection(name=collection_name)




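# =========================
# Query example (simple semantic search)
# =========================
# Same model as in the FAISS section; defined here so this section runs on its own
embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
query = "我想退東西怎麼辦？"  # "What should I do if I want to return something?"
q_emb = embedder.encode([query], convert_to_numpy=True).tolist()


res = collection.query(
    query_embeddings=q_emb,
    n_results=3,
    include=["documents", "metadatas", "distances"]
)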


# =========================
# Function that prints the results
# =========================
def pretty_print_chroma_query_result(res):
    ids_matrix = res.get("ids")
    docs_matrix = res.get("documents")
    metas_matrix = res.get("metadatas")
    dists_matrix = res.get("distances")


    ids_row = ids_matrix[0] if ids_matrix else []
    docs_row = docs_matrix[0] if docs_matrix else []
    metas_row = metas_matrix[0] if metas_matrix else []
    dists_row = dists_matrix[0] if dists_matrix else []


    k = max(len(ids_row), len(docs_row), len(metas_row), len(dists_row))


    print(f"\n🔎 Query Result (Top {k})")
    for i in range(k):
        print(f"\n--- Rank {i+1} ---")
        if ids_row and i < len(ids_row):
            print("ID:", ids_row[i])
        if docs_row and i < len(docs_row):
            print("Document:", docs_row[i])
        if metas_row and i < len(metas_row):
            print("Metadata:", metas_row[i])
        if dists_row and i < len(dists_row):
            print("Distance:", dists_row[i])


# =========================
# Print the query results
# =========================
pretty_print_chroma_query_result(res)

Output:
🔎 Query Result (Top 3)

--- Rank 1 ---
ID: 1
Document: 我要怎麼退貨？
Metadata: {'answer': '請至訂單頁面申請退貨，我們會安排取件。', 'question': '我要怎麼退貨？', 'id': '1'}
Distance: 5.874162673950195

--- Rank 2 ---
ID: q1
Document: 如何申請退貨？
Metadata: {'question': '如何申請退貨？', 'answer': '請於訂單頁點選退貨申請並上傳商品照片，客服將於 3 個工作天內處理。'}
Distance: 10.592032432556152

--- Rank 3 ---
ID: q10
Document: 如何使用優惠券？
Metadata: {'answer': '在結帳頁面輸入優惠碼，系統會自動折抵。', 'question': '如何使用優惠券？'}
Distance: 12.331013679504395
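
Note that these are distances (smaller is closer), not the cosine scores from the FAISS section (larger is closer). As far as I know, Chroma's default metric is squared L2, which is worth confirming against the installed version. If that holds, values above 4 would suggest the stored vectors were not L2-normalized, because two unit vectors can never be further apart than that. For unit-length vectors the two measures are linked by d² = 2 − 2·cos, and the sketch below checks this with random vectors (not the FAQ embeddings; the function names are made up for the demo):

```python
import numpy as np

def squared_l2(u: np.ndarray, v: np.ndarray) -> float:
    """Squared Euclidean distance between two vectors."""
    return float(np.sum((u - v) ** 2))

def cosine_from_squared_l2(d2: float) -> float:
    """Convert squared L2 distance to cosine similarity (unit-length vectors only)."""
    return 1.0 - d2 / 2.0

rng = np.random.default_rng(1)
u = rng.normal(size=16)
v = rng.normal(size=16)
u /= np.linalg.norm(u)   # make both vectors unit length
v /= np.linalg.norm(v)

d2 = squared_l2(u, v)
recovered = cosine_from_squared_l2(d2)
direct = float(u @ v)
print(abs(recovered - direct) < 1e-9)
```

The conversion only holds when both the stored vectors and the query are normalized, so it should not be applied to the distances printed above as they stand.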