Day 20 | Hands-On RAGAs: Quantifying Retrieval and Generation Performance

2025 iThome Ironman Contest

DAY 20

AI & Data

RAG × Agent: A 30-Day Challenge from Knowledge Retrieval to Intelligent Applications, Part 20

17th Ironman Contest, llm, rag, ragas, hands-on

otterday

2025-10-04 00:34:27


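We have covered every metric worth introducing, so today we finally get hands-on!
If you need a refresher, look back at the posts from the last few days. Let's get started~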

1. Set up the environment
We use Ollama + Mistral as the scoring LLM here; otherwise RAGAS would need an OpenAI API key. We also use HuggingFace embeddings for RAGAS.

# Evaluation core
pip install ragas datasets evaluate

# Local LLM: the LangChain integration for Ollama
pip install langchain-ollama

# Local embeddings: the LangChain integration for HuggingFace
pip install langchain-huggingface
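
Before moving on, it can help to confirm that the packages above actually landed in the active environment. The snippet below is a minimal sketch using only the standard library; the helper name `installed_versions` is made up for this example and is not part of any of these packages.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # The distribution names match the pip commands above.
    needed = ["ragas", "datasets", "evaluate",
              "langchain-ollama", "langchain-huggingface"]
    for name, ver in installed_versions(needed).items():
        print(f"{name}: {ver or 'NOT INSTALLED'}")
```

Printing the versions is also handy when comparing results later, since RAGAS behavior can differ between releases.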

2. Get the RAG output
We already set up the following pieces in earlier posts. If you have forgotten them, go back and review the previous hands-on articles~

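# The query means "What is critical infrastructure?" (kept in Chinese to match the corpus)
query = "什麼是關鍵基礎設施？"
hits = search_chunks(query, k=4)              # retrieve the top 4 chunks
prompt = build_prompt(query, hits)            # assemble the prompt from the query and chunks
answer = ask_ollama(prompt, model="mistral")  # generate the answer with Mistral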
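
The four lines above handle a single question. If you want to score several questions in one run, one option is to wrap the same three steps in a loop. This is only a sketch: `collect_rag_outputs` is a hypothetical helper, and it assumes the retrieval, prompt-building, and generation functions are passed in as callables and that each hit is a dict with a "text" key, as in this series.

```python
def collect_rag_outputs(queries, search, build, ask, k=4):
    """Run retrieve -> build prompt -> generate for each query.

    search(query, k=...) returns a list of hits (dicts with a "text" key),
    build(query, hits) returns a prompt string,
    ask(prompt) returns the generated answer.
    """
    records = []
    for query in queries:
        hits = search(query, k=k)
        prompt = build(query, hits)
        answer = ask(prompt)
        records.append({
            "question": query,
            "contexts": [h["text"] for h in hits],
            "answer": answer,
        })
    return records
```

With the helpers from the earlier posts, a call could look like `collect_rag_outputs(queries, search_chunks, build_prompt, lambda p: ask_ollama(p, model="mistral"))`.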

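3. Build the evaluation dataset
This step gathers the fields that the metrics (Context Precision, Context Recall, Answer Relevance, and Answer Faithfulness) need and stores them in a single dataset. You could skip building it as a separate step; it is here only to make the explanation easier. Either way, be clear about which fields you are passing into the evaluation.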

from datasets import Dataset

# Extract the text of the retrieved chunks
contexts = [h["text"] for h in hits]

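# Reference answer: written by hand or compiled from the source material.
# English gloss: critical infrastructure refers to physical or virtual assets, systems, or
# networks whose failure or degraded performance could significantly affect national security,
# the public interest, citizens' daily lives, or economic activity, in areas that the competent
# authority periodically reviews and announces.
reference = "關鍵基礎設施是指實體或虛擬資產、系統或網路，其功能一旦停止運作或效能降低，對國家安全、社會公共利益、國民生活或經濟活動有重大影響之虞，經主管機關定期檢視並公告之領域。"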

data = {
    "question": [query],
    "contexts": [contexts],
    "answer":   [answer],
    "reference":[reference],
}
dataset = Dataset.from_dict(data)
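
Since the shape of this dictionary is easy to get wrong (for example, passing a flat list of strings as "contexts" instead of one list per question), a small pre-flight check can catch mistakes before the slow LLM calls start. The sketch below is plain Python and does not call RAGAS. The `REQUIRED_FIELDS` mapping reflects my understanding of which columns each metric reads, using the column names from this post; treat it as an assumption, since the exact requirements can vary between RAGAS versions.

```python
# Assumed mapping of metric name -> columns it needs (not read from RAGAS itself).
REQUIRED_FIELDS = {
    "context_precision": {"question", "contexts", "reference"},
    "context_recall": {"question", "contexts", "reference"},
    "faithfulness": {"question", "contexts", "answer"},
    "answer_relevancy": {"question", "answer"},
}


def check_eval_data(data, metrics):
    """Return a list of human-readable problems; an empty list means the data looks fine."""
    problems = []

    # Every column must hold one entry per question.
    lengths = {col: len(values) for col, values in data.items()}
    if len(set(lengths.values())) > 1:
        problems.append(f"columns differ in length: {lengths}")

    # "contexts" must be a list of lists of strings (one inner list per question).
    for row in data.get("contexts", []):
        if not isinstance(row, list) or not all(isinstance(c, str) for c in row):
            problems.append("each 'contexts' entry must be a list of strings")
            break

    # Each requested metric must find the columns it needs.
    for metric in metrics:
        required = REQUIRED_FIELDS.get(metric)
        if required is None:
            problems.append(f"unknown metric: {metric}")
            continue
        missing = required - set(data)
        if missing:
            problems.append(f"{metric} is missing {sorted(missing)}")
    return problems
```

Calling `check_eval_data(data, ["context_precision", "context_recall", "answer_relevancy", "faithfulness"])` on the dictionary above should return an empty list if everything lines up.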

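4. Run the RAGAS evaluation
We use Ollama (Mistral) as the scoring LLM, together with HuggingFace embeddings.
Earlier we already used sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 for the retrieval embeddings when building the Chroma database, so the RAGAS evaluation simply reuses the same model.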
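
Because the scoring LLM runs locally, the evaluation cannot work if the Ollama server is not running or the model has not been pulled. The sketch below checks both before you start. It assumes Ollama's default address `http://localhost:11434` and its `/api/tags` endpoint, which lists locally available models; `ollama_has_model` is a name invented for this example.

```python
import json
import urllib.request


def ollama_has_model(model, base_url="http://localhost:11434", timeout=3.0):
    """Return True if the Ollama server at base_url lists `model` among its local models.

    Returns False when the server is unreachable or the response cannot be parsed.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # OSError covers connection errors and timeouts; ValueError covers bad JSON.
        return False

    # Model names usually carry a tag, e.g. "mistral:latest"; compare the part before ":".
    names = [m.get("name", "") for m in payload.get("models", [])]
    return any(name.split(":")[0] == model for name in names)
```

If `ollama_has_model("mistral")` returns False, starting the server and running `ollama pull mistral` is the usual fix.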

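The LLM gets two specific settings here:

temperature=0 → keeps the output stable and avoids randomness.
num_ctx=4096 → raises the context-length limit to reduce truncation.
These two settings help reduce how often a metric comes back as NaN.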
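
Even with these settings, an occasional NaN can still slip through, typically when the judge model's output cannot be parsed. When you average per-question scores yourself, it is worth skipping those values explicitly, because a single NaN turns a plain mean into NaN. This is a small sketch in plain Python; `summarize_scores` is a hypothetical helper, not a RAGAS function.

```python
import math


def summarize_scores(scores):
    """Average a list of metric scores while ignoring None and NaN entries."""
    valid = [s for s in scores if s is not None and not math.isnan(s)]
    mean = sum(valid) / len(valid) if valid else None
    return {"mean": mean, "valid": len(valid), "nan": len(scores) - len(valid)}
```

Reporting the NaN count next to the mean makes it obvious when a score is based on only a few usable rows.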

Here we call evaluate() to run the RAGAS metrics. Pick whichever metrics you need and pass them in through metrics.

from ragas.metrics import context_precision, context_recall
from ragas.metrics import faithfulness, answer_relevancy
from ragas import evaluate

# Ollama settings
from langchain_ollama import OllamaLLM
from ragas.llms import LangchainLLMWrapper
ollama_raw = OllamaLLM(model="mistral", temperature=0, num_ctx=4096)
ollama_llm  = LangchainLLMWrapper(ollama_raw)

# Embeddings settings
from langchain_huggingface import HuggingFaceEmbeddings as LCHuggingFaceEmbeddings
from ragas.embeddings import LangchainEmbeddingsWrapper
lc_hf = LCHuggingFaceEmbeddings(model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
hf_embeddings = LangchainEmbeddingsWrapper(lc_hf)

# Run the evaluation
result = evaluate(
    dataset,
    metrics=[context_precision, context_recall, answer_relevancy, faithfulness],
    llm=ollama_llm,
    embeddings=hf_embeddings
)
print(result)

5. Output
{'context_precision': 0.7500, 'context_recall': 1.0000, 'answer_relevancy': 0.7336, 'faithfulness': 0.5000}
Here is what each of the four scores represents: