Day 27: The GAI Explosion Era - Introduction to Ragas

2024 iThome Ironman Contest (16th)

DAY 27

Generative AI

Part 27 of the series "LLM Applications, Development Frameworks, RAG Optimization, and Evaluation Methods"

wow_ppwx

2024-09-03 00:24:21


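Introduction to Ragas

RAGAS (Retrieval Augmented Generation Assessment) is a tool built specifically to evaluate the performance of RAG (Retrieval-Augmented Generation) systems. It aims to provide comprehensive, accurate measurements that help developers and researchers understand and improve how their RAG systems behave. A detailed introduction follows:

1. Design goals of RAGAS

RAGAS aims to give RAG systems a standardized evaluation framework that measures performance along several aspects, such as the accuracy, relevance, fluency, and semantic similarity of responses. With these measurements in hand, developers can tune and optimize the system, improving both user experience and practical usefulness.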

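2. Main features of RAGAS

RAGAS offers a range of features, concentrated in the following areas:

a. Multi-dimensional evaluation

RAGAS supports evaluating a RAG system along multiple dimensions, each representing a different aspect of its performance:

Semantic Similarity: how semantically close the generated response is to the reference answer. This dimension is usually measured with semantic embeddings, for example by using a pretrained model such as BERT to compute the cosine similarity between two sentences.
Accuracy: whether the generated answer is correct at the level of information. This dimension focuses on whether the content is factually and logically consistent with the question.
Relevance: how closely the generated answer relates to the input question. High relevance means the answer directly addresses what the user asked.
Fluency: how natural the generated answer is as language, mainly whether sentence structure and word choice follow the norms of the language.
Coverage: how much of the information the question calls for is covered by the generated answer, so that the response is complete with respect to the user's need.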

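b. Semantic similarity

Semantic similarity is a key evaluation metric in RAGAS. RAGAS computes the semantic similarity between the generated answer and the reference answer, usually with semantic embeddings. For example, RAGAS may use BERT or another language model to turn each sentence into a vector and then compute the cosine similarity between the vectors. This captures the meaning of an answer better than traditional word-matching methods.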
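
To make the cosine-similarity step concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are made up for illustration; in practice they would come from an embedding model, and `cosine_similarity` is a helper written for this example, not a Ragas function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_a * norm_b)


# Toy "embeddings" (assumed values, not real model output).
answer_vec = [1.0, 2.0, 0.0]
truth_vec = [2.0, 4.0, 0.0]      # same direction as answer_vec
unrelated_vec = [0.0, 0.0, 3.0]  # orthogonal to answer_vec

print(cosine_similarity(answer_vec, truth_vec))      # close to 1: same direction
print(cosine_similarity(answer_vec, unrelated_vec))  # 0.0: no overlap
```

Because cosine similarity depends only on direction, two sentences with embeddings of different magnitude can still score close to 1.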

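c. Automated evaluation

RAGAS supports automated evaluation at scale: it can process a large number of test samples and produce a detailed evaluation report. This matters for developers who adjust and re-test a RAG system frequently, because it shortens test time considerably and makes testing more efficient.

d. Flexibility and configurability

RAGAS is designed to be flexible and configurable. Users can choose the evaluation metrics and weights that match their needs, so the evaluation fits different application scenarios. For example, some applications care most about the accuracy of answers, while others care more about fluency and semantic similarity. RAGAS lets users configure the evaluation accordingly, so that the results reflect the actual requirements.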
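
One way to express "different applications weight metrics differently" is a weighted average over per-metric scores. The sketch below is only an illustration: `weighted_score` is a hypothetical helper applied to scores you already have, not a Ragas API, and the numbers are invented.

```python
def weighted_score(scores, weights):
    """Weighted average of per-metric scores; weights need not sum to 1."""
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"no score for: {sorted(missing)}")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * w for name, w in weights.items()) / total


# Invented per-metric scores for one system.
scores = {"faithfulness": 0.9, "answer_relevancy": 0.6}

# An application that cares three times as much about faithfulness.
accuracy_first = {"faithfulness": 3, "answer_relevancy": 1}
balanced = {"faithfulness": 1, "answer_relevancy": 1}

print(weighted_score(scores, accuracy_first))  # about 0.825
print(weighted_score(scores, balanced))        # about 0.75
```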

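3. Use cases for RAGAS

RAGAS applies to a variety of scenarios that involve information retrieval and generation, such as:

Question answering systems: RAGAS can evaluate the quality of generated answers and help improve their accuracy and relevance.
Chatbots: RAGAS can evaluate how a chatbot performs in conversation, checking that responses are natural and semantically consistent.
Content generation: RAGAS can help evaluate the fluency and topical consistency of generated content, improving readability and user satisfaction.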

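4. Advantages of RAGAS

Compared with traditional evaluation methods, RAGAS has the following advantages:

Comprehensiveness: it considers several aspects of a RAG system and provides metrics that cover them.
Automation and efficiency: RAGAS processes large amounts of data automatically and produces evaluation reports quickly, saving time and labor.
Flexible configuration: RAGAS lets users tailor the evaluation criteria to a specific application scenario.

5. Outlook

As RAG technology develops, RAGAS will keep being updated to keep up with changing technical needs. Future versions may introduce more intelligent evaluation features, such as automatic error analysis or sentiment analysis of generated content, further improving the accuracy and usefulness of its evaluations.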

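Evaluation metrics

Official documentation: https://docs.ragas.io/en/latest/concepts/metrics/index.html

Ragas provides many evaluation metrics, and you pick the ones that fit your scenario. This post focuses on the six metrics listed below.
Here is a quick summary to help remember them; for the details, please read the official documentation.

context_precision: whether the retrieved documents are relevant to the user query
context_recall: how well the retrieved documents cover the ground-truth answer
answer_relevancy: how relevant the AI response is to the user query
faithfulness: how well the AI response is grounded in the retrieved documents
answer_correctness, answer_similarity: how closely the AI response matches the ground-truth answer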
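
The summary above can be written down as a small lookup table, which is handy for checking that a dataset has the fields a chosen set of metrics compares. This is a sketch based only on the pairs listed in this post: `METRIC_COMPARES` and `missing_columns` are hypothetical names, and the library itself may require additional columns for a given metric, so treat the official documentation as the authority.

```python
# Which two dataset fields each metric relates, as summarized in this post.
METRIC_COMPARES = {
    "context_precision": ("question", "contexts"),
    "context_recall": ("ground_truth", "contexts"),
    "answer_relevancy": ("question", "answer"),
    "faithfulness": ("contexts", "answer"),
    "answer_correctness": ("answer", "ground_truth"),
    "answer_similarity": ("answer", "ground_truth"),
}


def missing_columns(data_samples, metric_names):
    """Return the fields the chosen metrics compare that data_samples lacks."""
    needed = set()
    for name in metric_names:
        needed.update(METRIC_COMPARES[name])
    return sorted(needed - set(data_samples))


# A dataset without retrieved contexts cannot be scored for faithfulness.
samples = {"question": ["q1"], "answer": ["a1"]}
print(missing_columns(samples, ["faithfulness"]))      # ['contexts']
print(missing_columns(samples, ["answer_relevancy"]))  # []
```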

Faithfulness

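How well the AI response is grounded in the retrieved documents. The score ranges from 0 to 1; a higher score means the answer is more consistent with the retrieved context.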

from datasets import Dataset 
from ragas.metrics import faithfulness
from ragas import evaluate

data_samples = {
    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
    'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
    'contexts' : [['The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles,'], 
    ['The Green Bay Packers...Green Bay, Wisconsin.','The Packers compete...Football Conference']],
}
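# Build a Hugging Face Dataset and score it. The metric is judged by an LLM;
# by default Ragas uses OpenAI models, so OPENAI_API_KEY must be set unless
# you pass your own llm to evaluate().
dataset = Dataset.from_dict(data_samples)
score = evaluate(dataset, metrics=[faithfulness])
score.to_pandas()  # one row per sample, with a faithfulness column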
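
The arithmetic behind a faithfulness score is a simple ratio: the number of claims in the answer that the retrieved context supports, divided by the total number of claims in the answer. The sketch below shows only that ratio. In Ragas the claims and their verdicts come from an LLM; here the verdicts are hand-written, and `faithfulness_score` is a hypothetical helper.

```python
def faithfulness_score(claim_supported):
    """Fraction of answer claims that the retrieved context supports."""
    if not claim_supported:
        return 0.0  # no claims extracted; convention chosen for this sketch
    return sum(claim_supported) / len(claim_supported)


# Hand-written verdicts: an answer with three claims, two of them supported.
print(faithfulness_score([True, True, False]))  # about 0.667

# An answer whose single claim is not backed by the context scores 0.
print(faithfulness_score([False]))  # 0.0
```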

answer_relevancy

How relevant the AI response is to the user query. A higher score means the response addresses the question more directly.

from datasets import Dataset 
from ragas.metrics import answer_relevancy
from ragas import evaluate

data_samples = {
    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
    'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
    'contexts' : [['The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles,'], 
    ['The Green Bay Packers...Green Bay, Wisconsin.','The Packers compete...Football Conference']],
}
dataset = Dataset.from_dict(data_samples)
score = evaluate(dataset,metrics=[answer_relevancy])
score.to_pandas()

Context Precision

Whether the retrieved documents are relevant to the user query.

from datasets import Dataset 
from ragas.metrics import context_precision
from ragas import evaluate

data_samples = {
    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
    'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
    'contexts' : [['The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles,'], 
    ['The Green Bay Packers...Green Bay, Wisconsin.','The Packers compete...Football Conference']],
    'ground_truth': ['The first superbowl was held on January 15, 1967', 'The New England Patriots have won the Super Bowl a record six times']
}
dataset = Dataset.from_dict(data_samples)
score = evaluate(dataset,metrics=[context_precision])
score.to_pandas()
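
Context precision also rewards ranking: relevant chunks that appear earlier in `contexts` raise the score. A minimal sketch of that calculation follows, assuming the usual definition (the mean of precision@k over the ranks that hold a relevant chunk). The relevance flags are hand-written here, whereas Ragas obtains them from an LLM judge; `context_precision_at_k` is a hypothetical helper.

```python
def context_precision_at_k(relevant):
    """relevant[i] is True if the chunk at rank i + 1 is relevant."""
    hits = 0
    weighted = 0.0
    for rank, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            weighted += hits / rank  # precision@rank, counted only at relevant ranks
    return weighted / hits if hits else 0.0


# The same two relevant chunks, ranked first versus ranked last.
print(context_precision_at_k([True, True, False]))  # 1.0
print(context_precision_at_k([False, True, True]))  # about 0.583
```

The two calls contain the same chunks; only the order differs, which is why a retriever that surfaces relevant chunks first scores higher.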

context_recall

How well the retrieved documents cover the ground-truth answer.

from datasets import Dataset 
from ragas.metrics import context_recall
from ragas import evaluate

data_samples = {
    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
    'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
    'contexts' : [['The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles,'], 
    ['The Green Bay Packers...Green Bay, Wisconsin.','The Packers compete...Football Conference']],
    'ground_truth': ['The first superbowl was held on January 15, 1967', 'The New England Patriots have won the Super Bowl a record six times']
}
dataset = Dataset.from_dict(data_samples)
score = evaluate(dataset,metrics=[context_recall])
score.to_pandas()
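
Context recall can be read as: of the statements in the ground-truth answer, what fraction can be attributed to the retrieved contexts? The sketch below computes that fraction and also returns the statements that were not covered, which helps when debugging a retriever. The labels are hand-written placeholders (Ragas gets them from an LLM), and `context_recall_report` is a hypothetical helper.

```python
def context_recall_report(statements):
    """statements: list of (ground_truth_statement, is_supported_by_contexts)."""
    if not statements:
        return 0.0, []
    unsupported = [text for text, supported in statements if not supported]
    score = (len(statements) - len(unsupported)) / len(statements)
    return score, unsupported


# Placeholder statements with hand-assigned verdicts.
labelled = [
    ("Statement A from the ground truth", True),
    ("Statement B from the ground truth", False),
]
score, gaps = context_recall_report(labelled)
print(score)  # 0.5
print(gaps)   # ['Statement B from the ground truth']
```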

answer_correctness

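How closely the AI response matches the ground-truth answer. The score combines two aspects.

Factual similarity: the factual overlap between the generated answer and the ground-truth answer. Facts are classified as follows:

TP (True Positive): facts present in both the ground-truth answer and the generated answer.
FP (False Positive): facts present in the generated answer but not in the ground-truth answer.
FN (False Negative): facts present in the ground-truth answer but not in the generated answer.

These counts are combined into an F1 score:

F1 = |TP| / (|TP| + 0.5 × (|FP| + |FN|))

The 0.5 factor on FP and FN can look like a softened penalty, but this expression is algebraically the same as the standard F1 score, 2|TP| / (2|TP| + |FP| + |FN|). It rewards correct facts (TP) while still penalizing both unsupported facts (FP) and omitted facts (FN).

Semantic similarity: the ground-truth answer and the generated answer are embedded with the specified embedding model, and the cosine similarity between the two vectors is computed.

The two aspects are combined as a weighted average to produce the final Answer Correctness score. A "threshold" can also be used to round the result to a binary value (correct or incorrect).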
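
The pieces described above can be sketched in a few lines: an F1 score from the TP/FP/FN counts, a weighted average with the semantic similarity, and an optional threshold. The counts and the similarity value are invented, the function names are hypothetical, and the `(0.75, 0.25)` weights and `0.5` threshold are assumptions made for this example rather than values confirmed by this post.

```python
def f1_from_counts(tp, fp, fn):
    """F1 written as TP / (TP + 0.5 * (FP + FN))."""
    denom = tp + 0.5 * (fp + fn)
    return tp / denom if denom else 0.0


def answer_correctness_score(tp, fp, fn, semantic_similarity, weights=(0.75, 0.25)):
    """Weighted average of factual F1 and semantic similarity (weights assumed)."""
    w_fact, w_sem = weights
    factual = f1_from_counts(tp, fp, fn)
    return (w_fact * factual + w_sem * semantic_similarity) / (w_fact + w_sem)


def to_binary(score, threshold=0.5):
    """Round a score to 1 (correct) or 0 (incorrect) using a threshold."""
    return 1 if score >= threshold else 0


# Invented example: one shared fact, one extra fact, one missing fact.
tp, fp, fn = 1, 1, 1
print(f1_from_counts(tp, fp, fn))          # 0.5
print(2 * tp / (2 * tp + fp + fn))         # 0.5, the standard F1 form gives the same value

score = answer_correctness_score(tp, fp, fn, semantic_similarity=0.9)
print(score)             # about 0.6
print(to_binary(score))  # 1
```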

from datasets import Dataset 
from ragas.metrics import answer_correctness
from ragas import evaluate

data_samples = {
    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
    'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
    'ground_truth': ['The first superbowl was held on January 15, 1967', 'The New England Patriots have won the Super Bowl a record six times']
}
dataset = Dataset.from_dict(data_samples)
score = evaluate(dataset,metrics=[answer_correctness])
score.to_pandas()

Answer semantic similarity

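How closely the AI response matches the ground-truth answer, measured as the semantic similarity between the two.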

from datasets import Dataset 
from ragas.metrics import answer_similarity
from ragas import evaluate


data_samples = {
    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
    'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
    'ground_truth': ['The first superbowl was held on January 15, 1967', 'The New England Patriots have won the Super Bowl a record six times']
}
dataset = Dataset.from_dict(data_samples)
score = evaluate(dataset,metrics=[answer_similarity])
score.to_pandas()

Finally, here is sample code that combines several evaluation metrics:

from datasets import Dataset
from ragas import evaluate
from langchain.schema import SystemMessage, HumanMessage
from ragas.metrics import (
    context_precision,
    answer_relevancy,
    faithfulness,
    context_recall,
    answer_correctness
)
from ragas.metrics.critique import harmfulness
from ragas.run_config import RunConfig

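# Placeholder questions and their ground-truth answers (same order, same length).
questions = [
    "Question 1",
    "Question 2"
]

ground_truths = [
    "Ground truth 1",
    "Ground truth 2"
]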


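def summary_chain(query: str):
    # retrieve() and llm are not defined in this post: they stand for your own
    # retriever (returning documents with a .page_content attribute) and chat model.
    top_k = retrieve(query)
    related_chunks = "\n\n".join([doc.page_content for doc in top_k])
    messages = [
        SystemMessage(content="You are an expert on banking regulations."),
        HumanMessage(content=f"Answer my question based on the following information:\n\n{related_chunks}\n\nQuestion: {query}")
    ]
    response = llm(messages=messages)
    return {
        "answer": response.content.strip(),
        "context": top_k  # the retrieved documents; their text is extracted later
    }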


data_samples = {
    "question": [],
    "answer": [],
    "ground_truth": [],
    "contexts": []
}

for question, ground_truth in zip(questions, ground_truths):
    result = summary_chain(question)
    print('result:', result)

    contexts = [doc.page_content for doc in result['context']]
    print('contexts:', contexts)  
    print(len(contexts))  
    data_samples["question"].append(question)
    data_samples["answer"].append(result['answer'])
    data_samples["ground_truth"].append(ground_truth)
    data_samples["contexts"].append(contexts)


dataset = Dataset.from_dict(data_samples)
print('dataset:', dataset)
metrics = [
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    harmfulness,
    answer_correctness
]
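# critic_llm and aoai_embeddings are not defined in this post: they stand for the
# judge LLM and the embedding model that Ragas should use for scoring.
evaluation_result = evaluate(
    dataset=dataset,
    metrics=metrics,
    llm=critic_llm,
    embeddings=aoai_embeddings,
    run_config=RunConfig(max_workers=4, max_wait=180, log_tenacity=True, max_retries=3)
)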

print(evaluation_result)
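
The part of the script most likely to go wrong is the shape of `data_samples`: every column must have one entry per question, and `contexts` must be a list of lists of strings. The sketch below reproduces just the collection loop with stand-in components so that shape can be checked offline, without an LLM or Ragas. `Doc`, the stub `retrieve`, the stub `summary_chain`, and `build_samples` are all hypothetical and only mimic the interfaces used in the code above.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    page_content: str  # stand-in for a retrieved document


def retrieve(query):
    """Stub retriever: returns two fake chunks for any query."""
    return [Doc(f"chunk A for {query}"), Doc(f"chunk B for {query}")]


def summary_chain(query):
    """Stub chain: same return shape as the real one, no LLM call."""
    top_k = retrieve(query)
    return {"answer": f"stub answer to {query}", "context": top_k}


def build_samples(questions, ground_truths):
    if len(questions) != len(ground_truths):
        raise ValueError("questions and ground_truths must have the same length")
    samples = {"question": [], "answer": [], "ground_truth": [], "contexts": []}
    for question, ground_truth in zip(questions, ground_truths):
        result = summary_chain(question)
        samples["question"].append(question)
        samples["answer"].append(result["answer"])
        samples["ground_truth"].append(ground_truth)
        # One list of plain strings per question.
        samples["contexts"].append([doc.page_content for doc in result["context"]])
    return samples


samples = build_samples(["Question 1", "Question 2"], ["Ground truth 1", "Ground truth 2"])
print({column: len(values) for column, values in samples.items()})
print(samples["contexts"][0])
```

Swapping the stubs for a real retriever and chat model leaves `build_samples` unchanged, and the resulting dictionary can be passed to `Dataset.from_dict` as in the script above.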