iT邦幫忙

2024 iThome Ironman Contest

DAY 27
Generative AI

Series: LLM Applications, Development Frameworks, RAG Optimization, and Evaluation Methods, Part 27

Day27 The GAI Explosion Era - Introduction to Ragas


Introduction to Ragas

https://ithelp.ithome.com.tw/upload/images/20240826/20168537MDkfzhCCvR.png

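RAGAS (Retrieval-Augmented Generation Assessment) is a tool built specifically for evaluating the performance of RAG (Retrieval-Augmented Generation) systems. It aims to provide comprehensive, accurate measurements that help developers and researchers understand and improve how their RAG systems perform. A detailed introduction follows:

1. Design goals of RAGAS

RAGAS aims to give RAG systems a standardized evaluation framework that measures performance along several axes, such as response accuracy, relevance, fluency, and semantic similarity. With these measurements, developers can tune and optimize the system, improving both the user experience and the system's practical usefulness.

2. Main features of RAGAS

RAGAS offers a range of features, concentrated in the following areas:

a. Multi-dimensional evaluation

RAGAS supports evaluating a RAG system along multiple dimensions, each representing a different aspect of its performance:

  • Semantic Similarity: evaluates how close the generated response is in meaning to the reference answer. This is usually measured with semantic embeddings, for example by using a pretrained model such as BERT to compute the cosine similarity of the two sentences.
  • Accuracy: evaluates whether the generated answer is correct at the information level. The focus is on whether the content is factually and logically consistent with the question.
  • Relevance: evaluates how closely the generated answer relates to the input question. High relevance means the answer directly addresses what the user asked.
  • Fluency: evaluates how natural the language of the generated answer is, mainly whether its grammar and word choice follow the norms of the language.
  • Coverage: evaluates how much of the information the question calls for is covered by the answer, so that the answer responds to the user's need fully and completely.

b. Semantic similarity

Semantic similarity is a key evaluation metric in RAGAS. RAGAS computes the semantic similarity between the generated answer and the reference answer, typically by means of semantic embeddings. For example, RAGAS may use BERT or another language model to convert the sentences into vectors and then compute the cosine similarity between those vectors. This captures the meaning of an answer better than traditional word-matching methods.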
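The cosine-similarity step described above can be sketched in a few lines. This is a minimal illustration, not Ragas code: the three-dimensional vectors are made-up stand-ins for real sentence embeddings, and the `cosine_similarity` helper is written here only for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy "embeddings" (assumed values); real ones have hundreds of dimensions.
answer_vec = [1.0, 2.0, 3.0]
reference_vec = [2.0, 4.0, 6.0]   # points in the same direction as answer_vec
unrelated_vec = [3.0, 0.0, -1.0]  # orthogonal to answer_vec

same_direction = cosine_similarity(answer_vec, reference_vec)
orthogonal = cosine_similarity(answer_vec, unrelated_vec)
print(same_direction, orthogonal)
```

Vectors pointing the same way score close to 1 regardless of their length, and orthogonal vectors score close to 0, which is why the measure suits comparing meaning rather than wording.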

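c. Automated evaluation

RAGAS supports large-scale automated evaluation: it can process a large number of test samples and produce a detailed evaluation report. This matters to developers who adjust and test their RAG systems frequently, because it significantly shortens testing time and improves testing efficiency.

d. Flexibility and configurability

RAGAS is designed to be flexible and configurable. Users can customize the evaluation metrics and weights to fit different application scenarios. For example, some applications care more about the accuracy of answers, while others care more about fluency and semantic similarity. RAGAS lets users configure the evaluation accordingly, so that the results match their actual needs.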
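One way to picture the weighting idea is a weighted average over per-metric scores. The sketch below is an assumption for illustration only: `weighted_score` is a hypothetical helper applied after evaluation, not a Ragas function, and the scores and weights are invented.

```python
def weighted_score(scores, weights):
    """Weighted average of metric scores; weights are normalized to sum to 1."""
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"no score for metrics: {sorted(missing)}")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * w for name, w in weights.items()) / total


# Invented per-metric scores for one evaluation run.
scores = {"faithfulness": 0.5, "answer_relevancy": 1.0, "context_recall": 0.75}

# An application that cares mostly about factual grounding.
accuracy_first = {"faithfulness": 3, "answer_relevancy": 1}
overall = weighted_score(scores, accuracy_first)
print(overall)
```

Metrics left out of the weights simply do not contribute, so each application can keep its own weight table without changing how the metrics are computed.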

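3. Application scenarios

RAGAS fits a wide range of applications that involve information retrieval and generation, such as:

  • Question answering systems: RAGAS can evaluate the quality of the answers a QA system generates, helping to improve their accuracy and relevance.
  • Chatbots: RAGAS can evaluate how a chatbot performs in conversation, checking that its responses are natural and semantically consistent.
  • Content generation: in content generation applications, RAGAS can help evaluate the fluency and topical consistency of the generated content, improving readability and user satisfaction.

4. Advantages of RAGAS

Compared with traditional evaluation methods, RAGAS has the following advantages:

  • Comprehensive: it considers multiple aspects of a RAG system and provides a full set of evaluation metrics.
  • Automated and efficient: RAGAS processes large amounts of data automatically and generates evaluation reports quickly, saving time and labor.
  • Flexible configuration: RAGAS lets users tailor the evaluation criteria to their specific application scenario.

5. Future outlook

As RAG technology develops, RAGAS will keep being updated and refined to keep up with changing technical needs. Future versions may introduce more intelligent evaluation features, such as automatic error analysis and sentiment analysis of generated content, further improving the accuracy and usefulness of its evaluations.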

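Evaluation Metrics

https://ithelp.ithome.com.tw/upload/images/20240826/20168537e6Wddh3bfP.png

  • Official documentation: https://docs.ragas.io/en/latest/concepts/metrics/index.html

Ragas provides many evaluation metrics, and you can choose which ones to apply depending on the scenario. This post focuses on the six metrics shown in the figure.
Here is a quick summary to help you remember them; for the details, please see the official documentation.

  • context_precision: whether the retrieved context is relevant to the user query
  • context_recall: how well the retrieved context covers the ground truth
  • answer_relevancy: how relevant the AI response is to the user query
  • faithfulness: how well the AI response is grounded in the retrieved context
  • answer_correctness, answer_similarity: how closely the AI response matches the ground truth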
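As a memory aid, the summary above can be written as a small lookup table. This is only a sketch of the pairs listed in this post: the column names follow the example datasets used later in the post, and the columns a given Ragas version actually requires for each metric may be broader, so treat the table as an assumption and check the official documentation.

```python
# Which two pieces of data each metric compares, per the summary list above.
COMPARES = {
    "context_precision": ("question", "contexts"),
    "context_recall": ("ground_truth", "contexts"),
    "answer_relevancy": ("question", "answer"),
    "faithfulness": ("contexts", "answer"),
    "answer_correctness": ("answer", "ground_truth"),
    "answer_similarity": ("answer", "ground_truth"),
}


def metrics_involving(column):
    """Names of the metrics whose comparison includes the given column."""
    return sorted(name for name, pair in COMPARES.items() if column in pair)


compared_with_ground_truth = metrics_involving("ground_truth")
retrieval_side = metrics_involving("contexts")
print(compared_with_ground_truth)
print(retrieval_side)
```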

Faithfulness

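Measures how well the AI response is grounded in the retrieved context. The score ranges from 0 to 1; the higher the score, the more consistent the answer is with the context.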
https://ithelp.ithome.com.tw/upload/images/20240826/201685372ekC8cumFy.png

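# Requires the `datasets` and `ragas` packages. Assuming Ragas's default
# OpenAI-backed judge, OPENAI_API_KEY must be set in the environment;
# pass llm= and embeddings= to evaluate() to use another model.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness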

data_samples = {
    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
    'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
    'contexts' : [['The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles,'], 
    ['The Green Bay Packers...Green Bay, Wisconsin.','The Packers compete...Football Conference']],
}
dataset = Dataset.from_dict(data_samples)
# Score each answer against the contexts retrieved for its question.
score = evaluate(dataset, metrics=[faithfulness])
score.to_pandas()  # one row per sample
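The arithmetic behind faithfulness is a simple ratio: the number of claims in the answer that the retrieved context supports, divided by the total number of claims in the answer. In Ragas an LLM extracts the claims and judges them; in the sketch below the verdicts are assigned by hand for illustration, so the real judge may decide differently, and `faithfulness_score` is a hypothetical helper.

```python
def faithfulness_score(verdicts):
    """Fraction of the answer's claims that the retrieved context supports."""
    if not verdicts:
        raise ValueError("the answer produced no claims to check")
    return sum(1 for supported in verdicts if supported) / len(verdicts)


# Hand-assigned verdicts (assumptions) for the two sample answers above.
# Sample 1: the date claim appears in the retrieved context.
first = faithfulness_score([True])
# Sample 2: the Patriots claim is not in the Green Bay Packers context.
second = faithfulness_score([False])
# A made-up longer answer: three claims, two of them supported.
mixed = faithfulness_score([True, True, False])
print(first, second, mixed)
```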

answer_relevancy

Measures how relevant the AI response is to the user query; a higher score means the answer addresses the question more directly.
https://ithelp.ithome.com.tw/upload/images/20240826/20168537xnMxvFamaS.png

from datasets import Dataset 
from ragas.metrics import answer_relevancy
from ragas import evaluate

data_samples = {
    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
    'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
    'contexts' : [['The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles,'], 
    ['The Green Bay Packers...Green Bay, Wisconsin.','The Packers compete...Football Conference']],
}
dataset = Dataset.from_dict(data_samples)
score = evaluate(dataset,metrics=[answer_relevancy])
score.to_pandas()
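As the official documentation describes it, answer relevancy is computed by having an LLM generate several questions from the answer, embedding them, and averaging their cosine similarity to the original question. The sketch below shows only that averaging step; the two-dimensional vectors are invented stand-ins for embeddings, and both helper functions are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero, equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def answer_relevancy_score(question_vec, generated_question_vecs):
    """Mean similarity between the real question and questions derived from the answer."""
    if not generated_question_vecs:
        raise ValueError("need at least one generated question")
    sims = [cosine_similarity(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims)


# Toy embeddings (assumed values): the original question and three questions
# that an LLM might reconstruct from the answer.
question_vec = [1.0, 0.0]
generated = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
score = answer_relevancy_score(question_vec, generated)
print(round(score, 3))
```

An answer that stays on topic leads to reconstructed questions close to the original, which pushes the average toward 1; an off-topic or padded answer pulls it down.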

Context Precision

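Measures whether the retrieved context is relevant to the user query, and whether the relevant chunks are ranked near the top.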
https://ithelp.ithome.com.tw/upload/images/20240826/20168537b3KfAteJpJ.png

from datasets import Dataset 
from ragas.metrics import context_precision
from ragas import evaluate

data_samples = {
    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
    'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
    'contexts' : [['The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles,'], 
    ['The Green Bay Packers...Green Bay, Wisconsin.','The Packers compete...Football Conference']],
    'ground_truth': ['The first superbowl was held on January 15, 1967', 'The New England Patriots have won the Super Bowl a record six times']
}
dataset = Dataset.from_dict(data_samples)
score = evaluate(dataset,metrics=[context_precision])
score.to_pandas()
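The ranking aspect can be made concrete with the Context Precision@K formula from the official documentation: sum Precision@k over the ranks k that hold a relevant chunk, then divide by the number of relevant chunks. In Ragas an LLM decides which chunks are relevant; the 0/1 flags below are assumed inputs, and `context_precision_at_k` is a hypothetical helper written for this sketch.

```python
def context_precision_at_k(relevance):
    """relevance: 0/1 flags for the retrieved chunks, in rank order."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    hits = 0
    weighted = 0.0
    for k, v in enumerate(relevance, start=1):
        hits += v
        # Precision@k only counts at ranks that hold a relevant chunk.
        weighted += (hits / k) * v
    return weighted / total_relevant


relevant_first = context_precision_at_k([1, 0])  # relevant chunk ranked first
relevant_last = context_precision_at_k([0, 1])   # same chunk ranked second
print(relevant_first, relevant_last)
```

The same set of chunks scores higher when the relevant one comes first, which is the behavior this metric is meant to reward.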

context_recall

Measures how much of the ground truth can be found in the retrieved context.

https://ithelp.ithome.com.tw/upload/images/20240826/20168537Pk9cgi6RVy.png

from datasets import Dataset 
from ragas.metrics import context_recall
from ragas import evaluate

data_samples = {
    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
    'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
    'contexts' : [['The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles,'], 
    ['The Green Bay Packers...Green Bay, Wisconsin.','The Packers compete...Football Conference']],
    'ground_truth': ['The first superbowl was held on January 15, 1967', 'The New England Patriots have won the Super Bowl a record six times']
}
dataset = Dataset.from_dict(data_samples)
score = evaluate(dataset,metrics=[context_recall])
score.to_pandas()
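Context recall is also a ratio: the number of ground-truth sentences that can be attributed to the retrieved context, divided by the total number of ground-truth sentences. Ragas uses an LLM for the attribution; the flags below are assigned by hand as assumptions, and `context_recall_score` is a hypothetical helper.

```python
def context_recall_score(attributed):
    """attributed: maps each ground-truth sentence to True if the context supports it."""
    if not attributed:
        raise ValueError("the ground truth has no sentences to check")
    return sum(1 for ok in attributed.values() if ok) / len(attributed)


# Hand-assigned attributions (assumptions) for the two samples above.
sample_1 = {"The first superbowl was held on January 15, 1967": True}
sample_2 = {"The New England Patriots have won the Super Bowl a record six times": False}

recall_1 = context_recall_score(sample_1)
recall_2 = context_recall_score(sample_2)
print(recall_1, recall_2)
```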

answer_correctness

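Measures how closely the AI response matches the ground truth.

  • TP (True Positive): facts present in both the ground truth and the generated answer.
  • FP (False Positive): facts present in the generated answer but not in the ground truth.
  • FN (False Negative): facts present in the ground truth but not in the generated answer.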

https://ithelp.ithome.com.tw/upload/images/20240826/20168537VWuIoJcytB.png

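This is the F1 score written directly in terms of TP, FP, and FN: F1 = TP / (TP + 0.5 × (FP + FN)). The 0.5 weight on FP and FN is not an extra discount on errors. It follows from F1 being the harmonic mean of precision and recall, so the value is the same as the standard F1 = 2PR / (P + R). What this gives us:

  1. Both kinds of error count: facts the answer adds that are not in the ground truth (FP) and facts it leaves out (FN) each lower the score.
  2. Balanced weighting: FP and FN carry the same weight, so the score rewards correct facts (TP) without favoring either precision or recall.

The score is 1 only when the facts in the answer and in the ground truth match exactly, and 0 when they share no facts.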
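A quick numeric check helps with the 0.5 factor. The sketch below computes the score once from the TP/FP/FN counts and once from precision and recall; the counts are invented, and both helpers are written only for this example. For these inputs the two forms agree, as they do for any counts where TP is positive.

```python
def f1_from_counts(tp, fp, fn):
    """F1 written with the 0.5 weight on FP and FN."""
    denom = tp + 0.5 * (fp + fn)
    return tp / denom if denom else 0.0


def f1_from_precision_recall(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented counts: two shared facts, one extra fact, one missing fact.
by_counts = f1_from_counts(tp=2, fp=1, fn=1)
by_pr = f1_from_precision_recall(tp=2, fp=1, fn=1)
print(by_counts, by_pr)
```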

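Answer Correctness combines two aspects:

  1. Factual similarity: the factual overlap between the generated answer and the ground truth, measured with the F1 score above.
  2. Semantic similarity: the ground truth and the generated answer are vectorized with the specified embedding model, and the cosine similarity between the two vectors is computed.

The two aspects are combined with a weighted average to give the final Answer Correctness score. Users can also apply a "threshold" to round the result to a binary value (correct or incorrect).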
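The weighted combination can be sketched as follows. The default weights of 0.75 for the factual part and 0.25 for the semantic part reflect my understanding of the library's defaults and should be treated as an assumption to verify against the official documentation; applying the threshold to the combined score is likewise one reading of the description, and `answer_correctness_score` is a hypothetical helper, not the Ragas implementation.

```python
def answer_correctness_score(f1, similarity, weights=(0.75, 0.25), threshold=None):
    """Weighted average of factual F1 and semantic similarity, optionally binarized."""
    w_factual, w_semantic = weights
    total = w_factual + w_semantic
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    score = (w_factual * f1 + w_semantic * similarity) / total
    if threshold is not None:
        # Round to a binary verdict: 1.0 means "correct", 0.0 means "incorrect".
        return 1.0 if score >= threshold else 0.0
    return score


# Invented inputs: half the facts overlap, but the wording is very close.
raw = answer_correctness_score(f1=0.5, similarity=1.0)
strict = answer_correctness_score(f1=0.5, similarity=1.0, threshold=0.7)
lenient = answer_correctness_score(f1=0.5, similarity=1.0, threshold=0.6)
print(raw, strict, lenient)
```

With these weights a fluent but factually incomplete answer cannot score highly, because the factual part dominates the average.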

from datasets import Dataset 
from ragas.metrics import answer_correctness
from ragas import evaluate

data_samples = {
    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
    'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
    'ground_truth': ['The first superbowl was held on January 15, 1967', 'The New England Patriots have won the Super Bowl a record six times']
}
dataset = Dataset.from_dict(data_samples)
score = evaluate(dataset,metrics=[answer_correctness])
score.to_pandas()

Answer semantic similarity

Measures how close the AI response is in meaning to the ground truth.

from datasets import Dataset 
from ragas.metrics import answer_similarity
from ragas import evaluate


data_samples = {
    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
    'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
    'ground_truth': ['The first superbowl was held on January 15, 1967', 'The New England Patriots have won the Super Bowl a record six times']
}
dataset = Dataset.from_dict(data_samples)
score = evaluate(dataset,metrics=[answer_similarity])
score.to_pandas()

Finally, here is example code that combines several evaluation metrics:

from datasets import Dataset
from ragas import evaluate
from langchain.schema import SystemMessage, HumanMessage
from ragas.metrics import (
    context_precision,
    answer_relevancy,
    faithfulness,
    context_recall,
    answer_correctness
)
from ragas.metrics.critique import harmfulness
from ragas.run_config import RunConfig

questions = [
    "Question 1",
    "Question 2"
]

ground_truths = [
    "Ground truth 1",
    "Ground truth 2"
]


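def summary_chain(query: str):
    # `retrieve` and `llm` are not defined in this post: they stand for your own
    # retriever (returning LangChain Document objects) and LangChain chat model.
    top_k = retrieve(query)  # top-k retrieved chunks, as Document objects
    related_chunks = "\n\n".join([doc.page_content for doc in top_k])
    messages = [
        SystemMessage(content="You are an expert on banking regulations."),
        HumanMessage(content=f"Please answer my question based on the following information:\n\n{related_chunks}\n\nQuestion: {query}")
    ]
    response = llm.invoke(messages)
    return {
        "answer": response.content.strip(),
        "context": top_k  # keep the Documents so the caller can read page_content
    }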


data_samples = {
    "question": [],
    "answer": [],
    "ground_truth": [],
    "contexts": []
}

for question, ground_truth in zip(questions, ground_truths):
    result = summary_chain(question)
    print('result:', result)

    contexts = [doc.page_content for doc in result['context']]
    print('contexts:', contexts)  
    print(len(contexts))  
    data_samples["question"].append(question)
    data_samples["answer"].append(result['answer'])
    data_samples["ground_truth"].append(ground_truth)
    data_samples["contexts"].append(contexts)


dataset = Dataset.from_dict(data_samples)
print('dataset:', dataset)
metrics = [
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    harmfulness,
    answer_correctness
]
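# `critic_llm` and `aoai_embeddings` are not defined in this post: they stand for
# the judge LLM and the embedding model (judging by the name, Azure OpenAI).
evaluation_result = evaluate(
    dataset=dataset,
    metrics=metrics,
    llm=critic_llm,
    embeddings=aoai_embeddings,
    run_config=RunConfig(max_workers=4, max_wait=180, log_tenacity=True, max_retries=3)
)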

print(evaluation_result)
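To try the data-collection loop without a retriever or an LLM, the pieces this post leaves undefined can be replaced with stand-ins. Everything in the sketch below is an assumption made for illustration: `Document` mimics only the `page_content` attribute of a LangChain document, and `retrieve` and `fake_chain` return canned text. The point is the shape of the result: every column has one entry per question, and each `contexts` entry is a list of strings.

```python
from dataclasses import dataclass


@dataclass
class Document:
    """Stand-in for a retrieved document; only page_content is modeled."""
    page_content: str


def retrieve(query):
    """Stand-in retriever that returns two canned chunks."""
    return [Document(f"chunk A about {query}"), Document(f"chunk B about {query}")]


def fake_chain(query):
    """Stand-in for summary_chain: no LLM call, just a canned answer."""
    top_k = retrieve(query)
    return {"answer": f"stub answer to: {query}", "context": top_k}


questions = ["Question 1", "Question 2"]
ground_truths = ["Ground truth 1", "Ground truth 2"]

data_samples = {"question": [], "answer": [], "ground_truth": [], "contexts": []}
for question, ground_truth in zip(questions, ground_truths):
    result = fake_chain(question)
    data_samples["question"].append(question)
    data_samples["answer"].append(result["answer"])
    data_samples["ground_truth"].append(ground_truth)
    # One list of plain strings per question, not Document objects.
    data_samples["contexts"].append([doc.page_content for doc in result["context"]])

lengths = {name: len(column) for name, column in data_samples.items()}
print(lengths)
```

Once the columns line up like this, swapping the stand-ins for a real retriever and chat model leaves the rest of the loop unchanged.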

That wraps up this introduction to evaluation with Ragas!

