Generative AI development relies heavily on experimentation and iterative tuning. Because different model architectures, prompt strategies, and training parameters all affect model performance, the results of every experiment need to be recorded and compared in detail. As GenAI applications grow in scale, so do the challenges developers face: tracking experiment results effectively, managing resource consumption, and making sure model performance keeps improving over time all call for proper tools and methods.
In past projects I tried to build an automated logging tool of my own, but the volume of data to record and the complexity of the table schemas meant it was never finished. In the course, the author simply uses an off-the-shelf tool, CometML. Since I am more familiar with RAG than with fine-tuning, the walkthrough below follows the course material but adapts it to managing RAG experiments and evaluating their performance:
The following Python script sets up a RAG pipeline and logs the experiment with CometML:
import time
import psutil
import yaml
from typing import Any, Dict, List, Tuple, Union
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.prompts import PromptTemplate
from comet_ml import Experiment
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
class RAGExperiment:
    # Initialization: retriever, chain, and model setup
    def __init__(self, config: Dict[str, Any]):
        self.config = config
        self.experiment = self._init_experiment()
        self.texts = self._process_documents()
        self.retriever = self._setup_retriever()
        self.qa_chain = self._setup_rag_chain()
        self.sentence_model = SentenceTransformer(self.config['retriever']['embedding_model'])
    # Initialization: experiment logging
    def _init_experiment(self) -> Experiment:
        experiment = Experiment(
            api_key=self.config['comet_ml']['api_key'],
            project_name=self.config['comet_ml']['project_name'],
            workspace=self.config['comet_ml']['workspace']
        )
        # Use an explicit experiment_name from the config if provided, otherwise a timestamp
        experiment.set_name(self.config['comet_ml'].get(
            'experiment_name', f"RAG_Experiment_{time.strftime('%Y%m%d_%H%M%S')}"))
        experiment.log_parameters(self.config)
        return experiment
    # Data preprocessing
    def _process_documents(self) -> List[Any]:
        loader = TextLoader(self.config['data']['file_path'])
        documents = loader.load()
        text_splitter = CharacterTextSplitter(
            chunk_size=self.config['data']['chunk_size'],
            chunk_overlap=self.config['data']['chunk_overlap']
        )
        texts = text_splitter.split_documents(documents)
        self.experiment.log_parameter("document_count", len(texts))
        return texts
    # Set up the retriever
    def _setup_retriever(self) -> Any:
        embeddings = HuggingFaceEmbeddings(model_name=self.config['retriever']['embedding_model'])
        vectorstore = Chroma.from_documents(self.texts, embeddings)
        return vectorstore.as_retriever(search_kwargs={"k": self.config['retriever']['k']})
    # Set up the RAG chain
    def _setup_rag_chain(self) -> RetrievalQA:
        llm = OpenAI(temperature=self.config['llm']['temperature'])
        prompt_template = PromptTemplate(
            input_variables=["context", "question"],
            template=self.config['rag']['prompt_template']
        )
        # log_text takes the text as its first argument; tag it via metadata
        self.experiment.log_text(prompt_template.template, metadata={"name": "prompt_template"})
        return RetrievalQA.from_chain_type(
            llm=llm,
            chain_type=self.config['rag']['chain_type'],
            retriever=self.retriever,
            return_source_documents=True,
            chain_type_kwargs={"prompt": prompt_template}
        )
    # Metric: query-answer relevance
    def calculate_relevance_score(self, query: str, answer: str) -> float:
        query_embedding = self.sentence_model.encode([query])
        answer_embedding = self.sentence_model.encode([answer])
        relevance_score = cosine_similarity(query_embedding, answer_embedding)[0][0]
        return float(relevance_score)
    # Run a single query and log its metrics
    def run_query(self, query: str) -> Tuple[Dict[str, Any], float, int, float]:
        start_time = time.time()
        result = self.qa_chain({"query": query})
        end_time = time.time()
        latency = end_time - start_time
        answer = result["result"]
        answer_length = len(answer.split())
        relevance_score = self.calculate_relevance_score(query, answer)
        self.experiment.log_metric("query_latency", latency)
        self.experiment.log_metric("answer_length", answer_length)
        self.experiment.log_metric("relevance_score", relevance_score)
        self.experiment.log_text(query, metadata={"name": "query"})
        self.experiment.log_text(answer, metadata={"name": "answer"})
        self.experiment.log_text(
            "\n\n".join([doc.page_content for doc in result["source_documents"]]),
            metadata={"name": "retrieved_documents"}
        )
        # Defensive: only present if the chain happens to expose the assembled prompt
        if 'full_prompt' in result:
            self.experiment.log_text(result['full_prompt'], metadata={"name": "full_prompt"})
        return result, latency, answer_length, relevance_score
    # Run the evaluation queries and report results
    def run_evaluation(self):
        for query in self.config['evaluation']['queries']:
            result, latency, answer_length, relevance_score = self.run_query(query)
            print(f"Query: {query}")
            print(f"Answer: {result['result']}")
            print(f"Latency: {latency:.2f} seconds")
            print(f"Answer Length: {answer_length} words")
            print(f"Relevance Score: {relevance_score:.4f}\n")
    # Log the host's CPU and memory usage
    def log_system_metrics(self):
        self.experiment.log_metric("cpu_usage", psutil.cpu_percent())
        self.experiment.log_metric("memory_usage", psutil.virtual_memory().percent)

    # Log system metrics and close the experiment
    def end_experiment(self):
        self.log_system_metrics()
        self.experiment.end()
def load_config(config_path: str) -> Dict[str, Any]:
    with open(config_path, 'r') as f:
        return yaml.safe_load(f)

def run_experiment(config_or_path: Union[str, Dict[str, Any]]):
    # Accept either a path to a YAML file or an already-loaded config dict,
    # so programmatically generated configs can be run without writing them to disk
    config = load_config(config_or_path) if isinstance(config_or_path, str) else config_or_path
    rag_experiment = RAGExperiment(config)
    rag_experiment.run_evaluation()
    rag_experiment.end_experiment()

if __name__ == "__main__":
    run_experiment('config.yaml')
To make it easy to configure parameters across experiments, the experiment settings are managed in a YAML file. This makes it simple to adjust and compare settings when testing different prompts.
config.yaml
This is a basic configuration file containing all the parameters needed to run the RAG experiment:
comet_ml:
  api_key: "your_api_key"
  project_name: "RAG_Optimization"
  workspace: "your_workspace"
data:
  file_path: "path/to/your/file.txt"
  chunk_size: 1000
  chunk_overlap: 0
retriever:
  embedding_model: "sentence-transformers/all-MiniLM-L6-v2"
  k: 4
llm:
  temperature: 0.7
rag:
  chain_type: "stuff"
  prompt_template: |
    Use the following pieces of context to answer the question at the end.
    If you don't know the answer, just say that you don't know, don't try to make up an answer.
    {context}
    Question: {question}
    Answer:
evaluation:
  queries:
    - "What is the capital of France?"
    - "Explain the theory of relativity."
    - "Who wrote 'To Kill a Mockingbird'?"
To compare the effect of different prompts, we can create another configuration file that mainly changes the prompt template:
# ... other settings stay the same ...
rag:
  chain_type: "stuff"
  prompt_template: |
    You are a helpful AI assistant. Use the provided context to answer the question.
    If the context doesn't contain relevant information, say "I don't have enough information to answer this question."
    Context:
    {context}
    Human: {question}
    AI:
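Since the fragment above only overrides the rag section, an alternative to duplicating the whole file is a small merge helper that overlays a prompt-only override onto a shared base config. This is a sketch under the assumption that overrides are plain nested dicts; base_config.yaml and prompt_override.yaml are illustrative file names:
import copy
import yaml

def merge_configs(base: dict, override: dict) -> dict:
    # Recursively overlay `override` onto a deep copy of `base`.
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_configs(merged[key], value)
        else:
            merged[key] = value
    return merged

# Illustrative usage: base_config.yaml holds the shared settings,
# prompt_override.yaml contains only the rag section shown above.
with open('base_config.yaml') as f:
    base = yaml.safe_load(f)
with open('prompt_override.yaml') as f:
    override = yaml.safe_load(f)
config = merge_configs(base, override)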
By modifying the main program, we can systematically compare how the different prompts perform:
if __name__ == "__main__":
    prompt_configs = ['config_prompt1.yaml', 'config_prompt2.yaml']
    for config_path in prompt_configs:
        print(f"Running experiment with config: {config_path}")
        run_experiment(config_path)
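Once both runs finish, the logged metrics can be pulled back from Comet for a side-by-side comparison. A rough sketch using Comet's read API follows; the workspace and project names are the ones from config.yaml, and the exact method names (get_experiments, get_metrics, get_name) are an assumption to verify against the comet_ml documentation:
from comet_ml.api import API

# Assumption: COMET_API_KEY is set in the environment; workspace and project
# names match the values in config.yaml.
api = API()
experiments = api.get_experiments("your_workspace", project_name="RAG_Optimization")
for exp in experiments:
    # get_metrics returns the values logged under the given metric name
    scores = [float(m["metricValue"]) for m in exp.get_metrics("relevance_score")]
    if scores:
        print(f"{exp.get_name()}: mean relevance = {sum(scores) / len(scores):.4f}")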
We can also generate and test multiple prompt template variations automatically to search for the best combination:
def generate_prompts():
    base_template = "Use the following context to answer the question.\n\nContext: {context}\n\nQuestion: {question}\nAnswer:"
    variations = [
        "Be concise and specific.",
        "Provide a detailed explanation.",
        "If you're unsure, say so.",
        # ...
    ]
    # Prepend each instruction so it does not end up after the "Answer:" cue
    return [variation + "\n\n" + base_template for variation in variations]

if __name__ == "__main__":
    prompts = generate_prompts()
    for i, prompt_template in enumerate(prompts):
        config = load_config('base_config.yaml')
        config['rag']['prompt_template'] = prompt_template
        experiment_name = f"Prompt_Variation_{i}"
        config['comet_ml']['experiment_name'] = experiment_name
        print(f"Running experiment: {experiment_name}")
        # run_experiment accepts the in-memory config dict directly (see above)
        run_experiment(config)
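Because run_evaluation only prints per-query results, ranking many prompt variations is easier with one aggregate number per experiment. Here is a small sketch; the helper name run_evaluation_with_summary is not part of the course code, and it simply reuses run_query and logs the averages back to the same Comet experiment:
from statistics import mean
from typing import Dict

def run_evaluation_with_summary(rag_experiment: RAGExperiment) -> Dict[str, float]:
    # Collect per-query metrics, then log their averages so each experiment
    # can be compared by a single latency / relevance figure in Comet.
    latencies, relevance_scores = [], []
    for query in rag_experiment.config['evaluation']['queries']:
        _, latency, _, relevance_score = rag_experiment.run_query(query)
        latencies.append(latency)
        relevance_scores.append(relevance_score)
    summary = {
        "avg_query_latency": mean(latencies),
        "avg_relevance_score": mean(relevance_scores),
    }
    for name, value in summary.items():
        rag_experiment.experiment.log_metric(name, value)
    return summary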