(RAG 1-6) Hands-On: Build Your First Enterprise RAG System in 30 Minutes

2025 iThome Ironman Contest

DAY 6

Generative AI

Part 6 of the 30天RAG一點通 series

17th Ironman Contest

dallen12151830

2025-08-30 14:08:43


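Core Concepts

Develop and integrate the core components of a RAG system, from loading, chunking, and embedding the data through vector storage to the question-answering logic, to build a complete RAG pipeline.

What You Will Learn

Environment setup (Python + LLM API + vector database)

Document processing (PDF → Text → Chunk)

Vectorization and storage (Embeddings + FAISS)

Query and question answering (Retriever + LLM)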

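Cost Advantages

All models and libraries used in this tutorial are free and open source; the only paid item is the GPU runtime:

Embedding model: BGE-large-zh-v1.5 (optimized for Chinese)
Language model: Mistral-7B-Instruct (open-sourced by Mistral AI under a commercially friendly license)
Vector database: FAISS (open-sourced by Meta; runs locally with no API fees)
Runtime: Google Colab Pro (about $10/month, with A100/V100 GPUs)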

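Task Scenario

Business background: An insurance company wants to build an intelligent customer-service system that helps customers quickly look up the claim conditions, application procedure, and coverage of its overseas travel inconvenience insurance (海外旅行不便險).
Technical challenges:

Policy wording is specialized, so answers must cite article numbers and content precisely
Customer questions vary widely, from the claims process to the scope of coverage
Answers must meet financial-industry compliance requirements and must not offer information beyond the policy terms
The system must trace each answer back to its source so that it can be verified

Expected outcome: A RAG system that accurately answers questions about the policy terms, with consistently formatted answers and explicit citations.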
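
The traceability requirement can be made concrete with a small post-check on the generated answer. The sketch below is not part of the original notebook: `unsupported_citations` is a hypothetical helper, and the sample answers and article numbers are made up for illustration. It extracts every article number an answer cites and reports any that were not among the retrieved chunks.

```python
import re

# Matches article numbers such as "第四十三條" or "第43條".
ARTICLE_RE = re.compile(r"第[一二三四五六七八九十百\d]+條")

def unsupported_citations(answer: str, retrieved_article_nos: set[str]) -> list[str]:
    """Return cited article numbers absent from the retrieved set (hypothetical helper)."""
    cited = set(ARTICLE_RE.findall(answer))
    return sorted(cited - retrieved_article_nos)

# Made-up example data, for illustration only.
retrieved = {"第四十二條", "第四十三條"}
ok_answer = "根據第四十三條，被保險人應檢具理賠申請書。"
bad_answer = "根據第九十九條，本公司應給付保險金。"

print(unsupported_citations(ok_answer, retrieved))   # []
print(unsupported_citations(bad_answer, retrieved))  # ['第九十九條']
```

A non-empty result does not always indicate a hallucination. A faithfully quoted clause may itself refer to another article, as the answer in test case 6.1 does when it mentions 第三十九條, so the result is better treated as a flag for review than as a hard failure.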

Code Examples

Environment Setup and Dependency Installation

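# Log in to Hugging Face with your personal access token.
# Read the token from the environment rather than hardcoding it in the notebook.
import os
from huggingface_hub import login
login(token=os.environ["HF_TOKEN"])  # assumes HF_TOKEN is set in the environment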

# Install packages
!pip install -q langchain-community langchain faiss-cpu sentence-transformers pypdf transformers accelerate # -q for quiet

# Load and split PDF document
import re
from langchain.document_loaders import PyPDFLoader
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

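Document Processing
We use pypdf (through LangChain's PyPDFLoader) to convert the PDF to text. In the code shown here, chunking is done per policy article with a regular expression; RecursiveCharacterTextSplitter is imported but not called.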

import re
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Step 1: Load text in PDF
loader = PyPDFLoader("海外旅行不便險條款.pdf")
documents = loader.load()

def fix_broken_lines(text: str) -> str:
    """Rejoin lines that PDF extraction broke in the middle of a sentence."""
    fixed_lines = []

    for line in text.split('\n'):
        current = line.strip()
        if not current:
            continue  # skip blank lines

        if fixed_lines:  # only look back when a previous line exists
            prev = fixed_lines[-1]

            # If the previous line is an article heading, keep the line break
            if re.match(r'^第[一二三四五六七八九十百\d]+條', prev):
                fixed_lines.append(current)
                continue

            # If the previous line is not a complete sentence and this line does
            # not start a numbered list item, merge it into the previous line
            if not prev.endswith(('。', '！', '？', '：', '；')) and \
               not re.match(r'^[一二三四五六七八九十\d]+\s*[、.)]', current):
                fixed_lines[-1] += current
                continue

        fixed_lines.append(current)

    return '\n'.join(fixed_lines)

def merge_and_split_articles(documents):
    """
    Merge all pages from a PDF into a single text string,
    clean out unwanted page numbers, and then split the content
    into structured insurance article chunks.

    Args:
        documents (List[Document]): List of LangChain Document objects loaded from a PDF.

    Returns:
        List[Document]: A list of Documents, each representing a full or partial article chunk.
    """

    full_text = ""
    body_list = []
    # Step 1: Merge all pages and remove page numbers
    for doc in documents:
        page = doc.page_content
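        # Remove lines that contain only digits, such as a page number ("37\n");
        # with re.MULTILINE this applies to every such line, not only the first
        cleaned_page = re.sub(r'^\s*\d+\s*\n', '', page, flags=re.MULTILINE)
        # Append the cleaned page to the full text
        full_text += cleaned_page + "\n"

    # Locate the original short-term rate table (its extracted layout is unclear,
    # so match from "附表 ... 短期費率表" through the closing rounding note),
    # then swap in a hand-cleaned version and remove the original from full_text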
    rate_table_pattern = r"(附表[\s\S]*?短期費率表[\s\S]+?其餘[^\n]*?四捨五入。)"
    appendix_match = re.search(rate_table_pattern, full_text)
    appendix_text = None
    if appendix_match:
        appendix_text = """附表 短期費率表

        天數與對應之費率係數如下：

        - 天數 1 日：0.086。
        - 天數 3 日：0.101。
        - 天數 5 日：0.162。
        - 天數 7 日：0.187。
        - 天數 14 日：0.255。
        - 天數 21 日：0.319。
        - 天數 31 日：0.385。
        - 天數 45 日：0.483。
        - 天數 60 日：0.591。
        - 天數 90 日：0.731。
        - 天數 120 日：0.856。
        - 天數 150 日：0.941。
        - 天數 180 日：1.000。

        註：其餘未列出的天數，其費率係數依相鄰兩個天數進行線性內插計算，取至小數點後第三位，並四捨五入。"""

        # Remove the original appendix from the full text
        full_text = full_text.replace(appendix_match.group(), "")

    # Step 2: Extract articles from remaining full_text
    pattern = r'((第[一二三四五六七八九十百\d]+條)|附表)\s+([^\n]+)\n?([\s\S]*?)(?=(?:第[一二三四五六七八九十百\d]+條\s+[^\n]+\n|附表\s+[^\n]+\n)|\Z)'
    matches = re.findall(pattern, full_text)

    split_docs = []

    # Step 3: Build one Document per article
    for idx, (article_no, _, title, body) in enumerate(matches):
        full_title = f"{article_no} {title}".strip()
        body = body.strip()
        body = fix_broken_lines(body)
        body_list.append(len(body))

        # Keep the article number as its own metadata field
        article_number_only = article_no.strip() if article_no else ""

        print(full_title)
        split_docs.append(Document(
            page_content=f"{full_title}\n{body}",
            metadata={
                "article": full_title,
                "article_no": article_number_only,
                "title": title.strip(),
                "chunk_type": "article_full"
            }
        ))

    # Step 4: Add the appendix as a standalone chunk
    if appendix_text:
        split_docs.append(Document(
            page_content=appendix_text.strip(),
            metadata={
                "article": "附表 短期費率表",
                "article_no": "附表",
                "title": "短期費率表",
                "sub_chunk": 0,
                "chunk_type": "appendix"
            }
        ))
    print("body_list =", body_list)
    return split_docs
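
To see what the Step 2 pattern captures, it can be run on a tiny synthetic text. The two articles below are invented for illustration and are not taken from the policy PDF; the pattern itself is copied from `merge_and_split_articles`.

```python
import re

# Same pattern as in merge_and_split_articles: a heading line ("第N條 title" or
# "附表 title"), then lazily everything up to the next heading or the end of text.
pattern = (
    r'((第[一二三四五六七八九十百\d]+條)|附表)\s+([^\n]+)\n?([\s\S]*?)'
    r'(?=(?:第[一二三四五六七八九十百\d]+條\s+[^\n]+\n|附表\s+[^\n]+\n)|\Z)'
)

# Invented sample text, for illustration only.
sample = (
    "第一條 範例條款甲\n"
    "這是第一條的內文。\n"
    "第二條 範例條款乙\n"
    "這是第二條的內文，並提及第一條第一項。\n"
)

matches = re.findall(pattern, sample)
for article_no, _, title, body in matches:
    # Each match is (heading, heading-if-article, title, body).
    print(article_no, "|", title, "|", body.strip())
```

The in-body reference 第一條第一項 does not start a new chunk, because a heading must be followed by whitespace and a title on the same line. A cross-reference that happens to be followed by a space would be split, so the output is worth checking against the source PDF.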

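Embeddings and Vector Store

Here we use the BGE Chinese embedding model with FAISS to simulate the local vector storage commonly used in enterprises.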


from langchain.chains import RetrievalQA
from langchain.llms import HuggingFacePipeline
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch

# Step 5: Convert text chunks into embeddings and build a FAISS vector store
# embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
# embedding_model = HuggingFaceEmbeddings(model_name="intfloat/multilingual-e5-large")
embedding_model = HuggingFaceEmbeddings(
    model_name="BAAI/bge-large-zh-v1.5",
    model_kwargs={"device": "cuda" if torch.cuda.is_available() else "cpu"},
    encode_kwargs={"normalize_embeddings": True}
)

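# Split the loaded pages into per-article chunks, then index them in FAISS
docs = merge_and_split_articles(documents)
db = FAISS.from_documents(docs, embedding_model)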



# Choose an open-source instruction-tuned model (Mistral-7B Instruct)
model_id = "mistralai/Mistral-7B-Instruct-v0.1"
# model_id = "HuggingFaceH4/zephyr-7b-beta"

# Load tokenizer and model with automatic device placement (GPU/CPU)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.float16)

# Step 6.1: Set up text-generation pipeline
pipe = pipeline("text-generation",
                model=model,
                tokenizer=tokenizer,
                max_new_tokens=512,
                do_sample=True,
                temperature=0.1,
                top_p=0.9,
                return_full_text=False,
                pad_token_id=tokenizer.eos_token_id
                )
llm = HuggingFacePipeline(pipeline=pipe)
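
Loading the model with `torch_dtype=torch.float16` roughly halves the weight memory compared with float32, which matters when choosing a GPU. The back-of-the-envelope estimate below assumes about 7.24 billion parameters for Mistral-7B (an approximate figure) and counts weights only; activations and the KV cache need additional memory.

```python
# Rough weight-memory estimate; the parameter count is an approximation.
N_PARAMS = 7.24e9  # assumed parameter count for Mistral-7B

def weight_memory_gib(n_params: float, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in GiB."""
    return n_params * bytes_per_param / 2**30

fp16_gib = weight_memory_gib(N_PARAMS, 2)  # float16: 2 bytes per parameter
fp32_gib = weight_memory_gib(N_PARAMS, 4)  # float32: 4 bytes per parameter
print(f"float16: {fp16_gib:.1f} GiB, float32: {fp32_gib:.1f} GiB")
```

Under these assumptions the float16 weights alone take roughly 13.5 GiB, so a 16 GB card leaves little headroom for generation.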

Custom Prompt Template

# Step 6.2: Define a custom prompt template in Traditional Chinese
custom_prompt = PromptTemplate.from_template("""
你是一位旅行保險條款專家，請你根據下列 context 條款內容，簡潔且準確地回答使用者提出的單一問題。

請嚴格遵守以下規則：

回答原則：
- 僅針對最後一行的「問題」作答，**不得自行添加或延伸問題**
- 僅可根據 context 中提供的條款內容回答，**不得推論條文以外的資訊**
- 若 context 條文屬於「不保事項」條款，請明確指出該情況「不在理賠範圍內」
- 若無足夠資訊可回答，請回覆：「我不知道。」
- 作答格式規定如下：

引用格式要求：
- 若僅需引用一條，請先標明條號，再用清楚語句摘要其內容
- 若需引用多條，使用條列方式逐條說明，每條皆須：
  - 以「根據第Ｘ條，⋯⋯」開頭
  - 摘要該條實際條文之重點內容
- **嚴禁將多條條號合併成一句話**（例如「根據第X、Y、Z條」）

條號與內容對應規範：
- 每一條說明內容必須與開頭標示的條號一一對應，**不得錯引其他條文的內容**
- 嚴禁出現「條號正確但摘要內容屬於他條」的情況
- 嚴禁引用與問題主題無關的條款（例如：旅遊延誤問題引用行李延誤條文）

語言與結尾：
- 回答請使用繁體中文，表達清楚明確，語氣中立且專業
- 回答完成後**請立即結束，不要產生額外內容**

你可以參考的條號有：{article_list}。

<context>
{context}
</context>

請回答以下問題：
問題：{question}
答案：
""")


# Step 6.3: Set up retriever to fetch the top-5 most relevant chunks from FAISS
retriever = db.as_retriever(search_kwargs={"k": 5})

qa_chain = LLMChain(llm=llm, prompt=custom_prompt)

RAG Question-Answering Function

def rag_answer(query: str):
    # query_for_embedding = "为这个句子生成表示以用于检索相关文章：" + query

    docs = retriever.get_relevant_documents(query)

    context = "\n\n".join(doc.page_content for doc in docs)

    article_list = "、".join(sorted({doc.metadata.get("article_no") for doc in docs if doc.metadata.get("article_no")})) or "無明確條號"
    print("Context: ")
    print(context)
    print("article_list =", article_list)
    response = qa_chain.run({
        "context": context,
        "question": query,
        "article_list": article_list
    })

    return response

def run_rag_answer(query: str):
  import re

  def fix_duplicate_list_items(text: str):
    lines = text.strip().split("\n")
    seen = set()
    output = []
    pattern = re.compile(r"^(\d+)[\.\、]?\s*(.+)")

    for line in lines:
        match = pattern.match(line)
        if match:
            key = match.group(2).strip()
            if key not in seen:
                seen.add(key)
                output.append(line)
        else:
            output.append(line)

    return "\n".join(output)
  answer = rag_answer(query)
  # query_for_embedding = "为这个句子生成表示以用于检索相关文章：" + query
  docs_and_scores = db.similarity_search_with_score(query, k=5)

  print("Top 5 Retrieved Chunks Lengths:")
  for i, (doc, score) in enumerate(docs_and_scores):
      length = len(doc.page_content)
      article = doc.metadata.get("article", "N/A")
      print(f"Chunk {i+1}: Length = {length}, Score = {score:.4f}, Article = {article}")
      # print(doc.metadata.get("article", ""))
      # print(doc.page_content[:300])
  print("Q:", query)
  print("A:", answer)
  def check_output_and_prompt_continue(response_text: str, max_tokens=512):
      tokens = tokenizer.encode(response_text, add_special_tokens=False)
      token_count = len(tokens)
      if token_count >= max_tokens:
          print(f"⚠️ 回覆已達上限字數（{token_count} tokens）。若需繼續閱讀，輸入「是」或是「繼續」。")
      return response_text, token_count >= max_tokens

  def continue_response_with_prompt(last_response: str, query: str, context: str, article_list: str):

      continuation_prompt = PromptTemplate.from_template("""
你是一位旅行保險條款專家，請你根據下列 context 條款內容，簡潔且準確地**接續回答使用者的問題**。

請嚴格遵守以下規則：
- 續答內容需與原始回答格式一致，**從尚未完成的列點繼續往下編號**
- 僅可根據 context 中提供的條款內容回答，**不得推論條文以外的資訊**
- **不要重複前面已列出的內容**
- 條文引用需正確對應條號與條文內容
- 請使用繁體中文，語氣中立且專業，回答結束後**不要產生額外內容**

你可以參考的條號有：{article_list}。

<context>
{context}
</context>

以下是上一輪的回答內容：
{last_response}

請接續回答原始問題：{question}
答案：
""")

      rendered_prompt = continuation_prompt.format(
        last_response=last_response,
        question=query,
        context=context,
        article_list=article_list
      )
      result = llm(rendered_prompt)
      return result


  checked_response, is_cutoff = check_output_and_prompt_continue(answer)
  if is_cutoff:
      user_input = input("💬 是否繼續回答來接續內容？ \n> ")
      if "繼續" in user_input or "是" in user_input:
          # Rebuild context and article_list for the continuation prompt
          docs = retriever.get_relevant_documents(query)
          context = "\n\n".join(doc.page_content for doc in docs)
          article_list = "、".join(sorted({doc.metadata.get("article_no") for doc in docs if doc.metadata.get("article_no")})) or "無明確條號"
          continuation = continue_response_with_prompt(checked_response, query, context, article_list)

          print("\n📩 後續回答：\n", fix_duplicate_list_items(continuation))
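
The behaviour of the duplicate filter is easiest to see on a small input. The sketch below restates `fix_duplicate_list_items` as a standalone function so that it runs by itself; the sample text is made up for illustration.

```python
import re

# An optional Arabic-numeral list marker ("1.", "2、"), then the item text.
ITEM_RE = re.compile(r"^(\d+)[.、]?\s*(.+)")

def fix_duplicate_list_items(text: str) -> str:
    """Drop numbered list items whose text already appeared earlier."""
    seen = set()
    output = []
    for line in text.strip().split("\n"):
        match = ITEM_RE.match(line)
        if match:
            key = match.group(2).strip()
            if key in seen:
                continue  # duplicate item text: skip this line
            seen.add(key)
        output.append(line)  # first occurrences and non-list lines are kept
    return "\n".join(output)

# Made-up sample: item 3 repeats item 1; the last line uses a Chinese numeral.
sample = "1. 理賠申請書\n2. 報案證明\n3. 理賠申請書\n一、理賠申請書"
cleaned = fix_duplicate_list_items(sample)
print(cleaned)
```

Two limits are visible here: the surviving items are not renumbered, and list items that start with Chinese numerals (一、二、), the style the model uses in test case 6.1, are not recognized by `\d+` and therefore never deduplicated.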

Test Cases
6.1: Baggage Loss Claim Procedure

query = "行李遺失後應該如何申請理賠？"
run_rag_answer(query)

Output:

article_list = 第三十八條、第四十三條、第四十二條、第四十四條、第四十條
Top 5 Retrieved Chunks Lengths:
Chunk 1: Length = 126, Score = 0.4401, Article = 第四十三條 行李損失保險理賠文件
Chunk 2: Length = 76, Score = 0.5496, Article = 第三十八條 行李延誤保險理賠文件
Chunk 3: Length = 133, Score = 0.5753, Article = 第四十二條 行李損失保險事故發生時之處理
Chunk 4: Length = 84, Score = 0.6022, Article = 第四十四條 追回處理
Chunk 5: Length = 354, Score = 0.6126, Article = 第四十條 行李損失保險特別不保事項（物品）
Q: 行李遺失後應該如何申請理賠？
A: 根據第四十三條，被保險人應檢具下列文件：
一、理賠申請書。
二、因第三十九條第一項第一款所列事故申請理賠者：向警方報案證明。
三、因第三十九條第一項第二款所列事故申請理賠者：公共交通工具業者所開立之事故與損失證明。
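
The `article_list` line in this output is not in numeric order because `sorted` compares the strings by Unicode code point (for example, 三 sorts before 二). If numeric order matters for the prompt, a sort key along the following lines could be used. This is a sketch: `chinese_to_int` and `article_sort_key` are hypothetical helpers that are not in the original code, and the parser only handles numbers below 1,000.

```python
import re

DIGITS = {"零": 0, "一": 1, "二": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
UNITS = {"十": 10, "百": 100}

def chinese_to_int(numeral: str) -> int:
    """Convert a Chinese numeral below 1,000 (e.g. "四十三") to an int."""
    total, current = 0, 0
    for ch in numeral:
        if ch in DIGITS:
            current = DIGITS[ch]
        elif ch in UNITS:
            # A bare unit such as "十" means 1 x 10.
            total += (current or 1) * UNITS[ch]
            current = 0
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return total + current

def article_sort_key(article_no: str) -> float:
    """Numeric sort key for "第N條"; anything else (e.g. "附表") sorts last."""
    m = re.fullmatch(r"第(.+)條", article_no)
    if not m:
        return float("inf")
    n = m.group(1)
    if n.isascii() and n.isdigit():
        return int(n)
    return chinese_to_int(n)

# Article numbers copied from the article_list line above.
article_nos = ["第三十八條", "第四十三條", "第四十二條", "第四十四條", "第四十條"]
ordered = sorted(article_nos, key=article_sort_key)
print("、".join(ordered))
```

In `rag_answer`, this would amount to passing `key=article_sort_key` to the existing `sorted` call.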

6.2: Flight Delay Compensation Conditions

query = "什麼情況下可以申請班機延誤賠償？"
run_rag_answer(query)

Output:

article_list = 第三十一條、第三十七條、第三十二條、第三十八條、第三十條
Top 5 Retrieved Chunks Lengths:
Chunk 1: Length = 92, Score = 0.5009, Article = 第三十二條 班機延誤保險理賠文件
Chunk 2: Length = 229, Score = 0.5605, Article = 第三十一條 班機延誤保險特別不保事項
Chunk 3: Length = 355, Score = 0.6023, Article = 第三十條 班機延誤保險承保範圍
Chunk 4: Length = 76, Score = 0.6637, Article = 第三十八條 行李延誤保險理賠文件
Chunk 5: Length = 119, Score = 0.7090, Article = 第三十七條 行李延誤保險特別不保事項
Q: 什麼情況下可以申請班機延誤賠償？
A: 根據第三十條，被保險人於本保險契約保險期間內，以乘客身分所搭乘之定期航班較預定出發時間延誤四小時以上者，本公司依本保險契約約定之保險金額給付保險金。
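
The Score column in these outputs is easier to read with one identity in mind. Assuming the FAISS store was built with LangChain's default settings (an L2 index, which reports squared Euclidean distance), lower scores mean closer matches. Because the embeddings are normalized (`normalize_embeddings=True`), the score relates to cosine similarity as: squared distance = 2 - 2 x cosine similarity. The sketch below checks that identity on toy vectors and converts two of the scores above; the helper names are illustrative only.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    # For unit vectors the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

def score_to_cosine(score: float) -> float:
    """Convert a squared-L2 score between unit vectors to cosine similarity."""
    return 1.0 - score / 2.0

# Arbitrary toy vectors, used only to check the identity numerically.
a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 1.0, 0.5])
assert math.isclose(squared_l2(a, b), 2 - 2 * cosine(a, b))

# Scores copied from the first and last chunks of test case 6.2.
for score in (0.5009, 0.7090):
    print(f"score {score:.4f} -> cosine ~ {score_to_cosine(score):.4f}")
```

Read this way, the top chunk in each test case has a cosine similarity of roughly 0.75 to 0.78, and the fifth chunk is noticeably further from the query.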