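Day 7: Implementing the Data Processing and Knowledge Base Module (Parsing Traditional Chinese Documents, Vectorizing Them, and Storing Them in ChromaDB)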

2025 iThome Ironman Contest

DAY 7

Generative AI

From RAG to Agentic RAG: Building a Local Intelligent Retrieval System in 30 Days, Part 7 of the series

17th Ironman Contest

seedfood

Team: 躺平的內捲小隊

2025-09-21 23:01:44


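Preface

Yesterday we installed all the packages we need. Today we actually run code: we process one file end to end and store it in a vector database.
Today's article is divided into the following parts:

- PDF Parsing
- Chunking
- Starting ChromaDB
- Storing the data processed in the first step into ChromaDB

Without further ado, let's get started!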

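📄 Combo: pdfplumber + img2table = handling PDF text and tables separately

When processing PDF documents, a single page often contains both plain text and tables.
If you read such a page directly with pdfplumber's page.extract_text(), the text inside the tables gets mixed in with the paragraph text.
This hurts retrieval results and also makes the later chunk splitting less accurate.

To solve this problem, we designed the following pipeline:

Detect table regions (bbox)
- Use pdfplumber's page.find_tables() to get the coordinates of each table.
- The later steps use these regions to tell plain text apart from table content.
Extract and rebuild the plain text
- Use page.extract_words() to get every word on the page together with its position.
- Check whether each word falls inside any table bbox, and drop it if it does.
- Sort the remaining words and merge them line by line to rebuild the paragraph text.
Extract tables and process them as images
- For each table bbox, crop the page to that region and render it as an image (page.crop(bbox).to_image()).
- Parse the cropped table image with the img2table package and Tesseract OCR.
- Save each table's image path, bbox, and parsed table data for later use.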
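
To make the filtering step concrete, here is a minimal sketch that runs on hand-written word boxes instead of a real PDF. The helper names (intersects, drop_table_words) and all coordinates are made up for illustration; only the (x0, top, x1, bottom) box layout follows what the steps above describe.

```python
def intersects(a, b):
    """Return True if boxes a and b, each (x0, top, x1, bottom), overlap."""
    ax0, atop, ax1, abottom = a
    bx0, btop, bx1, bbottom = b
    return not (ax1 <= bx0 or bx1 <= ax0 or abottom <= btop or bbottom <= atop)


def drop_table_words(words, table_bboxes):
    """Keep only the words whose box does not overlap any table bbox."""
    kept = []
    for w in words:
        box = (w["x0"], w["top"], w["x1"], w["bottom"])
        if not any(intersects(box, tb) for tb in table_bboxes):
            kept.append(w)
    return kept


# Synthetic data: two body-text words and two words that sit inside a table.
words = [
    {"text": "Revenue", "x0": 50, "top": 40, "x1": 110, "bottom": 52},
    {"text": "grew", "x0": 115, "top": 40, "x1": 145, "bottom": 52},
    {"text": "Q1", "x0": 60, "top": 120, "x1": 75, "bottom": 132},
    {"text": "100", "x0": 200, "top": 120, "x1": 225, "bottom": 132},
]
table_bboxes = [(40, 100, 400, 300)]

body_words = drop_table_words(words, table_bboxes)
body_text = " ".join(w["text"] for w in body_words)
print(body_text)  # Revenue grew
```

Any overlap with a table bbox removes the word, so a caption that touches the table border would also be dropped. Whether that is acceptable depends on your documents.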

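Output data structure

Each page's data contains:

{
    "page": page number,
    "text": rebuilt paragraph text,
    "tables": [
        {
            "bbox": table coordinates,
            "image_path": path to the cropped table image,
            "tables": parsed table data
        },
        ...
    ]
}

This structure can be used directly for chunking or for storing in a vector database.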
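
If you like type hints, the structure can be written down roughly as follows. This is only a sketch: the TableInfo and PageInfo names, the summarize helper, and the sample values are assumptions made for illustration and are not part of the original code.

```python
from typing import Any, List, Tuple, TypedDict


class TableInfo(TypedDict):
    bbox: Tuple[float, float, float, float]  # (x0, top, x1, bottom)
    image_path: str                          # cropped table image on disk
    tables: List[Any]                        # whatever the table parser returned


class PageInfo(TypedDict):
    page: int
    text: str
    tables: List[TableInfo]


def summarize(pages: List[PageInfo]) -> List[str]:
    """Return one human-readable line per page."""
    return [
        f"page {p['page']}: {len(p['text'])} chars, {len(p['tables'])} table region(s)"
        for p in pages
    ]


# Hand-written sample in the same shape as the per-page output.
sample_pages: List[PageInfo] = [
    {"page": 1, "text": "Revenue grew", "tables": []},
    {
        "page": 2,
        "text": "",
        "tables": [
            {
                "bbox": (40.0, 100.0, 400.0, 300.0),
                "image_path": "out_tables/page_2_table_0.png",
                "tables": [],
            }
        ],
    },
]

summary = summarize(sample_pages)
print("\n".join(summary))
```

A quick summary like this is a cheap sanity check before chunking: pages with zero characters and zero table regions usually deserve a second look.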

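Notes on multilingual OCR
- To process Traditional Chinese, the matching Tesseract trained data (language code chi_tra) must be installed on the system. On Debian/Ubuntu, for example, the package is:
```
tesseract-ocr-chi-tra
```
- Pass ocr_lang="eng+chi_tra" to the function to select the languages.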
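
Before running OCR it can be worth checking that every language in ocr_lang is actually installed. The sketch below parses the text printed by `tesseract --list-langs`. The helper names and the sample output string (including its tessdata path) are illustrative assumptions, and the exact header wording may differ between Tesseract versions.

```python
import shutil
import subprocess


def parse_langs(output: str) -> set:
    """Parse the text printed by `tesseract --list-langs` into a set of codes."""
    langs = set()
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the header ("List of available languages ...").
        if not line or line.lower().startswith("list of"):
            continue
        langs.add(line)
    return langs


def missing_langs(ocr_lang: str, installed: set) -> list:
    """Return the codes in an 'eng+chi_tra' style string that are not installed."""
    return [code for code in ocr_lang.split("+") if code not in installed]


def installed_langs() -> set:
    """Ask the local tesseract binary, if there is one, which languages it has."""
    exe = shutil.which("tesseract")
    if exe is None:
        return set()
    proc = subprocess.run([exe, "--list-langs"], capture_output=True, text=True)
    # Some versions print the list to stderr, so both streams are parsed.
    return parse_langs(proc.stdout + "\n" + proc.stderr)


# Illustrative output only; a real machine will list its own path and languages.
sample = 'List of available languages in "/usr/share/tesseract-ocr/5/tessdata/" (2):\neng\nosd\n'
print(missing_langs("eng+chi_tra", parse_langs(sample)))  # ['chi_tra']
```

On a real machine you would call missing_langs("eng+chi_tra", installed_langs()) and install whatever it reports before parsing the PDF.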

Here is working code you can use as a reference:

import os
import pdfplumber
from img2table.document import Image as ImgDoc
from img2table.ocr import TesseractOCR

# ---------- Helper functions ----------
def bbox_intersect(b1, b2):
    """
    Check whether two boxes intersect.
    Box format is (x0, top, x1, bottom) or (x0, y0, x1, y1); both boxes must use the same coordinate system.
    Returns True if they overlap.
    """
    x0, y0, x1, y1 = b1
    X0, Y0, X1, Y1 = b2
    # The boxes overlap unless one lies entirely beside, above, or below the other
    return not (x1 <= X0 or X1 <= x0 or y1 <= Y0 or Y1 <= y0)

def group_words_to_lines(words, line_tol=3):
    """
    Group words (dicts with x0, top, x1, bottom, text) into lines of text.
    line_tol is the tolerance on `top` for treating words as the same line (tune as needed).
    """
    if not words:
        return ""
    # Sort top to bottom first; each line is sorted left to right once it is complete.
    # (Sorting by (round(top), x0) can misorder words whose tops differ by a fraction.)
    words_sorted = sorted(words, key=lambda w: w["top"])
    lines = []  # each entry is the list of word dicts on one line
    cur_top = None
    for w in words_sorted:
        if cur_top is None or abs(w["top"] - cur_top) > line_tol:
            # Start a new line
            lines.append([w])
            cur_top = w["top"]
        else:
            lines[-1].append(w)
    return "\n".join(
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["x0"]))
        for line in lines
    )

# ---------- Main processing function ----------
def extract_text_and_tables(pdf_path, out_dir="output_tables", dpi=150, ocr_lang="eng+chi_tra"):
    """
    Process a PDF:
    - For each page: detect table bboxes (pdfplumber), then drop the words from
      page.extract_words() that fall inside a table and rebuild the plain text
    - For each table bbox: render the cropped region with page.crop(bbox).to_image()
      and hand the image to img2table for parsing
    Args:
      pdf_path: path to the PDF file
      out_dir: folder for the table images and results
      dpi: resolution passed to to_image (controls the output image resolution)
      ocr_lang: Tesseract OCR language setting (the matching Tesseract language data must be installed)
    Returns:
      pages_info: list of { "page": idx, "text": page_text, "tables": [ { "bbox": bbox, "image_path": ..., "tables": <img2table output>} ] }
    """
    os.makedirs(out_dir, exist_ok=True)
    pages_info = []
    ocr = TesseractOCR(lang=ocr_lang)

    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages, start=1):
            # 1) Get the page's words (with positions)
            words = page.extract_words()  # list of dicts with x0, x1, top, bottom, text

            # 2) Detect tables (pdfplumber's table finder)
            #    page.find_tables() returns Table objects, each with a .bbox attribute
            found_tables = page.find_tables()
            table_bboxes = [t.bbox for t in found_tables]  # bbox format (x0, top, x1, bottom)

            # 3) Drop the words that overlap any table bbox (keep the plain text)
            if table_bboxes:
                filtered_words = []
                for w in words:
                    wb = (w["x0"], w["top"], w["x1"], w["bottom"])
                    in_table = any(bbox_intersect(wb, tb) for tb in table_bboxes)
                    if not in_table:
                        filtered_words.append(w)
            else:
                filtered_words = words

            # 4) Rebuild lines/paragraphs from the remaining words (simple approach)
            page_text = group_words_to_lines(filtered_words)

            # 5) For each table bbox, save a cropped image and parse it with img2table
            page_tables_info = []
            for ti, bbox in enumerate(table_bboxes):
                # bbox is in PDF page coordinates (x0, top, x1, bottom).
                # Crop the page to the table region first, then render only that region.
                try:
                    pil_img = page.crop(bbox).to_image(resolution=dpi).original  # PIL Image
                except Exception:
                    # Fallback if cropping fails (e.g. bbox partly outside the page):
                    # render the whole page (not recommended)
                    pil_img = page.to_image(resolution=dpi).original
                img_path = os.path.join(out_dir, f"page_{i}_table_{ti}.png")
                pil_img.save(img_path)

                # Parse with img2table (returns the detected table structures)
                img_doc = ImgDoc(img_path)
                tables_detected = img_doc.extract_tables(ocr=ocr)

                page_tables_info.append({
                    "bbox": bbox,
                    "image_path": img_path,
                    "tables": tables_detected  # list of img2table table objects (each exposes a .df DataFrame)
                })

            pages_info.append({
                "page": i,
                "text": page_text,
                "tables": page_tables_info
            })

    return pages_info

Then call extract_text_and_tables directly:

pages = extract_text_and_tables("your_file.pdf", out_dir="out_tables", dpi=300, ocr_lang="eng+chi_tra")

Note: do not set dpi too low, or some characters will not be recognized.
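
To see why dpi matters, recall that PDF coordinates are measured in points (1/72 inch), so a rendered image has roughly points × dpi / 72 pixels per side. The sketch below works this out for an A4 page and for a 9 pt character; the rendered_size helper and the chosen sizes are illustrative, and the exact pixel size a renderer produces may differ by a pixel.

```python
POINTS_PER_INCH = 72  # PDF page units are points, 1/72 inch


def rendered_size(width_pt, height_pt, dpi):
    """Approximate pixel size of a page region rendered at the given dpi."""
    scale = dpi / POINTS_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)


A4 = (595, 842)  # A4 page size in points

size_150 = rendered_size(*A4, dpi=150)
size_300 = rendered_size(*A4, dpi=300)
print(size_150)  # (1240, 1754)
print(size_300)  # (2479, 3508)

# A 9 pt character is only about 19 px tall at 150 dpi, but about 38 px at 300 dpi.
char_px_150 = 9 * 150 / POINTS_PER_INCH
char_px_300 = 9 * 300 / POINTS_PER_INCH
print(char_px_150, char_px_300)  # 18.75 37.5
```

Dense Traditional Chinese characters have many strokes, so the extra pixels per character at 300 dpi give the OCR engine noticeably more to work with, at the cost of larger images and slower processing.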

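Chunking

After the previous step we have the PDF's body text and tables. For the text, remember that it has to be chunked before it goes into the vector database.
I usually use LangChain to help with this. The code is as follows: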

from langchain.text_splitter import RecursiveCharacterTextSplitter
# Note: newer LangChain releases provide this class in the langchain_text_splitters package

def convert_pages_to_documents(pages_info, source_name):
    docs = []
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)

    for page in pages_info:
        page_num = page["page"]

        # Plain text: split into chunks, one document per chunk
        if page["text"]:
            chunks = splitter.split_text(page["text"])
            for idx, chunk in enumerate(chunks):
                docs.append({
                    "id": f"{source_name}_p{page_num}_t{idx}",
                    "text": chunk,
                    "metadata": {"page": page_num, "source": source_name, "type": "text"}
                })

        # Tables: one document per parsed table, stored as plain text
        for ti, tinfo in enumerate(page["tables"]):
            tables = tinfo["tables"]
            for tj, table in enumerate(tables):
                try:
                    df = table.df if hasattr(table, "df") else None
                    if df is not None:
                        table_text = df.to_string(index=False)
                        docs.append({
                            "id": f"{source_name}_p{page_num}_tbl{ti}_{tj}",
                            "text": table_text,
                            "metadata": {"page": page_num, "source": source_name, "type": "table"}
                        })
                except Exception as e:
                    print(f"Skipping a table on page {page_num}, error: {e}")

    return docs

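That completes PDF parsing + chunking! There are many ways to do this, and I am only sharing the approach I currently use. For on-premises setups in particular, I have tested many tools and have not yet found a better one.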
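
To get a feel for what chunk_size and chunk_overlap mean, here is a deliberately simplified fixed-window chunker. It is not how RecursiveCharacterTextSplitter works internally (that splitter prefers to break on separators such as newlines, so its chunk boundaries will differ); the function name and the tiny sizes are chosen only to make the overlap visible.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows where neighbours share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


text = "".join(str(i % 10) for i in range(25))  # "0123456789012345678901234"
chunks = sliding_window_chunks(text, chunk_size=10, chunk_overlap=2)
print(chunks)  # ['0123456789', '8901234567', '678901234']
```

Each chunk starts with the last two characters of the previous one. With chunk_size=1000 and chunk_overlap=20, as in the code above, the same idea keeps a sentence that straddles a boundary at least partly visible in both chunks.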

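Starting ChromaDB and Customizing the Embedding

Because we are processing Traditional Chinese, as mentioned in an earlier article, we use BGE-M3, which is better suited to Chinese, as our embedding model. Note, however, that most vector databases do not use BGE-M3 as their built-in default, so you have to set it up yourself.
Taking ChromaDB as an example, we also need to wrap the model in a class that matches the embedding function format ChromaDB expects. Here is how: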

from sentence_transformers import SentenceTransformer

# Load BGE-M3
embedding_model = SentenceTransformer("BAAI/bge-m3")

# Wrap it as a Chroma-compatible embedding function
class ChromaEmbeddingFunction:
    def __init__(self, model):
        self.model = model

    def __call__(self, input):
        # Chroma passes a list of texts and expects a list of vectors back
        return self.model.encode(input).tolist()

BGE_embedding_fn = ChromaEmbeddingFunction(embedding_model)

Next, start ChromaDB.
A vector database uses a collection as its basic unit, so we need to create a collection for this project. Below we name it Test_RAG, and we also tell it to use the BGE_embedding_fn we just built as its embedding model.

import chromadb

# Initialize the ChromaDB client
# (chromadb.Client() keeps data in memory; chromadb.PersistentClient(path=...) stores it on disk)
client = chromadb.Client()

# Use the collection if it already exists, otherwise create it
try:
    collection = client.get_collection(name="Test_RAG", embedding_function=BGE_embedding_fn)
    print("Collection already exists, using it")
except Exception:
    collection = client.create_collection(name="Test_RAG", embedding_function=BGE_embedding_fn)
    print("Collection did not exist, created a new one")

Storing the Processed Results in the Vector Database

The next step is to put together everything we have prepared so far!

def store_documents_in_batches(collection, chunks, embedding_function, batch_size=100):
    """
    Store documents in ChromaDB in batches, using the custom embedding function.

    Args:
        collection: ChromaDB collection object
        chunks: list of document chunks
        embedding_function: custom embedding function
        batch_size: number of chunks per batch
    """
    total_chunks = len(chunks)
    successful_batches = 0
    failed_batches = []

    print(f"Processing {total_chunks} chunks, batch size: {batch_size}")
    print(f"Using custom embedding model: {type(embedding_function.model).__name__}")

    for i in range(0, total_chunks, batch_size):
        batch_no = i // batch_size + 1  # 1-based batch number, stable even after a failure
        batch_end = min(i + batch_size, total_chunks)
        batch_chunks = chunks[i:batch_end]

        try:
            # Extract the data for the current batch
            documents = [chunk['text'] for chunk in batch_chunks]
            metadatas = [chunk['metadata'] for chunk in batch_chunks]
            ids = [chunk['id'] for chunk in batch_chunks]

            # Generate embeddings with the custom embedding function
            print(f"Generating embeddings for batch {batch_no}...")
            embeddings = embedding_function(documents)

            # Store in ChromaDB
            collection.add(
                documents=documents,
                metadatas=metadatas,
                ids=ids,
                embeddings=embeddings  # pass the precomputed custom embeddings
            )

            successful_batches += 1
            print(f"✅ Batch {batch_no}: processed {batch_end}/{total_chunks} ({batch_end/total_chunks*100:.1f}%)")

            # Free memory
            del documents, metadatas, ids, embeddings

        except Exception as e:
            print(f"❌ Batch {batch_no} failed: {e}")
            failed_batches.append((i, batch_end, str(e)))
            continue

    print("\nDone:")
    print(f"Successful batches: {successful_batches}")
    print(f"Failed batches: {len(failed_batches)}")

    if failed_batches:
        print("Failed batch details:")
        for start, end, error in failed_batches:
            print(f"  Range {start}-{end}: {error}")

    return successful_batches, failed_batches

The code above runs in batches. Call it like this:

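# Assumption: TSMC_S_report_chunks_recursive is the chunk list returned by convert_pages_to_documents(...)
successful, failed = store_documents_in_batches(
    collection=collection,
    chunks=TSMC_S_report_chunks_recursive,
    embedding_function=BGE_embedding_fn,
    batch_size=10  # embeddings are generated per batch, so a smaller batch size is recommended
)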

Note: if your computer has no GPU, the embedding step will be very slow.
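
If the index arithmetic inside store_documents_in_batches is hard to follow, the slicing boils down to the small generator below. The batch_ranges name and the numbers are illustrative only; the function in the article computes the same start and end values inline.

```python
def batch_ranges(total, batch_size):
    """Yield (batch_no, start, end) for consecutive batches; end is exclusive."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total, batch_size):
        yield start // batch_size + 1, start, min(start + batch_size, total)


ranges = list(batch_ranges(total=25, batch_size=10))
print(ranges)  # [(1, 0, 10), (2, 10, 20), (3, 20, 25)]
```

The last batch is simply shorter than the others. A smaller batch_size means more calls to the embedding model but less memory per call, which is usually the safer trade-off on a CPU-only machine.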

At this point the data has been stored!
You can test retrieval like this:

query = "Enter your question here"
query_vec = embedding_model.encode(query).tolist()  # convert the NumPy vector to a plain list
results = collection.query(
    query_embeddings=[query_vec],
    n_results=3
)

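Here is how the code above works:

- Take the query as input.
- Convert the query into a vector with embedding_model. (We set embedding_model to BGE-M3 earlier. Note that the embedding model used here must be the same one used to build the vector database; otherwise retrieval will not return meaningful results!)
- Call ChromaDB's collection.query on the collection we just populated to run a vector similarity search and return the 3 most similar entries.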
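
Conceptually, the similarity search in the last step ranks stored vectors by how close they are to the query vector. The brute-force sketch below uses cosine similarity on made-up 3-dimensional vectors. It only illustrates the idea: ChromaDB uses its own index, its distance metric depends on how the collection is configured, and real BGE-M3 vectors have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k):
    """stored maps id -> vector; return the k ids most similar to query_vec."""
    ranked = sorted(
        stored,
        key=lambda doc_id: cosine_similarity(query_vec, stored[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Toy "database" of four vectors.
stored = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
    "doc_d": [0.0, 0.0, 1.0],
}
query_vec = [1.0, 0.2, 0.0]

hits = top_k(query_vec, stored, k=3)
print(hits)  # ['doc_b', 'doc_a', 'doc_c']
```

This also shows why the query and the stored documents must be embedded by the same model: the comparison only makes sense when both vectors live in the same space.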

With that, the R in RAG, Retrieval, is done!
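
To inspect what came back, note that collection.query returns a dict whose "ids", "documents", "metadatas", and "distances" entries are lists of lists, with one inner list per query. The sketch below formats the hits for a single query. The format_hits helper and the mock_results values are hand-written stand-ins for illustration, not real query output.

```python
def format_hits(results):
    """Flatten the hits of the first (here, only) query into readable rows."""
    rows = []
    ids = results["ids"][0]
    docs = results["documents"][0]
    metas = results["metadatas"][0]
    dists = results["distances"][0]
    for doc_id, doc, meta, dist in zip(ids, docs, metas, dists):
        rows.append(
            f"{doc_id} (page {meta['page']}, {meta['type']}, distance {dist:.3f}): {doc[:40]}"
        )
    return rows


# Hand-written stand-in with the same shape as a query result for one query.
mock_results = {
    "ids": [["report_p3_t0", "report_p5_tbl0_0"]],
    "documents": [["Revenue grew in the first quarter.", "Quarter Revenue\nQ1 100"]],
    "metadatas": [[
        {"page": 3, "source": "report", "type": "text"},
        {"page": 5, "source": "report", "type": "table"},
    ]],
    "distances": [[0.412, 0.587]],
}

rows = format_hits(mock_results)
print(rows[0])  # report_p3_t0 (page 3, text, distance 0.412): Revenue grew in the first quarter.
```

Smaller distances mean closer matches, and the page and type metadata stored during chunking make it easy to trace each hit back to the original PDF.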