[DAY 22] AI Literature Processing in Practice (1): Building Your Automated Pipeline with PyPDF2, Ollama, and Chromadb

2024 iThome Ironman Contest (16th edition)

Self-Challenge Track

30 Days of Programming Study Notes: My Self-Taught Growth Journey, Part 22 of the series

lafeeleaf

2024-09-22 00:01:38


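The previous post introduced the basic principles of large language models (LLMs), Ollama, and retrieval-augmented generation (RAG), and how they apply to automated literature processing. This post shows concretely how to use these technologies to generate document summaries and questions automatically. It walks step by step through turning a PDF into a Markdown file that contains the document's summary and key questions.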

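1. Architecture Overview

First, our goals are:

Input: PDF files in a local folder
Process: analyze the document content with a locally deployed LLM (for example, one served by Ollama) and RAG, generating summaries and questions
Output: English and Chinese summaries plus key questions and answers, saved in Markdown format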
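
The input and output ends of this pipeline can be sketched with the standard library alone. This is a minimal sketch, not the author's code: the helper names list_pdfs and output_path_for are hypothetical, and it assumes the PDFs sit directly inside one folder (no subfolders).

```python
from pathlib import Path

def list_pdfs(folder):
    """Return the PDF files directly inside `folder`, sorted by path."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )

def output_path_for(pdf_path, out_dir):
    """Map a PDF path such as papers/a.pdf to <out_dir>/a.md."""
    return Path(out_dir) / (Path(pdf_path).stem + ".md")
```

Each path returned by list_pdfs could then be handed to the processing steps described in the following sections, with output_path_for deciding where the Markdown result is written.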

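2. Key Technical Modules

(1) PyPDF2: Extracting PDF Content

To process a PDF document, we first need to read its content. We use PyPDF2 for this part. PyPDF2 can extract the text of every page in a PDF, provided the file has a text layer; scanned, image-only pages would need OCR instead.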

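from PyPDF2 import PdfReader

def read_pdf(file_path):
    """Read a PDF file and extract its text."""
    with open(file_path, 'rb') as file:
        reader = PdfReader(file)
        # Guard against pages that yield no extractable text
        pages = [page.extract_text() or "" for page in reader.pages]
    # Join pages with a newline so words at page boundaries do not run together
    return "\n".join(pages)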

This function reads a PDF and returns its text content, which is the basis for the summary generation that follows.
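
Text extracted from PDFs often carries stray line breaks and words hyphenated across lines, so a cleanup pass before embedding can help. The following is a minimal sketch under that assumption; clean_extracted_text is a hypothetical helper, not part of PyPDF2, and the hyphen rule is a heuristic that will also join genuine compounds that happen to break at a line end.

```python
import re

def clean_extracted_text(text):
    """Tidy raw text pulled from a PDF before embedding it."""
    # Rejoin words hyphenated across a line break: "implemen-\ntation" -> "implementation"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Treat blank lines as paragraph breaks
    paragraphs = re.split(r"\n\s*\n", text)
    # Collapse all remaining whitespace runs inside each paragraph
    cleaned = [re.sub(r"\s+", " ", p).strip() for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)
```

It would be applied to the string returned by read_pdf, for example clean_extracted_text(read_pdf(path)).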

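(2) Retrieval-Augmented Generation (RAG): Combining Retrieval with Generative AI

The core of RAG is to retrieve first and generate second. We use the Chroma vector database to store and retrieve embeddings of the documents, so that the generated answers are more accurate and more relevant.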

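import chromadb
import ollama

# Assumption: any embedding model already pulled into Ollama; this name is only an example
embedding_model = "nomic-embed-text"

# Create an in-memory vector database
client = chromadb.Client()
collection = client.create_collection(name="docs")

def add_document_to_collection(file_path):
    """Embed a PDF's text and add it to the vector database."""
    text = read_pdf(file_path)
    response = ollama.embeddings(model=embedding_model, prompt=text)
    embedding = response["embedding"]
    # Use the file path as the ID so each document gets its own entry
    collection.add(ids=[file_path], embeddings=[embedding], documents=[text])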

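This step converts each document's content into a vector and stores it in the database. At query time, the most relevant stored text is retrieved first, and the model then generates its answer from it. Note that the code above embeds each document as a single entry, so retrieval works at the document level; passage-level retrieval would require splitting the text into chunks before embedding.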
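
To retrieve passages rather than whole documents, the text can be split into overlapping chunks before it is embedded. This is a minimal sketch of fixed-size character chunking, one common approach and not something the article's code does; chunk_text, chunk_ids, and the default sizes are all assumptions chosen for illustration.

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into character windows that overlap by `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once this window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

def chunk_ids(file_path, n_chunks):
    """Build one unique ID per chunk, derived from the source file path."""
    return [f"{file_path}#chunk{i}" for i in range(n_chunks)]
```

Each chunk would then be embedded separately and stored under its own ID, so that a query returns only the passages closest to it instead of the full document.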

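(3) Ollama: Generating Summaries with a Local LLM

We use a model available through Ollama to generate the summaries. Along the way, we can produce a separate summary for each language (for example, English and Chinese).

Download and install Ollama first, then pick a suitable model and pull it to your local machine with the ollama pull command.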

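import ollama

# Assumption: any chat model already pulled into Ollama; this name is only an example
language_model = "llama3"

def rag_process(query):
    """Retrieve the most relevant document and have the LLM answer from it."""
    # Embed the prompt (embedding_model must match the model used when documents were added)
    response = ollama.embeddings(prompt=query, model=embedding_model)

    # Retrieve the closest stored document from the collection
    results = collection.query(
        query_embeddings=[response["embedding"]],
        n_results=1
    )
    data = results['documents'][0][0]

    # Generate the LLM response from the retrieved text
    output = ollama.generate(
        model=language_model,
        prompt=f"Using this data: {data}. Respond to this prompt: {query}"
    )
    return output['response']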

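The rag_process function embeds the query with Ollama, retrieves the related document content, and generates a response from it. For example, we can ask the model to "Summarize the document." to get an English summary, or "用繁體中文概括這篇文檔。" ("Summarize this document in Traditional Chinese.") to get a Chinese one.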
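
The Markdown step later in this article expects a dictionary with keys such as summary_en and questions_zh. One possible way to build it from these prompts is sketched below. This is not the author's implementation: build_summary_and_questions is a hypothetical name, the question prompts are assumptions, and ask stands for any function that takes a prompt string and returns the model's reply (rag_process would fit that shape).

```python
def split_lines(reply):
    """Turn a multi-line model reply into a list of non-empty lines."""
    return [line.strip() for line in reply.splitlines() if line.strip()]

def build_summary_and_questions(ask, n_questions=3):
    """Collect summaries, questions, and answers by calling ask(prompt) -> str."""
    result = {
        "summary_en": ask("Summarize the document.").strip(),
        "summary_zh": ask("用繁體中文概括這篇文檔。").strip(),
    }
    # Assumed prompts asking the model for one question per line
    question_prompts = {
        "en": f"List {n_questions} key questions about the document, one per line.",
        "zh": f"用繁體中文列出 {n_questions} 個關於這篇文檔的關鍵問題，每行一個。",
    }
    for lang, prompt in question_prompts.items():
        questions = split_lines(ask(prompt))[:n_questions]
        result[f"questions_{lang}"] = questions
        # Ask each generated question back to the model to get its answer
        result[f"answers_{lang}"] = [ask(q).strip() for q in questions]
    return result
```

Passing the model call in as a parameter keeps the function testable with a stub. Real model replies may include numbering or extra commentary, so the line-based parsing here is only a starting point.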

(4) Markdown Output

Finally, we organize all the generated summaries and questions into Markdown so they are easy to share and reuse later.

import os

def process_single_document(file_path):
    """Process a single PDF file and generate a Markdown file."""
    # Add the PDF to the vector database
    add_document_to_collection(file_path)

    # Generate the summaries, questions, and answers.
    # generate_summary_and_questions() is not defined in this article; it is
    # expected to return a dict with the keys used below.
    llm_response = generate_summary_and_questions()

    # Assemble the Markdown content (the Chinese headings label the Chinese sections)
    markdown_output = f"# {os.path.basename(file_path)}\n\n"
    markdown_output += "## Summary (English)\n\n"
    markdown_output += llm_response['summary_en'] + "\n\n"
    markdown_output += "## 摘要 (中文)\n\n"
    markdown_output += llm_response['summary_zh'] + "\n\n"
    markdown_output += "## Questions (English)\n\n"
    markdown_output += "\n".join(llm_response['questions_en']) + "\n\n"
    markdown_output += "## Answers (English)\n\n"
    markdown_output += "\n".join(llm_response['answers_en']) + "\n\n"
    markdown_output += "## 提問 (中文)\n\n"
    markdown_output += "\n".join(llm_response['questions_zh']) + "\n\n"
    markdown_output += "## 答案 (中文)\n\n"
    markdown_output += "\n".join(llm_response['answers_zh']) + "\n\n"

    # Save the result as a Markdown file named after the PDF
    file_name = f"{os.path.splitext(os.path.basename(file_path))[0]}.md"
    with open(file_name, "w", encoding="utf-8") as f:
        f.write(markdown_output)

Saving the generated summaries and questions as Markdown makes the results easy to view and share across different platforms.
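
The repeated string concatenation above could also be expressed as a small data-driven helper. This is a sketch of one alternative layout, not the author's code: render_markdown is a hypothetical function, and numbering list items is an assumption made so that questions and answers are easy to match up.

```python
def render_markdown(title, sections):
    """Render a title plus (heading, body) pairs as Markdown text."""
    parts = [f"# {title}"]
    for heading, body in sections:
        # Lists (such as questions or answers) become numbered items
        if isinstance(body, (list, tuple)):
            body = "\n".join(f"{i}. {item}" for i, item in enumerate(body, 1))
        parts.append(f"## {heading}\n\n{body}")
    return "\n\n".join(parts) + "\n"
```

With this shape, adding or reordering sections only means editing the list of (heading, body) pairs passed in, rather than the string-building code itself.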