[DAY 23] AI Literature Processing in Practice (2): From Understanding a Paper to Exporting Key Points as Markdown, with a Complete Code Walkthrough

2024 iThome Ironman Contest

DAY 23

Self-Challenge Group

30 Days of Programming Study Notes: My Self-Learning Journey, Part 23 of the series

16th Ironman Contest

lafeeleaf

2024-09-23 00:00:12

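The previous post outlined the workflow for automated literature processing. This post walks through the code in detail: automatic summarization, question-and-answer generation, and Markdown output.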

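1. Basic environment setup

First, install the Python packages used to process PDFs, generate embeddings, and generate text.

PyPDF2: reads PDF files and extracts their text content.
ollama: Python client for a locally running Ollama server, used here for embedding generation and text generation with local large language models (LLMs).
chromadb: the vector database that backs retrieval-augmented generation (RAG).

Use a Python 3.10 environment and install the packages below. The ollama package is only a client: the Ollama server must be running locally, and the models used later must already be pulled (for example with ollama pull).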

pip install PyPDF2
pip install ollama
pip install chromadb
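
Before running the pipeline, it can help to confirm that the three packages are importable. The following is a minimal sketch using only the standard library; the helper name missing_packages is hypothetical and not part of the original code.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names of the three packages installed above.
    required = ["PyPDF2", "ollama", "chromadb"]
    missing = missing_packages(required)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

This only checks that the modules can be located; it does not verify that the Ollama server is running or that the models are available.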

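2. Extracting the PDF content

We need a function that reads a PDF and converts it to plain text, which the later summarization and question-answering steps work on.

PdfReader: opens the PDF file. reader.pages is iterated to visit every page.
extract_text(): extracts the plain text of a page; the per-page results are concatenated into one string.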
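
One detail worth guarding against: pages with no extractable text (for example scanned images) may yield an empty result. Below is a minimal sketch of the page-joining step, written against any object that exposes extract_text() so it can be tried without a PDF. The helper join_page_texts and the newline separator are assumptions, not part of the original code.

```python
def join_page_texts(pages, separator="\n"):
    """Concatenate the text of each page, skipping pages that yield no text."""
    texts = []
    for page in pages:
        text = page.extract_text() or ""  # treat None or "" as "no text"
        if text.strip():
            texts.append(text)
    return separator.join(texts)


def read_pdf(file_path):
    """Read a PDF with PyPDF2 and return its text, pages joined by newlines."""
    # Imported here so join_page_texts can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader

    with open(file_path, "rb") as file:
        return join_page_texts(PdfReader(file).pages)
```

Joining pages with a separator also avoids gluing the last word of one page onto the first word of the next.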

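3. Building the vector database and storing document embeddings

The vector database is the core of retrieval-augmented generation (RAG). Here chromadb stores the embedding of each document.

chromadb.Client(): creates the vector database client (in-memory by default).
create_collection: creates the collection that holds the document embeddings.
ollama.embeddings(): generates the embedding of a document, i.e. a numeric vector representing its content.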
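
The full code stores the whole document as a single embedding. Embedding models accept a limited input length, so a common variation is to split the text into overlapping chunks and store one embedding per chunk. The sketch below prepares such chunks using only the standard library. The names split_into_chunks and build_records, the chunk size, and the overlap are illustrative assumptions; the returned lists are shaped for the ids and documents arguments of collection.add().

```python
def split_into_chunks(text, chunk_size=1000, overlap=200):
    """Split `text` into chunks of at most `chunk_size` characters.

    Consecutive chunks share `overlap` characters, so text near a boundary
    appears in both neighbouring chunks.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


def build_records(text, chunk_size=1000, overlap=200):
    """Return (ids, documents) lists, one entry per chunk."""
    chunks = split_into_chunks(text, chunk_size, overlap)
    ids = [f"chunk-{i}" for i in range(len(chunks))]
    return ids, chunks
```

Each chunk would then be embedded with ollama.embeddings() and added with its id, in the same way the full code adds the single document.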

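4. Retrieval and generation with RAG

RAG works in two steps: retrieval, then generation. Here the query embedding is used to retrieve from the database, and an Ollama model generates the response.

query_embeddings: the embedding of the user's query, passed to collection.query() to retrieve the most relevant document in the database.
generate(): produces the model response from the retrieved document content. The response can be a summary or an answer to a specific question.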
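
Conceptually, the retrieval step ranks stored embeddings by their similarity to the query embedding and hands the best match to the model as context. The sketch below illustrates that idea with plain Python lists and cosine similarity. It is not how chromadb is implemented (chromadb uses its own index and distance settings), and the names cosine_similarity, retrieve_top, and build_prompt are hypothetical. The prompt template is the one used in the full code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat as unrelated
    return dot / (norm_a * norm_b)


def retrieve_top(query_embedding, embeddings, documents):
    """Return the document whose embedding is most similar to the query."""
    scores = [cosine_similarity(query_embedding, e) for e in embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return documents[best]


def build_prompt(data, query):
    """Combine the retrieved context and the user query into one prompt."""
    return f"Using this data: {data}. Respond to this prompt: {query}"


# Toy 2-dimensional embeddings, made up for illustration only.
docs = ["text about cats", "text about stock prices"]
vecs = [[1.0, 0.1], [0.1, 1.0]]
context = retrieve_top([0.9, 0.2], vecs, docs)
print(build_prompt(context, "What animal is discussed?"))
```

With only one document in the collection, as in the full code, retrieval always returns that document; ranking starts to matter once several documents or chunks are stored.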

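5. Generating the document summary and question answers

We generate summaries of the document in two languages, plus answers to a set of key questions.

generate_summary_and_questions(): calls rag_process() to produce the English and Chinese summaries and the answers to the predefined questions.
questions_en and questions_zh: common questions for analyzing a paper; the model generates an answer to each.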
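
Because the questions and answers come back as parallel lists, they can be paired up when rendered. Below is a minimal sketch, assuming the list layout returned by generate_summary_and_questions(). The helper format_qa is hypothetical, and the sample strings are placeholders, not model output.

```python
def format_qa(questions, answers):
    """Render parallel question/answer lists as a Markdown Q&A block."""
    if len(questions) != len(answers):
        raise ValueError("questions and answers must have the same length")
    blocks = [f"**Q: {q}**\n\n{a.strip()}" for q, a in zip(questions, answers)]
    # Blank lines between blocks keep each pair in its own Markdown paragraph.
    return "\n\n".join(blocks)


sample = format_qa(
    ["What methods were used?"],
    ["(placeholder answer)  "],
)
print(sample)
```

Separating items with blank lines matters in Markdown, since lines joined by a single newline are rendered as one paragraph.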

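6. Saving the results as a Markdown file

To make later processing and reading easier, all generated content is organized as Markdown and saved locally.

os.path.basename(file_path): extracts the PDF file name, which is used to name the output Markdown file.
elapsed_time: records the processing time and is included in the file name, which makes it easy to compare the speed of each run.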
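
The file-name logic can be isolated into a small helper. The sketch below uses os.path.splitext to drop the extension regardless of its case, and replaces the colon in model tags because colons are not allowed in Windows file names. The name build_output_name is hypothetical.

```python
import os


def build_output_name(file_path, embedding_model, language_model, elapsed_seconds):
    """Build '<pdf stem>_<embedding model>_<language model>_<seconds>s.md'."""
    stem = os.path.splitext(os.path.basename(file_path))[0]
    parts = [
        stem,
        embedding_model.replace(":", "-"),
        language_model.replace(":", "-"),
        f"{elapsed_seconds:.2f}s",
    ]
    return "_".join(parts) + ".md"


print(build_output_name("papers/YOUR_PDF.PDF", "mxbai-embed-large:335m", "llama3.1:8b", 12.3456))
# prints: YOUR_PDF_mxbai-embed-large-335m_llama3.1-8b_12.35s.md
```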

Complete code

import os
from PyPDF2 import PdfReader
import ollama
import chromadb
import time

# Create the vector database (in-memory client)
client = chromadb.Client()
collection = client.create_collection(name="docs")

def read_pdf(file_path):
    """Read a PDF file and extract its text."""
    with open(file_path, 'rb') as file:
        reader = PdfReader(file)
        text = ""
        for page in reader.pages:
            # Guard against pages that yield no text (e.g. image-only pages)
            text += page.extract_text() or ""
        return text

def add_document_to_collection(file_path):
    """Add a single PDF document to the vector database."""
    text = read_pdf(file_path)
    response = ollama.embeddings(model=embedding_model, prompt=text)
    embedding = response["embedding"]
    collection.add(
        ids=["1"],  # only one file is processed, so the ID can be fixed at "1"
        embeddings=[embedding],
        documents=[text]
    )

def rag_process(query):
    """Retrieve the relevant document and generate a response with the LLM."""
    # Embed the query prompt
    response = ollama.embeddings(
        prompt=query,
        model=embedding_model
    )
    # Retrieve the most relevant document
    results = collection.query(
        query_embeddings=[response["embedding"]],
        n_results=1
    )
    data = results['documents'][0][0]

    # Generate the LLM response
    output = ollama.generate(
        model=language_model,
        prompt=f"Using this data: {data}. Respond to this prompt: {query}"
    )
    return output['response']

def generate_summary_and_questions():
    """Generate the document summaries and the answers to the questions."""
    summary_en = rag_process("Summarize the document.")
    summary_zh = rag_process("用繁體中文概括這篇文檔。")

    questions_en = [
        "What is the main focus of the paper?",
        "What methods were used?",
        "What are the key contributions?"
    ]
    
    questions_zh = [
        "這篇論文的主要焦點是什麼？",
        "使用了什麼方法？",
        "關鍵貢獻是什麼？"
    ]
    
    answers_en = [rag_process(q) for q in questions_en]
    answers_zh = [rag_process(q) for q in questions_zh]
    
    return {
        "summary_en": summary_en,
        "summary_zh": summary_zh,
        "questions_en": questions_en,
        "answers_en": answers_en,
        "questions_zh": questions_zh,
        "answers_zh": answers_zh
    }

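def process_single_document(file_path):
    """Process a single PDF file and generate a Markdown report."""
    # Start timing
    start_time = time.time()

    # Add the PDF to the vector database
    add_document_to_collection(file_path)

    # Generate the summaries and question answers
    llm_response = generate_summary_and_questions()

    # Assemble the Markdown content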
    markdown_output = f"# {os.path.basename(file_path)}\n\n"
    markdown_output += "## Summary (English)\n\n"
    markdown_output += llm_response['summary_en'] + "\n\n"
    markdown_output += "## 摘要 (中文)\n\n"
    markdown_output += llm_response['summary_zh'] + "\n\n"
    markdown_output += "## Questions (English)\n\n"
    markdown_output += "\n".join(llm_response['questions_en']) + "\n\n"
    markdown_output += "## Answers (English)\n\n"
    markdown_output += "\n".join(llm_response['answers_en']) + "\n\n"
    markdown_output += "## 提問 (中文)\n\n"
    markdown_output += "\n".join(llm_response['questions_zh']) + "\n\n"
    markdown_output += "## 答案 (中文)\n\n"
    markdown_output += "\n".join(llm_response['answers_zh']) + "\n\n"
    
    # Stop timing
    end_time = time.time()
    elapsed_time = end_time - start_time
    elapsed_time_formatted = f"{elapsed_time:.2f}s"

    # Name the Markdown file after the PDF only
    #file_name = f"{os.path.basename(file_path)}.md"

    # Name the Markdown file after the PDF, the model names, and the elapsed time
    pdf_stem = os.path.splitext(os.path.basename(file_path))[0]
    file_name = f"{pdf_stem}_{embedding_model.replace(':', '-')}_{language_model.replace(':', '-')}_{elapsed_time_formatted}.md"

    # Write the final Markdown content to the file
    with open(file_name, "w", encoding="utf-8") as f:
        f.write(markdown_output)

# Path to the single PDF file to process
pdf_file = r"C:\Users\USER\Desktop\YOUR_PDF.pdf"

# Choose whichever models suit your needs
embedding_model = "mxbai-embed-large:335m" # embedding model
language_model = "llama3.1:8b" # large language model
process_single_document(pdf_file)

When you run this code, the whole pipeline executes automatically and produces the corresponding Markdown file.