Day X | Data Is the Real Hero: Docling's PDF Parsing Playbook 📄🛡️

2025 iThome Ironman Contest

DAY 3

AI & Data

Paper Wanderings: Exploring Tools, Assembling Pipelines, and Tackling a Full Platform with AI (series, part 4)

17th Ironman Contest

冒牌者症候群的軟體攻城獅

Team: 等待阿毛參賽中

2025-09-10 15:48:57


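Today we're not talking about algorithms or recommender systems. We're talking about the behind-the-scenes hero that actually brings a system to life: data.

Even the strongest AI model is only as good as its data, and here that means the PDF data we dutifully clean, parse, and structure every day.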

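The challenges of PDF parsing

Academic PDFs are more than plain text: they carry mathematical formulas, multi-column layouts, complex tables, and reference lists.
Good data = good system; with bad data, even the AI can only run crying to its mom. 😅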

Garbage in, garbage out.

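What Docling brings

Docling official example

Docling's job is to turn PDFs into structured data that a system can consume. Its native output is a `DoclingDocument` object (a Pydantic model that dumps to a Python dict), whose top-level keys include: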

JSON Key	Meaning	Example / Notes
`origin`	Basic PDF information (filename, MIME type, hash)	`{ "filename": "2408.09869v5.pdf", "mimetype": "application/pdf" }`
`body`	Main body of the article; its `children` reference texts, tables, and pictures	`{ "cref": "#/texts/0" }`
`texts`	Text fragments (paragraphs, section headers, captions)	`label: "section_header", text: "Introduction"`
`pictures`	Images and charts (with page number, bbox, caption)	`label: "picture", page_no: 1`
`tables`	Tables (with cell data, caption, page number)	`label: "table", page_no: 5`
`groups`	Section or paragraph groups (aggregating several text/picture/table items)	`children: [{"cref": "#/texts/13"}]`
`pages`	Per-page information (size, page number)	`{ "1": { "size": { "width": 612, "height": 792 }, "page_no": 1 } }`

{
  "schema_name": "DoclingDocument",
  "version": "1.7.0",
  "name": "2408.09869v5",
  "origin": {...},
  "body": {...},
  "groups": [...],
  "texts": [...],
  "pictures": [...],
  "tables": [...],
  "pages": {...}
}

This is the "fuel" the system runs on, but in our project we wrap it in Python classes.
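The `body` entry only holds pointers such as `#/texts/0`, so reading the document in order means following those pointers into the other top-level lists. Here is a minimal sketch of that lookup. The `sample` dict is a hand-written stand-in rather than real Docling output, and `resolve_ref` and `iter_body` are hypothetical helper names. The dump in this article shows the pointer key as `cref`; the sketch also accepts `$ref` in case a differently serialized export uses that name, which is an assumption on our part.

```python
from typing import Any, Iterator


def resolve_ref(doc: dict[str, Any], ref: str) -> Any:
    """Follow a pointer such as '#/texts/0' into the dumped document dict."""
    if not ref.startswith("#/"):
        raise ValueError(f"unsupported reference: {ref!r}")
    node: Any = doc
    for part in ref[2:].split("/"):
        # Lists (texts, tables, ...) are indexed by position, dicts by key.
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


def iter_body(doc: dict[str, Any]) -> Iterator[Any]:
    """Yield the items referenced by body.children, in their listed order."""
    for child in doc["body"]["children"]:
        # Assumption: the pointer key is 'cref' (as in this article) or '$ref'.
        ref = child.get("cref") or child.get("$ref")
        yield resolve_ref(doc, ref)


# Hand-written stand-in for doc.model_dump(); not real Docling output.
sample = {
    "body": {"children": [{"cref": "#/texts/0"}, {"cref": "#/texts/1"}]},
    "texts": [
        {"label": "section_header", "text": "Introduction"},
        {"label": "text", "text": "Docling converts PDFs."},
    ],
}

body_texts = [item["text"] for item in iter_body(sample)]
print(body_texts)
```

In a real document the children can also point at groups, tables, or pictures, so a fuller version would branch on each item's `label` instead of assuming a `text` field.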

Let's look at a real example

from docling.document_converter import DocumentConverter
import json

source = "https://arxiv.org/pdf/2408.09869"  # file path or URL
converter = DocumentConverter()
doc = converter.convert(source).document
doc_dict = doc.model_dump()  # Pydantic 2.x uses model_dump()
doc_json = json.dumps(doc_dict, indent=2, ensure_ascii=False)
with open("docling_output.json", "w", encoding="utf-8") as f:
    f.write(doc_json)

{
  "schema_name": "DoclingDocument",
  "version": "1.7.0",
  "name": "2408.09869v5",
  "origin": {
    "mimetype": "application/pdf",
    "binary_hash": 11465328351749295394,
    "filename": "2408.09869v5.pdf",
    "uri": null
  },
  "furniture": {
    "self_ref": "#/furniture",
    "parent": null,
    "children": [],
    "content_layer": "furniture",
    "name": "_root_",
    "label": "unspecified"
  },
  "body": {
    "self_ref": "#/body",
    "parent": null,
    "children":[
    ],
    "content_layer": "body",
    "name": "_root_",
    "label": "unspecified"
  },
  "groups":[],
  "texts":[...],
  "pictures":[...],
  "tables":[...],
  "key_value_items": [],
  "form_items": [],
  "pages": {
    "1": {
      "size": {
        "width": 612.0,
        "height": 792.0
      },
      "image": null,
      "page_no": 1
    },
    ...
  }
}
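Before building anything on top of a dump like this, it helps to check what actually came out of the parser. The sketch below counts items by `label` and reports the number of pages. It assumes only the keys shown in this article (`texts`, `pictures`, `tables`, `pages`, `label`); `summarize` is a hypothetical helper and `sample_doc` is a made-up stand-in for the dumped dict.

```python
from collections import Counter
from typing import Any


def summarize(doc: dict[str, Any]) -> dict[str, Any]:
    """Count items per label across the content lists, plus the page count."""
    label_counts: Counter = Counter()
    for key in ("texts", "pictures", "tables"):
        for item in doc.get(key, []):
            label_counts[item.get("label", "unknown")] += 1
    return {"pages": len(doc.get("pages", {})), "labels": dict(label_counts)}


# Made-up stand-in for the dict produced by doc.model_dump().
sample_doc = {
    "texts": [
        {"label": "section_header", "text": "Introduction"},
        {"label": "text", "text": "First paragraph."},
        {"label": "text", "text": "Second paragraph."},
    ],
    "pictures": [],
    "tables": [{"label": "table"}],
    "pages": {"1": {"page_no": 1}},
}

summary = summarize(sample_doc)
print(summary)
```

A paper that reports zero tables or only a handful of text items is a quick hint that parsing went wrong before any of it reaches the model.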

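Project wrapper: making the data easier to use and read

In the project, we usually wrap Docling's raw data in Python classes so it is easier to manipulate, serialize, or store in a database: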

PdfContent(
    sections=[...],
    tables=[...],
    figures=[...],
    parser_type=ParserType.DOCLING
)

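Benefits:

Unified interface: whether the data comes from Docling, OCR, or pdfplumber, you work with it the same way
Easy to save / convert to JSON: the classes are Pydantic models, so they serialize directly with .model_dump_json() (Pydantic 2.x; .json() in 1.x)
Typed access: working with sections, tables, and figures is more intuitive and safer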
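The "unified interface" point can be sketched roughly as follows: every parser returns the same container type, so downstream code never needs to know which parser ran. Everything here is hypothetical and for illustration only. `ParsedPdf`, `PdfParser`, and the two fake parsers are invented names, the fake parsers return canned data instead of reading a file, and stdlib dataclasses stand in for the project's Pydantic models to keep the sketch dependency-free.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class ParsedPdf:
    """Simplified stand-in for the project's PdfContent wrapper."""
    sections: list[str] = field(default_factory=list)
    parser_type: str = "unknown"


class PdfParser(Protocol):
    """Anything with a matching parse() method counts as a parser."""
    def parse(self, path: str) -> ParsedPdf: ...


class FakeDoclingParser:
    def parse(self, path: str) -> ParsedPdf:
        # Canned result; a real implementation would call Docling here.
        return ParsedPdf(sections=["Introduction"], parser_type="docling")


class FakePlumberParser:
    def parse(self, path: str) -> ParsedPdf:
        # Canned result; a real implementation would call pdfplumber here.
        return ParsedPdf(sections=["Introduction"], parser_type="pdfplumber")


def section_titles(parser: PdfParser, path: str) -> list[str]:
    """Downstream code depends only on ParsedPdf, not on the parser."""
    return parser.parse(path).sections


results = [
    section_titles(parser, "paper.pdf")
    for parser in (FakeDoclingParser(), FakePlumberParser())
]
print(results)
```

Swapping parsers then becomes a one-line change at the call site, which is the main reason for wrapping the raw output at all.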

PaperSection(
    title="Introduction",
    content="This is the text of the introduction...",
    page_number=1
)

PaperTable(
    caption="Table 1",      # table caption
    data=[                  # two-dimensional array of cells
        ["Header1", "Header2"],
        ["Value1", "Value2"]
    ],
    page_number=2           # optional: page the table appears on
)

PaperFigure(
    caption="Figure 1",   # figure caption
    image_data=b"...",    # binary image data (usually PNG/JPG bytes)
    page_number=3
)

`parser_type` records which parser produced the content; here it is ParserType.DOCLING.
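How the raw `texts` list turns into section objects is not shown in this post, so here is one possible sketch, not the project's actual mapping. It starts a new section at every item labelled `section_header` and appends the following body items to it. The `Section` dataclass and `build_sections` are hypothetical names, and treating `"text"` and `"paragraph"` as the body labels is an assumption.

```python
from dataclasses import dataclass
from typing import Any

# Assumption: these labels mark ordinary body text.
BODY_LABELS = {"text", "paragraph"}


@dataclass
class Section:
    title: str
    content: str = ""


def build_sections(texts: list[dict[str, Any]]) -> list[Section]:
    """Group body text under the most recent section header."""
    sections: list[Section] = []
    current: Section | None = None
    for item in texts:
        label = item.get("label")
        if label == "section_header":
            current = Section(title=item["text"])
            sections.append(current)
        elif current is not None and label in BODY_LABELS:
            # Join paragraphs with a newline; text before the first header is skipped.
            current.content = (
                f"{current.content}\n{item['text']}" if current.content else item["text"]
            )
    return sections


# Hand-written stand-in for the `texts` list of a dumped document.
sample_texts = [
    {"label": "section_header", "text": "Introduction"},
    {"label": "text", "text": "First paragraph."},
    {"label": "text", "text": "Second paragraph."},
    {"label": "caption", "text": "Figure 1: overview"},
    {"label": "section_header", "text": "Method"},
    {"label": "text", "text": "Third paragraph."},
]

sections = build_sections(sample_texts)
print([s.title for s in sections])
```

Captions are deliberately left out of the section content here; in the wrapper they belong with the figure or table objects.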

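Summary

Data is the core: Docling turns PDFs into usable structured data
The wrapper is a convenience: Python classes make the data easier to work with and to save
Good data makes a strong system: even the best model is no substitute for clean, structured data

✅ Conclusion: Docling provides the raw structured data; wrapping it in Python classes in the project simply makes that data easier to use in downstream analysis, NLP, or RAG systems.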

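Below is the actual code from the project, for reference. Its field names differ in places from the simplified examples above: the parser field is `parser_used`, and sections carry a `level` rather than a `page_number`.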

from enum import Enum
from typing import Any, Dict, List

from pydantic import BaseModel, Field


# The enum and item models are defined first because PdfContent
# refers to them in its field annotations.
class ParserType(str, Enum):
    """PDF parser types."""

    DOCLING = "docling"


class PaperSection(BaseModel):
    """Represents a section of a paper."""

    title: str = Field(..., description="Section title")
    content: str = Field(..., description="Section content")
    level: int = Field(default=1, description="Section hierarchy level")


class PaperFigure(BaseModel):
    """Represents a figure in a paper."""

    caption: str = Field(..., description="Figure caption")
    id: str = Field(..., description="Figure identifier")


class PaperTable(BaseModel):
    """Represents a table in a paper."""

    caption: str = Field(..., description="Table caption")
    id: str = Field(..., description="Table identifier")


class PdfContent(BaseModel):
    """PDF-specific content extracted by parsers like Docling."""

    sections: List[PaperSection] = Field(
        default_factory=list, description="Paper sections"
    )
    figures: List[PaperFigure] = Field(default_factory=list, description="Figures")
    tables: List[PaperTable] = Field(default_factory=list, description="Tables")
    raw_text: str = Field(..., description="Full extracted text")
    references: List[str] = Field(default_factory=list, description="References")
    parser_used: ParserType = Field(..., description="Parser used for extraction")
    metadata: Dict[str, Any] = Field(
        default_factory=dict, description="Parser metadata"
    )