Today we are not talking about algorithms or recommendation systems. We are talking about the unsung hero that actually keeps a system alive: data.

No matter how strong an AI model is, it still runs on data, which in our case means the PDF files that get dutifully collected, parsed, and structured every day. As the saying goes: garbage in, garbage out.
Docling's mission is to turn PDFs into structured data that downstream systems can use. Its native output is typically a Python dict (or parser object) containing:
| JSON Key | Meaning | Example / Notes |
|---|---|---|
| `origin` | Basic PDF info (filename, MIME type, hash) | `{ "filename": "2408.09869v5.pdf", "mimetype": "application/pdf" }` |
| `body` | Document body; references texts/tables/pictures via `children` | `{ "cref": "#/texts/0" }` |
| `texts` | Text fragments (paragraphs, section headers, captions) | `label: "section_header", text: "Introduction"` |
| `pictures` | Images and figures (with page number, bbox, caption) | `label: "picture", page_no: 1` |
| `tables` | Tables (with cell data, caption, page number) | `label: "table", page_no: 5` |
| `groups` | Section/paragraph groups that aggregate multiple text/picture/table items | `children: [{"cref": "#/texts/13"}]` |
| `pages` | Per-page info (size, page number) | `{ "1": { "size": { "width": 612, "height": 792 }, "page_no": 1 } }` |
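Those `cref` strings are JSON-pointer-style references into the top-level arrays. A minimal sketch of how such a pointer can be resolved with only the standard library (the `doc` dict below is a hand-made stub, not real Docling output):

```python
# Resolve a "#/texts/0"-style cref pointer inside a DoclingDocument-shaped dict.
# The sample document is a hand-made stub for illustration.
doc = {
    "texts": [{"label": "section_header", "text": "Introduction"}],
    "body": {"children": [{"cref": "#/texts/0"}]},
}

def resolve_cref(doc: dict, cref: str):
    """Follow a '#/key/index'-style reference down to the node it points to."""
    node = doc
    for part in cref.lstrip("#/").split("/"):
        # List segments are numeric indices; dict segments are keys.
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node

first = resolve_cref(doc, doc["body"]["children"][0]["cref"])
print(first["text"])  # Introduction
```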
```json
{
  "schema_name": "DoclingDocument",
  "version": "1.7.0",
  "name": "2408.09869v5",
  "origin": {...},
  "body": {...},
  "groups": [...],
  "texts": [...],
  "pictures": [...],
  "tables": [...],
  "pages": {...}
}
```
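Because the output is a plain dict, a quick way to see what a document contains is to count the items in each top-level array. A stdlib-only sketch (the dict below is a hand-made stub, not real Docling output):

```python
# Count content items in a DoclingDocument-shaped dict.
# The stub below is hand-made for illustration, not real Docling output.
doc_dict = {
    "schema_name": "DoclingDocument",
    "texts": [{"label": "section_header"}, {"label": "text"}],
    "pictures": [{"label": "picture"}],
    "tables": [],
}
summary = {key: len(doc_dict.get(key, [])) for key in ("texts", "pictures", "tables")}
print(summary)  # {'texts': 2, 'pictures': 1, 'tables': 0}
```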
This is the "fuel" the system runs on, but in our project we wrap it in our own Python classes rather than using it raw.

Let's look at a concrete example:
```python
from docling.document_converter import DocumentConverter
import json

source = "https://arxiv.org/pdf/2408.09869"  # file path or URL
converter = DocumentConverter()
doc = converter.convert(source).document

doc_dict = doc.model_dump()  # Pydantic 2.x: use model_dump()
doc_json = json.dumps(doc_dict, indent=2, ensure_ascii=False)
with open("docling_output.json", "w", encoding="utf-8") as f:
    f.write(doc_json)
```
```json
{
  "schema_name": "DoclingDocument",
  "version": "1.7.0",
  "name": "2408.09869v5",
  "origin": {
    "mimetype": "application/pdf",
    "binary_hash": 11465328351749295394,
    "filename": "2408.09869v5.pdf",
    "uri": null
  },
  "furniture": {
    "self_ref": "#/furniture",
    "parent": null,
    "children": [],
    "content_layer": "furniture",
    "name": "_root_",
    "label": "unspecified"
  },
  "body": {
    "self_ref": "#/body",
    "parent": null,
    "children": [],
    "content_layer": "body",
    "name": "_root_",
    "label": "unspecified"
  },
  "groups": [],
  "texts": [...],
  "pictures": [...],
  "tables": [...],
  "key_value_items": [],
  "form_items": [],
  "pages": {
    "1": {
      "size": {
        "width": 612.0,
        "height": 792.0
      },
      "image": null,
      "page_no": 1
    }
  }
}
```
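One detail worth knowing about `pages`: the `width` and `height` values are in PDF points (1 point = 1/72 inch), so 612 × 792 is exactly US Letter:

```python
# PDF page sizes are expressed in points; 72 points = 1 inch.
page = {"size": {"width": 612.0, "height": 792.0}, "page_no": 1}
width_in = page["size"]["width"] / 72    # 8.5
height_in = page["size"]["height"] / 72  # 11.0
print(f"{width_in} x {height_in} inches")  # 8.5 x 11.0 inches
```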
In a project, we typically wrap Docling's raw output in Python classes so it is easier to manipulate, serialize, or store in a database:
```python
PdfContent(
    sections=[...],
    tables=[...],
    figures=[...],
    parser_type=ParserType.DOCLING
)
```
Benefits: fields are validated when the object is constructed, and the whole thing serializes easily (Pydantic's `.json()` in 1.x, `model_dump_json()` in 2.x), which makes storage and transport trivial.
```python
PaperSection(
    title="Introduction",
    content="This is the text of the introduction...",
    page_number=1
)
```

```python
PaperTable(
    caption="Table 1",           # table caption
    data=[                       # 2D array of cells
        ["Header1", "Header2"],
        ["Value1", "Value2"]
    ],
    page_number=2                # optional: page the table appears on
)
```

```python
PaperFigure(
    caption="Figure 1",          # figure caption
    image_data=b"...",           # binary image data (usually PNG/JPG bytes)
    page_number=3
)
```
The parser type used is recorded as well, corresponding to `ParserType.DOCLING`.
✅ Conclusion: Docling provides the raw structured data; wrapping it in Python classes simply makes that data easier to consume in downstream analysis, NLP, or RAG pipelines.

Below is the actual model code from the project, for reference:
```python
from enum import Enum
from typing import Any, Dict, List

from pydantic import BaseModel, Field


class ParserType(str, Enum):
    """PDF parser types."""
    DOCLING = "docling"


class PaperSection(BaseModel):
    """Represents a section of a paper."""
    title: str = Field(..., description="Section title")
    content: str = Field(..., description="Section content")
    level: int = Field(default=1, description="Section hierarchy level")


class PaperFigure(BaseModel):
    """Represents a figure in a paper."""
    caption: str = Field(..., description="Figure caption")
    id: str = Field(..., description="Figure identifier")


class PaperTable(BaseModel):
    """Represents a table in a paper."""
    caption: str = Field(..., description="Table caption")
    id: str = Field(..., description="Table identifier")


class PdfContent(BaseModel):
    """PDF-specific content extracted by parsers like Docling."""
    sections: List[PaperSection] = Field(
        default_factory=list, description="Paper sections"
    )
    figures: List[PaperFigure] = Field(default_factory=list, description="Figures")
    tables: List[PaperTable] = Field(default_factory=list, description="Tables")
    raw_text: str = Field(..., description="Full extracted text")
    references: List[str] = Field(default_factory=list, description="References")
    parser_used: ParserType = Field(..., description="Parser used for extraction")
    metadata: Dict[str, Any] = Field(
        default_factory=dict, description="Parser metadata"
    )
```
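To show the wrap-then-serialize idea end to end without requiring Pydantic, here is a stdlib-only analogue using `dataclasses` (the field names mirror the project's models, but the sample values are made up):

```python
import json
from dataclasses import dataclass, field, asdict
from enum import Enum
from typing import List

class ParserType(str, Enum):
    DOCLING = "docling"

@dataclass
class PaperSection:
    title: str
    content: str
    level: int = 1

@dataclass
class PdfContent:
    sections: List[PaperSection] = field(default_factory=list)
    raw_text: str = ""
    parser_used: ParserType = ParserType.DOCLING

content = PdfContent(
    sections=[PaperSection(title="Introduction", content="...")],
    raw_text="full text ...",
)
# asdict() recurses into nested dataclasses; the str-mixin Enum
# serializes as its plain string value.
payload = json.dumps(asdict(content), ensure_ascii=False)
print(payload)
```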