Day 12 | Data Pipeline Magic, a First Look: Making the AI System Fetch Papers Automatically Every Day (Part 1)

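2025 iThome Ironman Contest

DAY 12

AI & Data

Paper Wandering Chronicles: AI and I Explore Tools, Compose Workflows, and Take On a Full Platform (series, part 13)

17th Ironman Contest

冒牌者症候群的軟體攻城獅 (author handle, roughly "a software engineer with impostor syndrome")

Team: 等待阿毛參賽中

2025-09-19 13:33:00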

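⚡ "AI Knowledge System Build Log": this is not a purely technical article but an engineer's magical adventure. Code is the incantation, workflows are the magic circles, and error messages are the dark curses. Ready your wand (your keyboard): today we step into the academy's foundational magic class to build a stable, scalable AI knowledge system.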

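Introduction

In the previous article we built an infrastructure that is complete but empty. PostgreSQL, the vector database, Docker, and the rest are all in place, yet the database holds no data. It is a beautiful stage with no actors on it. Every AI engineer knows this moment: the foundation is laid, but no data has flowed into the system yet.

In this article we turn that empty stage into an automated research-paper data pipeline, so that the AI knowledge system fetches, processes, and stores data every day on its own and truly comes alive.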

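Why data pipelines matter

Behind every successful AI system there is a robust data pipeline:

Google Search: continuously crawls and processes the web
Netflix recommendations: process viewing data from millions of users in near real time
ChatGPT: built on terabytes of text, with data quality controlled along the way
Without reliable data, even the strongest algorithm is useless.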

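Core challenges

Empty database: the infrastructure is in place, but it holds zero data.
Pain points of a data pipeline:
- API rate limits
- Complex PDF layouts that are easy to parse incorrectly
- Network errors and corrupted downloads
Data engineering dominates the work: as a rule of thumb, 80 to 90% of an AI system's complexity comes from keeping the data flow reliable.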
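
The first and third pain points above (rate limits, flaky or corrupted downloads) are commonly handled with retries plus a cheap integrity check. The following is a minimal sketch, not the project's actual code: `TransientError`, `retry_with_backoff`, and `looks_like_pdf` are hypothetical names introduced here, and the delay values are assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (rate limit, timeout)."""


def retry_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on TransientError with exponentially growing delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Wait base_delay * 1, 2, 4, ... seconds between attempts.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


def looks_like_pdf(data: bytes) -> bool:
    """Cheap integrity check: PDF header at the start, EOF marker near the end."""
    return data.startswith(b"%PDF-") and b"%%EOF" in data[-1024:]
```

Passing `sleep` as a parameter keeps the helper testable without real waiting. A truncated download usually fails the `%%EOF` check, so it can be retried before it ever reaches the PDF parser.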

The data pipeline architecture we built


arxiv_ingestion/
├── readme.md                  # Project overview and setup instructions
├── deploy_flows.sh            # Shell script that deploys the flows
├── prefect_entrypoint.py      # Registers the flow via flow.serve and applies the schedule
├── arxiv_pipeline.py          # Main pipeline for fetching, processing, and storing ArXiv papers
├── config.py                  # Configuration settings (e.g., API URLs, collection names, environment variables)
├── logger.py                  # Centralized logging utilities
├── exceptions.py              # Custom exception classes for error handling
│
├── db/                        # Database and storage layer
│   ├── PaperRepository.py     # Repository for paper-related CRUD operations
│   ├── factory.py             # Database session/factory creation
│   ├── minio.py               # MinIO client setup for storing PDFs
│   ├── models.py              # ORM models for entities (Paper, User, etc.)
│   └── qdrant.py              # Qdrant client setup and utilities for vector DB
│
├── services/                  # External services and processing modules
│   ├── arxiv_client.py        # Client for querying the ArXiv API
│   ├── embedding.py           # Embedding utilities for converting text into vectors
│   ├── docling.py             # PDF parsing utilities (Docling integration)
│   ├── metadata_fetcher.py    # Extracting and normalizing metadata from ArXiv papers
│   ├── pdf_parser.py          # Parsing PDFs to extract raw text or structured sections
│   └── schemas.py             # Pydantic schemas and data models
│
└── tasks/                     # Prefect tasks, modular building blocks for the pipeline
    ├── fetch_papers.py        # Task to fetch papers from ArXiv
    ├── generate_report.py     # Task to generate summary reports
    ├── process_pdfs.py        # Task to parse and extract text from PDFs
    └── qdrant_index.py        # Task to index paper chunks into Qdrant

Layer-by-Layer Breakdown

Pipeline Layer (arxiv_pipeline.py)

Orchestrates the full workflow for fetching, processing, and storing ArXiv papers
Coordinates Prefect tasks to ensure sequential execution and dependency management
Handles integration points between data retrieval, processing, embedding, and storage
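
To make the orchestration role concrete, here is a plain-Python sketch of how such a pipeline could chain its steps. It is an illustration under assumptions, not the contents of arxiv_pipeline.py: every function body is a stand-in, the arXiv IDs are placeholders, and a dict plays the role of the vector store. In the real project these steps would presumably be Prefect tasks called from a flow (Prefect's `@task` and `@flow` decorators).

```python
from dataclasses import dataclass, field


@dataclass
class Paper:
    arxiv_id: str
    title: str
    text: str = ""
    chunks: list[str] = field(default_factory=list)


def fetch_papers() -> list[Paper]:
    # Stand-in for tasks/fetch_papers.py: canned data instead of an API call.
    return [Paper("0000.00001", "Example paper A"),
            Paper("0000.00002", "Example paper B")]


def process_pdf(paper: Paper) -> Paper:
    # Stand-in for tasks/process_pdfs.py: a real task would download and parse.
    paper.text = f"Full text of {paper.title}"
    return paper


def index_paper(paper: Paper, index: dict[str, list[str]]) -> None:
    # Stand-in for tasks/qdrant_index.py: the dict replaces the vector database.
    paper.chunks = paper.text.split()
    index[paper.arxiv_id] = paper.chunks


def generate_report(processed: list[Paper], failed: list[tuple[str, str]]) -> str:
    # Stand-in for tasks/generate_report.py.
    return f"processed={len(processed)} failed={len(failed)}"


def run_pipeline() -> tuple[str, dict[str, list[str]]]:
    """Run the steps in dependency order; one bad paper must not stop the rest."""
    index: dict[str, list[str]] = {}
    processed: list[Paper] = []
    failed: list[tuple[str, str]] = []
    for paper in fetch_papers():
        try:
            index_paper(process_pdf(paper), index)
            processed.append(paper)
        except Exception as exc:  # record the failure and keep going
            failed.append((paper.arxiv_id, str(exc)))
    return generate_report(processed, failed), index
```

Isolating failures per paper is a design choice: a single malformed PDF then shows up in the report instead of aborting the whole daily run.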

Tasks Layer (tasks/)

Modular building blocks for the pipeline
fetch_papers.py: Fetches newly published papers from ArXiv
generate_report.py: Generates summary reports from paper content
process_pdfs.py: Parses PDF files to extract text or structured sections
qdrant_index.py: Indexes paper chunks into Qdrant vector database for retrieval
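
Indexing "paper chunks" implies that the parsed text is split before it is embedded. The article does not say how the chunks are produced, so the following is only one common approach, sketched under assumptions: word-based windows with a fixed overlap. `chunk_text` and its default sizes are hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word-based chunks; consecutive chunks share `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of storing some words twice.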

Services Layer (services/)

Implements core business logic and processing utilities
arxiv_client.py: Queries ArXiv API for paper metadata and PDFs
embedding.py: Generates embeddings from paper text for semantic search
docling.py: PDF parsing utilities and extraction logic
metadata_fetcher.py: Extracts and normalizes metadata from papers
pdf_parser.py: Converts PDFs into raw text or structured content
schemas.py: Pydantic models for validation and type safety
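
As an illustration of what the arXiv client and metadata extraction involve, the sketch below builds a query URL for the public arXiv API and parses the Atom feed it returns. It uses only the standard library and a plain dataclass in place of the project's Pydantic schemas. `PaperMetadata`, `build_query_url`, and `parse_feed` are hypothetical names, and the query defaults are assumptions.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlencode

ARXIV_API_URL = "http://export.arxiv.org/api/query"
ATOM = {"atom": "http://www.w3.org/2005/Atom"}  # namespace of the Atom feed


@dataclass
class PaperMetadata:
    arxiv_id: str
    title: str
    authors: list[str]
    pdf_url: Optional[str]


def build_query_url(category: str, max_results: int = 10) -> str:
    """Build a query for the newest submissions in one category."""
    params = {
        "search_query": f"cat:{category}",
        "start": 0,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return f"{ARXIV_API_URL}?{urlencode(params)}"


def parse_feed(xml_text: str) -> list[PaperMetadata]:
    """Extract normalized metadata from an arXiv Atom feed."""
    root = ET.fromstring(xml_text)
    papers: list[PaperMetadata] = []
    for entry in root.findall("atom:entry", ATOM):
        entry_url = entry.findtext("atom:id", default="", namespaces=ATOM)
        title = entry.findtext("atom:title", default="", namespaces=ATOM)
        authors = [
            author.findtext("atom:name", default="", namespaces=ATOM)
            for author in entry.findall("atom:author", ATOM)
        ]
        pdf_url = None
        for link in entry.findall("atom:link", ATOM):
            if link.get("title") == "pdf":  # the PDF link carries title="pdf"
                pdf_url = link.get("href")
        papers.append(PaperMetadata(
            arxiv_id=entry_url.split("/abs/")[-1],  # the ID follows "/abs/"
            title=" ".join(title.split()),          # collapse wrapped titles
            authors=authors,
            pdf_url=pdf_url,
        ))
    return papers
```

Keeping URL construction and feed parsing as pure functions means both can be checked against a saved feed without touching the network. The HTTP call itself would sit in a thin wrapper around them.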

Database & Storage Layer (db/)

Handles data persistence and storage management
PaperRepository.py: CRUD operations for paper entities
factory.py: Database session creation and management
minio.py: MinIO client setup for storing PDF files
models.py: ORM models representing Papers, Users, and related entities
qdrant.py: Qdrant client and utilities for storing embeddings and enabling semantic search
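
A daily job will see the same paper more than once, so the repository's write path benefits from being idempotent. The sketch below shows the idea with the standard-library `sqlite3` module standing in for PostgreSQL and the ORM. The table layout, the method names, and the `pdf_key` column (imagined as the object key of the stored PDF) are all assumptions, not the project's schema.

```python
import sqlite3
from typing import Optional


class PaperRepository:
    """Minimal repository sketch: all SQL for papers lives in one place."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS papers ("
            "arxiv_id TEXT PRIMARY KEY, title TEXT NOT NULL, pdf_key TEXT)"
        )

    def upsert(self, arxiv_id: str, title: str,
               pdf_key: Optional[str] = None) -> None:
        # Insert a new row, or update the existing one on a repeated ID.
        self.conn.execute(
            "INSERT INTO papers (arxiv_id, title, pdf_key) VALUES (?, ?, ?) "
            "ON CONFLICT(arxiv_id) DO UPDATE SET "
            "title = excluded.title, pdf_key = excluded.pdf_key",
            (arxiv_id, title, pdf_key),
        )
        self.conn.commit()

    def get(self, arxiv_id: str) -> Optional[tuple]:
        return self.conn.execute(
            "SELECT arxiv_id, title, pdf_key FROM papers WHERE arxiv_id = ?",
            (arxiv_id,),
        ).fetchone()

    def count(self) -> int:
        return self.conn.execute("SELECT COUNT(*) FROM papers").fetchone()[0]
```

With an upsert, rerunning the pipeline after a partial failure does not create duplicate rows. The same `ON CONFLICT ... DO UPDATE` clause is available in PostgreSQL.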

Configuration & Utilities (config.py, logger.py, exceptions.py)

config.py: Centralized configuration for API URLs, collection names, and environment variables
logger.py: Logging utilities for structured logging across the pipeline
exceptions.py: Custom exception classes for centralized error handling
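
A minimal sketch of how centralized configuration and a custom exception hierarchy can work together, assuming settings come from environment variables. The variable names (`ARXIV_API_URL`, `QDRANT_COLLECTION`, `ARXIV_MAX_RESULTS`), their defaults, and the exception classes are hypothetical and not taken from the project.

```python
import os
from dataclasses import dataclass
from typing import Mapping


class PipelineError(Exception):
    """Base class so callers can catch every pipeline failure in one place."""


class ConfigError(PipelineError):
    """Raised when a setting is missing or malformed."""


class PDFParseError(PipelineError):
    """Raised when a PDF cannot be parsed."""


@dataclass(frozen=True)
class Settings:
    arxiv_api_url: str
    qdrant_collection: str
    max_results: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from the environment, failing fast on bad values."""
    raw_max = env.get("ARXIV_MAX_RESULTS", "50")
    try:
        max_results = int(raw_max)
    except ValueError as exc:
        raise ConfigError(
            f"ARXIV_MAX_RESULTS must be an integer, got {raw_max!r}"
        ) from exc
    if max_results <= 0:
        raise ConfigError("ARXIV_MAX_RESULTS must be positive")
    return Settings(
        arxiv_api_url=env.get("ARXIV_API_URL", "http://export.arxiv.org/api/query"),
        qdrant_collection=env.get("QDRANT_COLLECTION", "papers"),
        max_results=max_results,
    )
```

Validating once at startup turns a typo in an environment variable into an immediate `ConfigError` instead of a confusing failure halfway through a scheduled run.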