⚡ *AI Knowledge System Build Log*: this isn't a purely technical article but an engineer's magical adventure. Programs are spells, workflows are magic circles, and error messages are dark curses. Ready your wand (keyboard): today we step into the magic academy's foundations class and build a stable, extensible AI knowledge system.
In the previous article, we built perfect but empty infrastructure. PostgreSQL, the vector database, and Docker were all in place, yet the database held no data: a beautiful stage with no actors on it. Every AI engineer knows this moment: the foundation is laid, but data has not started flowing through the system.
In this article, we turn that empty stage into an automated research-paper data pipeline, so the AI knowledge system fetches, processes, and stores data every day and truly comes alive.
Behind every successful AI system stands a robust data pipeline:
```
arxiv_ingestion/
├── readme.md               # Project overview and setup instructions
├── deploy_flows.sh         # Deployment script for the Prefect flows
├── prefect_entrypoint.py   # Registers flows via flow.serve and applies their schedules
├── arxiv_pipeline.py       # Main pipeline for fetching, processing, and storing ArXiv papers
├── config.py               # Configuration settings (API URLs, collection names, environment variables)
├── logger.py               # Centralized logging utilities
├── exceptions.py           # Custom exception classes for error handling
│
├── db/                     # Database and storage layer
│   ├── PaperRepository.py  # Repository for paper-related CRUD operations
│   ├── factory.py          # Database session/factory creation
│   ├── minio.py            # MinIO client setup for storing PDFs
│   ├── models.py           # ORM models for entities (Paper, User, etc.)
│   └── qdrant.py           # Qdrant client setup and utilities for the vector DB
│
├── services/               # External services and processing modules
│   ├── arxiv_client.py     # Client for querying the ArXiv API
│   ├── embedding.py        # Embedding utilities for converting text into vectors
│   ├── docling.py          # PDF parsing utilities (Docling integration)
│   ├── metadata_fetcher.py # Extracting and normalizing metadata from ArXiv papers
│   ├── pdf_parser.py       # Parsing PDFs to extract raw text or structured sections
│   └── schemas.py          # Pydantic schemas and data models
│
└── tasks/                  # Prefect tasks, the modular building blocks of the pipeline
    ├── fetch_papers.py     # Task to fetch papers from ArXiv
    ├── generate_report.py  # Task to generate summary reports
    ├── process_pdfs.py     # Task to parse and extract text from PDFs
    └── qdrant_index.py     # Task to index paper chunks into Qdrant
```
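Before walking through each layer, here is a minimal sketch of how the pipeline layer might wire the four tasks together. The task signatures, retry settings, and return types are illustrative assumptions, not the project's actual code:

```python
# arxiv_pipeline.py (sketch): the flow composes the four tasks in order.
from prefect import flow, task


@task(retries=3, retry_delay_seconds=10)
def fetch_papers(max_results: int = 50) -> list[dict]:
    """Fetch newly published paper metadata from ArXiv."""
    return []  # services/arxiv_client.py does the real work


@task
def process_pdfs(papers: list[dict]) -> list[dict]:
    """Download each paper's PDF and parse it into text chunks."""
    return []


@task
def qdrant_index(chunks: list[dict]) -> int:
    """Embed the chunks and upsert them into Qdrant; return the count."""
    return 0


@task
def generate_report(indexed_count: int) -> None:
    """Produce a summary report of what this run ingested."""


@flow(name="arxiv-ingestion")
def arxiv_ingestion_flow() -> None:
    papers = fetch_papers()
    chunks = process_pdfs(papers)
    indexed = qdrant_index(chunks)
    generate_report(indexed)
```

Keeping each stage a separate Prefect task gives us per-stage retries and observability for free, while the flow itself stays a readable four-line pipeline.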
**Pipeline Layer (`arxiv_pipeline.py`)**: the main flow that orchestrates fetching, processing, and storing ArXiv papers.

**Tasks Layer (`tasks/`)**

- `fetch_papers.py`: Fetches newly published papers from ArXiv
- `generate_report.py`: Generates summary reports from paper content
- `process_pdfs.py`: Parses PDF files to extract text or structured sections
- `qdrant_index.py`: Indexes paper chunks into the Qdrant vector database for retrieval

**Services Layer (`services/`)**

- `arxiv_client.py`: Queries the ArXiv API for paper metadata and PDFs
- `embedding.py`: Generates embeddings from paper text for semantic search
- `docling.py`: PDF parsing utilities and extraction logic
- `metadata_fetcher.py`: Extracts and normalizes metadata from papers
- `pdf_parser.py`: Converts PDFs into raw text or structured content
- `schemas.py`: Pydantic models for validation and type safety

**Database & Storage Layer (`db/`)**

- `PaperRepository.py`: CRUD operations for paper entities
- `factory.py`: Database session creation and management
- `minio.py`: MinIO client setup for storing PDF files
- `models.py`: ORM models representing Papers, Users, and related entities
- `qdrant.py`: Qdrant client and utilities for storing embeddings and enabling semantic search

**Configuration & Utilities (`config.py`, `logger.py`, `exceptions.py`)**

- `config.py`: Centralized configuration for API URLs, collection names, and environment variables
- `logger.py`: Logging utilities for structured logging across the pipeline
- `exceptions.py`: Custom exception classes for centralized error handling
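To make the storage side concrete, here is a minimal sketch of what the Qdrant indexing utilities might look like. The collection name, the 384-dimension vector size, the localhost URL, and the `embed` callable are all assumptions for illustration; the project's `db/qdrant.py` and `services/embedding.py` hold the real logic:

```python
# Sketch of the qdrant_index task's core logic: embed chunks, upsert points.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(url="http://localhost:6333")
COLLECTION = "arxiv_papers"  # assumed collection name

# Create the collection once (384 dims matches e.g. all-MiniLM-L6-v2).
existing = {c.name for c in client.get_collections().collections}
if COLLECTION not in existing:
    client.create_collection(
        collection_name=COLLECTION,
        vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    )


def index_chunks(chunks: list[dict], embed) -> None:
    """Embed each text chunk and upsert it with its metadata as payload."""
    points = [
        PointStruct(
            id=i,
            vector=embed(chunk["text"]),
            payload={"arxiv_id": chunk["arxiv_id"], "section": chunk.get("section")},
        )
        for i, chunk in enumerate(chunks)
    ]
    client.upsert(collection_name=COLLECTION, points=points)
```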
**ArXiv rate limiting.** The arXiv API requires a strict 3-second interval between requests. Mechanism: a smart delay between calls, plus exponential backoff when a request fails (see the summary table at the end of this article).
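A sketch of that mechanism with asyncio; `fetch_page` is a hypothetical stand-in for the real `arxiv_client` call:

```python
# Fixed 3-second spacing between ArXiv requests, plus exponential
# backoff with jitter when a request fails.
import asyncio
import random

ARXIV_DELAY_SECONDS = 3.0  # arXiv asks clients to pace their requests


async def fetch_with_backoff(fetch_page, url: str, max_retries: int = 5):
    for attempt in range(max_retries):
        try:
            result = await fetch_page(url)
            await asyncio.sleep(ARXIV_DELAY_SECONDS)  # respect the interval
            return result
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Backoff doubles each retry: 3s, 6s, 12s, ... plus jitter.
            wait = ARXIV_DELAY_SECONDS * (2 ** attempt) + random.uniform(0, 1)
            await asyncio.sleep(wait)
```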
**PDF parsing.** Traditional PDF parsing libraries have a high failure rate on research papers (math formulas, tables, two-column layouts). We chose Docling because it preserves the document's structure. For details, see Day X | Data Is the Hero: Docling's PDF Parsing Secrets 📄🛡️.
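For context, Docling's converter is typically invoked like this (a minimal sketch; the arXiv URL is just an example, and the project's `services/docling.py` wraps this with its own error handling):

```python
# Minimal Docling usage: convert a PDF and keep its structure as Markdown.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("https://arxiv.org/pdf/2408.09869")  # example URL
markdown = result.document.export_to_markdown()  # headings/tables preserved
print(markdown[:500])
```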
**Daily automation.** The pipeline automatically fetches the latest papers every day, running a four-stage flow that mirrors the tasks layer: fetch papers → process PDFs → index into Qdrant → generate a report. A minimal scheduling sketch follows below.
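This sketch reuses the flow from the earlier pipeline sketch; the deployment name and the 06:00 cron string are assumed examples, not the project's actual schedule:

```python
# prefect_entrypoint.py (sketch): register the flow and apply a daily schedule.
from arxiv_pipeline import arxiv_ingestion_flow

if __name__ == "__main__":
    arxiv_ingestion_flow.serve(
        name="arxiv-daily-ingestion",  # assumed deployment name
        cron="0 6 * * *",              # run every day at 06:00
    )
```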
**Highlights.** The pipeline's defining traits, summed up as challenges and the solutions we picked:
| Challenge | Solution | Tool |
|---|---|---|
| arXiv rate limiting | Smart delay + exponential backoff | Python + asyncio |
| Complex PDF formats | Preserve structured data | Docling |
| Daily automation | Flow + Schedule | Prefect |
We have turned the empty infrastructure into a production-grade data pipeline: automatic fetching, parsing, storage, and monitoring, laying a solid foundation for the data search and AI knowledge generation to come. Modular architecture, asynchronous processing, error handling, and reliability are what it takes to build a system that truly runs.