[Day 4] Where the Data Comes From: Getting Travel Information from Taiwan Government Open Datasets

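2025 iThome Ironman Contest

DAY 4

Generative AI

Smart Travel Advisor: A Taiwan Travel Information Assistant Combining LLM and RAG Architecture, Part 4 of the series

17th Ironman Contest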

terrylin0505

2025-09-18 22:47:43


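Government open datasets provide a large amount of curated, structured information, usually released in XML or CSV format. Because they come from official sources, they are generally reliable. These datasets are an excellent starting point for building a RAG knowledge base and give my AI advisor a solid foundation.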

1. How to Find and Download Datasets

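Go to the data platform: Taiwan's open data service platform (台灣資料開放服務平台).
Search by keyword: enter keywords in the search box, such as travel, tourism, attractions, or food.
Filter by format: on the results page, filter by file type. I am looking for structured data that a program can read directly, such as XML or CSV.
Download the data: click a dataset that looks useful. The dataset page usually provides a file download or an API URL.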
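
The download step can also be scripted once a dataset page gives a direct file link. The following is a minimal sketch using only the Python standard library. The function name `download_dataset` and the placeholder URL are assumptions for illustration and do not come from the platform itself.

```python
import shutil
import urllib.request
from pathlib import Path


def download_dataset(url, dest_path, timeout=30):
    """Download the resource at `url`, save it to `dest_path`, and return the Path."""
    dest = Path(dest_path)
    # Create the target folder if it does not exist yet
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url, timeout=timeout) as response, open(dest, "wb") as out_file:
        # Stream the response to disk instead of loading it all into memory
        shutil.copyfileobj(response, out_file)
    return dest


# Hypothetical usage: replace the placeholder with the download link
# shown on the dataset page.
# download_dataset("https://example.com/path/to/dataset.xml", "./ChiyaTrip.xml")
```

Saving the file locally means the parsing code can be rerun without requesting the file from the platform each time.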

2. Parsing the XML File

Suppose I have found a travel dataset named ChiyaTrip.xml. I use Python's built-in xml.etree.ElementTree to parse the file and extract the attraction information I need.

Make sure the XML file has been downloaded and placed in the project directory.

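import xml.etree.ElementTree as ET

# Path to the downloaded XML file (assumed to be in the project directory)
file_path = './ChiyaTrip.xml'


def text_or_na(parent, tag):
    """Return the child's text, or 'N/A' if the child is missing or empty."""
    child = parent.find(tag)
    return child.text if child is not None and child.text else 'N/A'


try:
    tree = ET.parse(file_path)
    root = tree.getroot()

    # Find every <spot> element anywhere in the tree
    spots = root.findall('.//spot')

    # Collect each spot's fields into a list of dictionaries
    tourism_data = []
    for spot in spots:
        tourism_data.append({
            'name': text_or_na(spot, 'Name'),
            'description': text_or_na(spot, 'Description'),
            'address': text_or_na(spot, 'Address')
        })

    print(f"Successfully parsed {len(tourism_data)} records; the first three are:\n")
    for item in tourism_data[:3]:
        print(f"Name: {item['name']}")
        print(f"Description: {item['description'][:50]}...")
        print(f"Address: {item['address']}\n")
except FileNotFoundError:
    print(f"File not found: {file_path}")
except ET.ParseError as e:
    print(f"Failed to parse the XML file: {e}")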

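This code first parses the XML file and finds every element named spot. It then builds a list to hold the data, reads each child element's text through .find() with a None check so that a missing element falls back to 'N/A', stores each record as a dictionary, and finally prints the first three records as a sanity check.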
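
The tag names spot, Name, Description, and Address depend on the dataset, so it may help to check which tags a file actually contains before relying on them. Below is a small sketch of one way to do that. The inline `SAMPLE_XML` string and the helper `count_tags` are hypothetical and only mimic the layout assumed above; a real dataset may be structured differently.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample that mimics the assumed layout; a real dataset may differ.
SAMPLE_XML = """<spots>
  <spot>
    <Name>Sample Spot A</Name>
    <Description>A short description.</Description>
    <Address>Sample address 1</Address>
  </spot>
  <spot>
    <Name>Sample Spot B</Name>
    <Address>Sample address 2</Address>
  </spot>
</spots>"""


def count_tags(root):
    """Count how often each tag name occurs in the tree, including the root."""
    return Counter(element.tag for element in root.iter())


root = ET.fromstring(SAMPLE_XML)
tag_counts = count_tags(root)

# Print every tag name with its number of occurrences
for tag, count in tag_counts.most_common():
    print(f"{tag}: {count}")
```

For a downloaded file, `ET.parse(file_path).getroot()` can be passed to `count_tags` in place of the sample. If a tag such as Description appears less often than spot, some records are missing that field, which is the case the 'N/A' fallback covers.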