Day 28 – Comprehensive Practice: Web Scraper + GUI Tool

17th Ironman Contest

chloeeee

Team: 新手小黑

2025-08-30 13:53:37

Today's Learning Goals

Use requests to fetch a page's HTML
Use BeautifulSoup to parse the HTML tags
Use tkinter to build the GUI (entry field, button, label)
Combine all three into a small tool

Implementation

import requests
from bs4 import BeautifulSoup
import tkinter as tk

def fetch_title():
    """Fetch the URL typed into the entry box and show the page's <title>."""
    url = entry.get().strip()
    # Send a browser-like User-Agent with the request
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/114.0.0.0 Safari/537.36"
    }
    try:
        response = requests.get(url, headers=headers, timeout=5)
        response.raise_for_status()  # Raise an exception for 4xx/5xx responses
        soup = BeautifulSoup(response.text, "html.parser")
        # soup.title.string is None for an empty <title>, so use get_text() instead
        title = soup.title.get_text(strip=True) if soup.title else ""
        label_result.config(text=f"Title: {title or 'Title not found'}")
    except requests.RequestException as e:
        label_result.config(text=f"Error: {e}")

# Create the window
window = tk.Tk()
window.title("Mini Scraper Tool")
window.geometry("500x200")

# URL entry field
tk.Label(window, text="Enter a URL:").pack(pady=5)
entry = tk.Entry(window, width=50)
entry.pack(pady=5)

# Button that triggers the fetch
button = tk.Button(window, text="Fetch Title", command=fetch_title)
button.pack(pady=5)

# Result label
label_result = tk.Label(window, text="Title: ", wraplength=480, justify="left")
label_result.pack(pady=10)

window.mainloop()

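Enter a URL
Click the button to fetch the title
The label below shows the page title, for example Welcome to Python.org for the Python homepage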
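
One detail worth checking in the title lookup: in BeautifulSoup, soup.title.string is None when the <title> tag exists but is empty, so the label could end up showing "None". The sketch below pulls the lookup into a small helper that can be tried without a network connection or a GUI. The name extract_title and the fallback text are my own choices for illustration, not part of the original tool.

```python
from bs4 import BeautifulSoup

def extract_title(html: str, fallback: str = "Title not found") -> str:
    """Return the stripped <title> text, or the fallback when it is missing or empty."""
    soup = BeautifulSoup(html, "html.parser")
    if soup.title is None:
        return fallback
    # get_text(strip=True) returns "" for an empty tag instead of None
    return soup.title.get_text(strip=True) or fallback

if __name__ == "__main__":
    print(extract_title("<html><head><title> Welcome to Python.org </title></head></html>"))
    print(extract_title("<html><head><title></title></head></html>"))
    print(extract_title("<p>No title tag here</p>"))
```

Inside fetch_title, the same helper could be called with response.text in place of the inline lookup.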

Fetching and Listing All Hyperlinks

We can call soup.find_all("a") and display the results in a Text widget in the GUI.

import requests
from bs4 import BeautifulSoup
import tkinter as tk

def fetch_links():
    """Fetch the URL typed into the entry box and list every <a> link on the page."""
    url = entry.get().strip()
    # Send a browser-like User-Agent with the request
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/114.0.0.0 Safari/537.36"
    }
    try:
        response = requests.get(url, headers=headers, timeout=5)
        response.raise_for_status()  # Raise an exception for 4xx/5xx responses
        soup = BeautifulSoup(response.text, "html.parser")
        links = soup.find_all("a")

        text_box.delete("1.0", tk.END)  # Clear the previous results
        for link in links:
            href = link.get("href")
            if href:  # Skip <a> tags that have no href attribute
                text_box.insert(tk.END, f"{link.get_text(strip=True)} → {href}\n")
    except requests.RequestException as e:
        text_box.delete("1.0", tk.END)
        text_box.insert(tk.END, f"Error: {e}")

# GUI
window = tk.Tk()
window.title("Hyperlink Fetcher")
window.geometry("600x400")

tk.Label(window, text="Enter a URL:").pack()
entry = tk.Entry(window, width=50)
entry.pack(pady=5)

btn = tk.Button(window, text="Fetch Links", command=fetch_links)
btn.pack(pady=5)

# Text widget that holds the list of links
text_box = tk.Text(window, wrap="word", height=15)
text_box.pack(padx=10, pady=10, fill="both", expand=True)

window.mainloop()

Enter a URL, and the tool lists every a link on the page along with its link text.
[Screenshot 2025-08-30 130715]
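
Many of the href values listed this way are relative (for example /downloads/), and some are not web pages at all (mailto: or javascript: links). As a possible refinement, the sketch below uses the standard library's urljoin to turn each href into an absolute URL and keeps only http/https results. The function name absolutize and the sample hrefs are assumptions made for illustration.

```python
from urllib.parse import urljoin, urlparse

def absolutize(base_url: str, hrefs: list[str]) -> list[str]:
    """Resolve each href against base_url and keep only http/https links."""
    result = []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        # mailto:, javascript: and similar schemes are dropped here
        if urlparse(absolute).scheme in ("http", "https"):
            result.append(absolute)
    return result

if __name__ == "__main__":
    sample_hrefs = [
        "/downloads/",
        "help.html",
        "https://docs.python.org/3/",
        "mailto:someone@example.com",
        "javascript:void(0)",
    ]
    for link in absolutize("https://www.python.org/about/", sample_hrefs):
        print(link)
```

In the GUI tool, the page URL from the entry field would serve as base_url, and the hrefs would come from the a tags returned by find_all.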

Reflections

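Today I put the pieces that had been separate (fetching a page, parsing HTML, showing a GUI) together, and they turned into an actual application! I originally wanted to scrape news headlines, but nothing came back. After some digging I found that large news sites such as BBC and CNN usually add a lot of anti-scraping measures (Cloudflare verification, content loaded dynamically with JavaScript), so what requests + BeautifulSoup receives may be an "empty page" or a "verification page".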
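
One rough way to notice this situation is to measure how much visible text the downloaded HTML actually contains once script and style tags are removed. The sketch below is only a heuristic under my own assumptions: the function names and the 200-character threshold are made up for illustration, and a short but perfectly valid page would also trip it.

```python
from bs4 import BeautifulSoup

def visible_text_length(html: str) -> int:
    """Count the characters of text left after dropping script/style/noscript tags."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    return len(soup.get_text(" ", strip=True))

def looks_js_rendered(html: str, min_chars: int = 200) -> bool:
    """Heuristic: very little visible text suggests an empty shell or a verification page."""
    return visible_text_length(html) < min_chars

if __name__ == "__main__":
    shell = (
        "<html><head><title>News</title>"
        "<script>window.app = 'a long client-side bundle would sit here';</script>"
        "</head><body><div id='root'></div></body></html>"
    )
    static_page = "<html><body><p>" + "word " * 100 + "</p></body></html>"
    print(looks_js_rendered(shell))
    print(looks_js_rendered(static_page))
```

A check like this could run on response.text before parsing, so the tool can show a clearer message than an empty result.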
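
Legal / Etiquette Reminder

Not every site can be scraped freely. Respect the rules in the target site's robots.txt.
Public university course pages and lab pages, as long as they are not behind a member login, are generally open to browsing and are often fine to practice scraping on. Commercial news sites (BBC / CNN), however, restrict this and may even block the requests.
That suddenly has me a little worried: over the past few days I scraped some pages that returned nothing, without checking beforehand. I'm not going to get arrested, am I?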
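
Checking robots.txt before fetching can be done with the standard library's urllib.robotparser. The sketch below parses an inline sample so it runs without a network connection. The rules in SAMPLE_ROBOTS, the user agent name MyLearningBot, and the example.com URLs are all invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, used only for this example
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

if __name__ == "__main__":
    parser = build_parser(SAMPLE_ROBOTS)
    # /courses/ matches no Disallow rule; /private/ does
    print(parser.can_fetch("MyLearningBot", "https://example.com/courses/"))
    print(parser.can_fetch("MyLearningBot", "https://example.com/private/notes"))
```

Against a real site, calling set_url() with the site's robots.txt address and then read() loads the live file instead of the inline sample. Note that robots.txt expresses the site's crawling preferences and does not replace reading its terms of use.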
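
Tomorrow moves on to "final project preparation": planning a "mini web scraper architecture" that combines the modules from the past few days into one complete project!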