[Day5] Cleaning Up Crawler Results for Cleaner Data

python

eyeyeyeye 2025-09-26 10:48:32 ‧ 185 views


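Yesterday we managed to scrape the titles and links from a site and even save them to a CSV file. In practice, though, the scraped results often have a few problems:
Some titles are empty or contain only symbols
Some links are duplicated
Some titles are too long to read comfortably
Today we will add a data cleaning step to the program so that the output is cleaner.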

Modifying the code
In crawl_titles.py, find the part that processes the results and add the following rules:

def clean_results(pairs, allowed, limit):
    seen = set()
    cleaned = []
    for text, link in pairs:
        # 1. Skip links that are not http(s)
        if not link.startswith(("http://", "https://")):
            continue

        # 2. Skip links outside the allowed domain
        if not same_domain(link, allowed):
            continue

        # 3. Clean the title: strip surrounding whitespace, skip empty titles,
        #    and truncate anything longer than 50 characters
        text = text.strip()
        if not text:
            continue
        if len(text) > 50:
            text = text[:50] + "..."

        # 4. De-duplicate (same text + URL combination)
        key = (text, link)
        if key in seen:
            continue
        seen.add(key)

        cleaned.append({"text": text, "url": link})
        if len(cleaned) >= limit:
            break
    return cleaned
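
To see the four rules at work without hitting the network, the function can be run on a handful of hand-written pairs. This is only a sketch: same_domain comes from an earlier day of the series and is not shown here, so the version below is a hypothetical stand-in that accepts the allowed host and its subdomains, and the sample data is made up.

```python
from urllib.parse import urlparse


def same_domain(link, allowed):
    """Hypothetical stand-in: True if the link's host is `allowed` or a subdomain of it."""
    host = (urlparse(link).hostname or "").lower()
    allowed = allowed.lower()
    return host == allowed or host.endswith("." + allowed)


def clean_results(pairs, allowed, limit):
    """Same logic as the function in the article."""
    seen = set()
    cleaned = []
    for text, link in pairs:
        if not link.startswith(("http://", "https://")):
            continue
        if not same_domain(link, allowed):
            continue
        text = text.strip()
        if not text:
            continue
        if len(text) > 50:
            text = text[:50] + "..."
        key = (text, link)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"text": text, "url": link})
        if len(cleaned) >= limit:
            break
    return cleaned


# Made-up sample data covering each rule
sample = [
    ("  Home  ", "https://ithelp.ithome.com.tw/"),        # kept, whitespace stripped
    ("", "https://ithelp.ithome.com.tw/empty"),           # empty title, skipped
    ("Mail us", "mailto:someone@example.com"),            # not http(s), skipped
    ("Elsewhere", "https://example.com/page"),            # other domain, skipped
    ("Home", "https://ithelp.ithome.com.tw/"),            # duplicate pair, skipped
    ("A" * 60, "https://ithelp.ithome.com.tw/long"),      # kept, truncated
]

rows = clean_results(sample, "ithelp.ithome.com.tw", 20)
for row in rows:
    print(row)
```

Only two rows survive: the stripped "Home" entry and the long title cut to 50 characters plus "...".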

Then, in main(), change the call to:

pairs = extract_links(html, start)
cleaned = clean_results(pairs, allowed, args.limit)

Testing the program

python crawl_titles.py https://ithelp.ithome.com.tw/ --allow ithelp.ithome.com.tw --limit 20 --out clean_links.csv --insecure
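
After the run, a small script can check the output file instead of eyeballing it. The sketch below assumes the CSV has a header row with columns named text and url (matching the dictionary keys in clean_results); if your file uses different headers, pass them in. The function name summarize_csv is hypothetical.

```python
import csv
from collections import Counter


def summarize_csv(path, text_col="text", url_col="url"):
    """Return (row count, URLs that appear more than once, longest title length)."""
    # utf-8-sig reads plain UTF-8 and also strips a BOM if one is present
    with open(path, newline="", encoding="utf-8-sig") as f:
        rows = list(csv.DictReader(f))
    url_counts = Counter(row[url_col] for row in rows)
    repeated = sorted(url for url, n in url_counts.items() if n > 1)
    longest = max((len(row[text_col]) for row in rows), default=0)
    return len(rows), repeated, longest
```

Calling summarize_csv("clean_links.csv") should report at most 20 rows and a longest title of at most 53 characters (50 plus the three dots). Any repeated URLs it lists are links that appeared with more than one title.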

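Hands-on:

You will find that:
Entries with empty or whitespace-only titles no longer appear (titles made up only of symbols still pass, because the code checks only for emptiness)
Titles longer than 50 characters are truncated and end with ...
An identical title-and-link pair is kept only once (the same link with a different title still counts as a separate entry)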
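
Two rules that clean_results as written does not apply are dropping titles that consist only of symbols and de-duplicating by URL alone (it keys on the title and URL together). A minimal sketch of both is shown below; the helper names are hypothetical and not part of crawl_titles.py.

```python
def has_letter_or_digit(text):
    """True if at least one character is a letter or digit (any script, including CJK)."""
    return any(ch.isalnum() for ch in text)


def dedupe_by_url(rows):
    """Keep only the first row seen for each URL; rows are dicts with a "url" key."""
    seen = set()
    unique = []
    for row in rows:
        if row["url"] in seen:
            continue
        seen.add(row["url"])
        unique.append(row)
    return unique


rows = [
    {"text": "»»", "url": "https://example.com/next"},
    {"text": "Day 5", "url": "https://example.com/a"},
    {"text": "Read more", "url": "https://example.com/a"},
]
kept = dedupe_by_url([r for r in rows if has_letter_or_digit(r["text"])])
print(kept)
```

Keeping the first title seen for each URL is a design choice; you might prefer the longest title instead.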

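Key points for today
Data cleaning is an important part of web scraping and can noticeably improve the quality of the collected data
You can adjust the cleaning rules to fit your needs, for example by limiting title length or excluding specific URLs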
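
As one way to make those rules adjustable, the length limit and a list of URL fragments to exclude can be passed in as parameters. This is a sketch under assumptions: the function names, the default values, and the "/login" fragment are illustrative only.

```python
def shorten(text, max_len=50, suffix="..."):
    """Trim text to max_len characters, appending suffix only when something was cut."""
    text = text.strip()
    return text if len(text) <= max_len else text[:max_len] + suffix


def is_blocked(url, blocked_parts):
    """True if the URL contains any of the given substrings."""
    return any(part in url for part in blocked_parts)


def apply_rules(rows, max_len=50, blocked_parts=()):
    """Return new rows with blocked URLs dropped and titles shortened."""
    return [
        {"text": shorten(row["text"], max_len), "url": row["url"]}
        for row in rows
        if not is_blocked(row["url"], blocked_parts)
    ]


rows = [
    {"text": "A fairly long article title", "url": "https://example.com/articles/1"},
    {"text": "Log in", "url": "https://example.com/login"},
]
result = apply_rules(rows, max_len=10, blocked_parts=("/login",))
print(result)
```

Substring matching is deliberately crude; for stricter control you could compare parsed URL paths instead.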
