[Day 15] Duplicate Finder: Find Duplicate Files by Size + Hash, Export a CSV, and Generate a Deletion Script

python

eyeyeyeye 2025-10-02 16:18:36

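Repeated backups, piles of downloaded resources, and photo syncing tend to leave duplicate files behind.
Today's tool:
Finds duplicates in two stages, file size → content hash (this avoids hashing the entire directory, which is slow)
Supports extension filtering (.pdf, .jpg…), recursion into subfolders, and a minimum file size
Writes a CSV report (group, which copy to keep, size, hash, path)
Can also generate a PowerShell deletion script that you review before running, to avoid accidental deletion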

Environment / Installation
Standard library only; there is nothing to install.

Code (save as dup_finder.py)

# dup_finder.py: Day 15, find duplicate files (size + hash), write a CSV, generate a deletion script (never deletes automatically)
from __future__ import annotations
import argparse, csv, hashlib
from pathlib import Path
from typing import Dict, List, Iterable

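def iter_files(src: Path, recursive: bool, patterns: List[str] | None) -> Iterable[Path]:
    """Yield each matching file once, even if several patterns match it."""
    pats = patterns or ["*"]
    seen = set()
    for pat in pats:
        globber = src.rglob if recursive else src.glob
        for p in globber(pat):
            if p.is_file():
                rp = p.resolve()  # resolved path, used only to detect repeats
                if rp not in seen:
                    seen.add(rp)
                    # absolute, so the CSV and the deletion script do not depend on the current directory
                    yield p.absolute()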

def hash_file(p: Path, algo: str = "sha1", chunk_size: int = 1024 * 1024) -> str:
    h = hashlib.new(algo)
    with p.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def write_csv(rows: List[dict], out: Path):
    out.parent.mkdir(parents=True, exist_ok=True)
    cols = ["group_id","keep","path","size","hash"]
    with out.open("w", encoding="utf-8", newline="") as f:
        w = csv.DictWriter(f, fieldnames=cols)
        w.writeheader()
        for r in rows: w.writerow(r)

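def make_delete_ps1(groups: Dict[str, List[Path]], out_ps1: Path):
    out_ps1.parent.mkdir(parents=True, exist_ok=True)
    # utf-8-sig writes a BOM so Windows PowerShell 5.1 decodes non-ASCII paths correctly
    with out_ps1.open("w", encoding="utf-8-sig") as f:
        f.write("# Auto-generated script for deleting duplicate files. Review it before running!\n")
        f.write("$ErrorActionPreference = 'Stop'\n\n")
        gid = 0
        for h, files in groups.items():
            if len(files) < 2: continue
            gid += 1
            keep = files[0]
            f.write(f"# Group {gid}  keep: {keep}\n")
            for dup in files[1:]:
                # inside a single-quoted PowerShell string, a literal ' is written as ''
                quoted = str(dup).replace("'", "''")
                f.write(f"Remove-Item -LiteralPath '{quoted}' -Force\n")
            f.write("\n")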

def main():
    ap = argparse.ArgumentParser(description="Find duplicate files (size + hash), write a CSV, and optionally generate a PowerShell deletion script")
    ap.add_argument("--src", type=Path, required=True, help="source folder")
    ap.add_argument("--recursive", action="store_true", help="include subfolders")
    ap.add_argument("--match", nargs="*", help="filename patterns, e.g. '*.pdf' '*.jpg'")
    ap.add_argument("--min-size", type=int, default=1, help="only check files of at least this size (bytes)")
    ap.add_argument("--algo", choices=["md5","sha1","sha256"], default="sha1", help="hash algorithm")
    ap.add_argument("--chunk-size", type=int, default=1024*1024, help="read size per chunk while hashing (bytes)")
    ap.add_argument("--out", type=Path, default=Path("exports/dups.csv"), help="output CSV")
    ap.add_argument("--make-delete-script", action="store_true", help="also generate the PowerShell deletion script exports/delete_dups.ps1")
    args = ap.parse_args()

    files = [p for p in iter_files(args.src, args.recursive, args.match) if p.stat().st_size >= args.min_size]
    if not files:
        print("No files found (check --match or the path)")
        return

    # Step 1: group by size first
    by_size: Dict[int, List[Path]] = {}
    for p in files:
        by_size.setdefault(p.stat().st_size, []).append(p)

    # Step 2: hash only the groups where more than one file shares the same size
    dup_groups: Dict[str, List[Path]] = {}
    for size, group in by_size.items():
        if len(group) < 2: continue
        for p in group:
            h = hash_file(p, args.algo, args.chunk_size)
            key = f"{size}:{h}"
            dup_groups.setdefault(key, []).append(p)

    # Export the CSV
    rows: List[dict] = []
    group_id = 0
    real_groups: Dict[str, List[Path]] = {}
    for key, members in dup_groups.items():  # "members" avoids shadowing the scanned files list
        if len(members) < 2: continue
        group_id += 1
        size, h = key.split(":", 1)
        keep = members[0]  # keep policy: the first file in each group
        real_groups[key] = members
        rows.append({"group_id": group_id, "keep": "YES", "path": str(keep), "size": size, "hash": h})
        for dup in members[1:]:
            rows.append({"group_id": group_id, "keep": "", "path": str(dup), "size": size, "hash": h})

    write_csv(rows, args.out)
    print(f"✅ CSV written: {args.out} (groups: {group_id})")

    if args.make_delete_script and group_id > 0:
        ps1 = Path("exports/delete_dups.ps1")
        make_delete_ps1(real_groups, ps1)
        print(f"🧹 Deletion script generated (nothing is deleted automatically; review it, then run it manually): {ps1}")

if __name__ == "__main__":
    main()

How to use
Reminder: in PowerShell, put wildcards in quotes, e.g. '*.pdf'

Scan the whole folder (including subfolders) for duplicate PDFs

python .\dup_finder.py --src . --recursive --match '*.pdf' --out .\exports\dups_pdf.csv
ii .\exports   # open the folder to see the results

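Scan common image formats and skip tiny files (anything under 10 KB is ignored)

python .\dup_finder.py --src . --recursive --match '*.jpg' '*.jpeg' '*.png' --min-size 10240 --out .\exports\dups_images.csv

For a stricter check, switch to SHA-256 (slower, but with a lower collision probability)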

python .\dup_finder.py --src . --recursive --match '*.*' --algo sha256 --out .\exports\dups_all.csv

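Generate the deletion script (nothing is deleted automatically)
Running this writes a PowerShell script to exports\delete_dups.ps1 that keeps the first file in each group and deletes the other files in that group.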

python .\dup_finder.py --src . --recursive --match '*.pdf' --make-delete-script
ii .\exports

Before deleting anything, be sure to open delete_dups.ps1 and check it line by line. Its contents look something like this:

# Group 1  keep: D:\Docs\report.pdf
Remove-Item -LiteralPath 'D:\Docs\Copy (2) of report.pdf' -Force
Remove-Item -LiteralPath 'D:\Old\report (1).pdf' -Force
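
Time can pass between the scan and the deletion, so it may be worth re-checking the report right before running the script. The following is a minimal sketch, not part of dup_finder.py. It assumes the CSV columns that dup_finder.py writes, the same hash algorithm that was used for the scan, and paths that still resolve from the current directory. The function name stale_rows is hypothetical.

```python
import csv
import hashlib
from pathlib import Path
from typing import List

def stale_rows(csv_path: Path, algo: str = "sha1") -> List[str]:
    """Return the paths from the report that are missing or whose content hash has changed."""
    stale = []
    with csv_path.open(encoding="utf-8", newline="") as f:
        for row in csv.DictReader(f):
            p = Path(row["path"])
            if not p.is_file():
                stale.append(row["path"])  # file was moved or deleted since the scan
                continue
            h = hashlib.new(algo)
            with p.open("rb") as fh:
                # read in 1 MiB chunks so large files are not loaded into memory at once
                for chunk in iter(lambda: fh.read(1024 * 1024), b""):
                    h.update(chunk)
            if h.hexdigest() != row["hash"]:
                stale.append(row["path"])  # content changed since the scan
    return stale
```

An empty list from stale_rows(Path("exports/dups.csv")) suggests the report still matches what is on disk; any path it returns is a reason to rescan before deleting.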

If running the script is blocked by the execution policy, run this first (it only affects the current PowerShell window):

Set-ExecutionPolicy -Scope Process -ExecutionPolicy Bypass
.\exports\delete_dups.ps1

In action:
![https://ithelp.ithome.com.tw/upload/images/20250930/20169368aQtdlfu09g.png](https://ithelp.ithome.com.tw/upload/images/20250930/20169368aQtdlfu09g.png)

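CSV columns
group_id: number of the group of files that share the same size and hash
keep: YES marks the file kept by default in each group (the first one)
path: full path of the file
size: size in bytes
hash: content hash (identical within a group)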
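
With these columns, the report is enough to estimate how much space the duplicates take up. A minimal sketch, assuming a CSV written by dup_finder.py with exactly the columns above; the helper name reclaimable_bytes is made up for illustration and is not part of the tool.

```python
import csv
from pathlib import Path

def reclaimable_bytes(csv_path: Path) -> int:
    """Sum the size of every row not marked keep=YES (the copies the script would delete)."""
    total = 0
    with csv_path.open(encoding="utf-8", newline="") as f:
        for row in csv.DictReader(f):
            if row["keep"] != "YES":
                total += int(row["size"])
    return total
```

For example, reclaimable_bytes(Path("exports/dups.csv")) would return the number of bytes freed if every non-kept copy in the report were deleted.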
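
Why two stages, size → hash?
Hashing every file up front is very time-consuming
Files of different sizes cannot be identical, so most files never need a content comparison; hashing only the groups with the same size and more than one file is much faster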
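
The saving can be made visible by counting how many files actually get hashed. This is a condensed sketch of the same two-stage idea, not the code from dup_finder.py: the function name two_stage_groups is hypothetical, SHA-1 is hard-coded, and each file is read in one go, which is only reasonable for a small demo.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List, Tuple

def two_stage_groups(paths: Iterable[Path]) -> Tuple[List[List[Path]], int]:
    """Return (duplicate groups, number of files that actually had to be hashed)."""
    # Stage 1: group by size, which only needs file metadata
    by_size: Dict[int, List[Path]] = defaultdict(list)
    for p in paths:
        by_size[p.stat().st_size].append(p)

    # Stage 2: hash only the files that share their size with at least one other file
    hashed = 0
    by_hash: Dict[Tuple[int, str], List[Path]] = defaultdict(list)
    for size, group in by_size.items():
        if len(group) < 2:
            continue  # a unique size cannot have a duplicate, so no hashing is needed
        for p in group:
            digest = hashlib.sha1(p.read_bytes()).hexdigest()
            hashed += 1
            by_hash[(size, digest)].append(p)

    return [g for g in by_hash.values() if len(g) > 1], hashed
```

Comparing the returned count with the total number of files scanned shows how much hashing the size stage avoided on a given folder.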

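Tips
No files found: first run Get-ChildItem -Recurse -Filter *.pdf | Select-Object -First 10 to confirm the directory really contains files matching the pattern
Very slow: narrow the file types with --match, raise --min-size, or increase --chunk-size (e.g. 2 MB)
External drives / network drives: speed depends heavily on device I/O, so be patient
Keep policy: currently the first file in each group is kept by default; if you prefer another rule (e.g. "keep the newest date"), that can go into a later revision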
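
The keep policy in the last tip is easy to swap out, because the first file of each group is the one that gets kept. Below is a minimal sketch of a "keep the newest" rule, assuming "newest" means the latest modification time (st_mtime). The helper name order_newest_first is hypothetical and not part of dup_finder.py.

```python
from pathlib import Path
from typing import List

def order_newest_first(files: List[Path]) -> List[Path]:
    """Sort one duplicate group so that the most recently modified file comes first."""
    return sorted(files, key=lambda p: p.stat().st_mtime, reverse=True)
```

Applying it to each group's list before the keeper is picked would make both the CSV and the deletion script keep the newest copy. Modification times can be reset by copying or syncing, so the result still deserves a look before anything is deleted.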

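Today's summary
We built a safe Duplicate Finder: it groups by size before hashing and gives you a clear CSV report plus a deletion script you can review
Fully offline, zero dependencies, PowerShell friendly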