[Day23] CAPTCHA Recognition Techniques

2024 iThome Ironman Contest

Self-Challenge Group

Part 23 of the series "Getting to Know Web Crawlers in 30 Days"

16th Ironman Contest

eyeyeyeye

2024-10-09 17:08:27

Today is day 23. My goal is to understand how CAPTCHAs work and to learn how to recognize them with Python.
Tools needed:

Python 3
pytesseract (for OCR)
Pillow (image processing)
requests (for fetching the CAPTCHA image)

1. Install the required libraries. First, make sure the necessary libraries are installed:

pip install pytesseract Pillow requests
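To confirm that the install step worked, a small check like the following can help. It is only a sketch: the helper name `missing_packages` and the `REQUIRED` mapping are my own, and the only thing it relies on is that Pillow is imported under the module name `PIL`.

```python
import importlib.util

# Maps the importable module name to the name used with pip.
# Pillow is installed as "Pillow" but imported as "PIL".
REQUIRED = {"pytesseract": "pytesseract", "PIL": "Pillow", "requests": "requests"}

def missing_packages(modules=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return [
        pip_name
        for module_name, pip_name in modules.items()
        if importlib.util.find_spec(module_name) is None
    ]

if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages, install with: pip install " + " ".join(missing))
    else:
        print("All required packages are installed.")
```

An empty result means all three libraries can be imported by the interpreter that ran the check, which is worth verifying when several Python installations coexist on the same machine.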

2. Install Tesseract OCR

Windows: download and install Tesseract from the Tesseract at UB Mannheim page.
macOS: install it with Homebrew:

brew install tesseract

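Linux: install it with your package manager (the command below is for apt-based distributions such as Debian and Ubuntu):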

sudo apt-get install tesseract-ocr

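After installation, make sure the Tesseract installation directory is added to your PATH environment variable, because pytesseract is only a wrapper that calls the tesseract executable.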
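A quick way to check the PATH setup is to look for the executable from Python. The sketch below is one possible approach, not the only one: `find_tesseract` and `configure_pytesseract` are helper names I made up. The `pytesseract.pytesseract.tesseract_cmd` attribute is how pytesseract lets you point at the executable explicitly when it is not on PATH.

```python
import shutil

def find_tesseract(candidates=("tesseract",)):
    """Return the full path of the first executable found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None

def configure_pytesseract(path):
    """Tell pytesseract which executable to use (needed only if PATH is not set)."""
    import pytesseract  # imported lazily so the PATH check works without it
    pytesseract.pytesseract.tesseract_cmd = path

if __name__ == "__main__":
    found = find_tesseract()
    if found is None:
        print("tesseract was not found on PATH; add its install directory to PATH")
    else:
        print(f"tesseract found at: {found}")
```

If the check reports nothing found on Windows, calling `configure_pytesseract` with the full path to the installed `tesseract.exe` is an alternative to editing the environment variable.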
3. Write the CAPTCHA recognition script. Below is a simple Python script that recognizes a CAPTCHA image:

import requests
from PIL import Image
import pytesseract
from io import BytesIO

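# URL of the CAPTCHA image (placeholder; replace it with a real URL)
captcha_url = 'YOUR_CAPTCHA_IMAGE_URL'

def download_captcha(url):
    """Download the CAPTCHA and return it as a PIL image, or None on failure."""
    # A timeout keeps the script from hanging if the server never responds
    response = requests.get(url, timeout=10)
    if response.status_code == 200:
        return Image.open(BytesIO(response.content))
    else:
        print('Failed to retrieve CAPTCHA image')
        return None

def recognize_captcha(captcha_image):
    """Run OCR on the image with pytesseract and return the cleaned-up text."""
    captcha_text = pytesseract.image_to_string(captcha_image)
    # Tesseract output usually ends with whitespace or a newline; strip it
    return captcha_text.strip()

# Main entry point
if __name__ == '__main__':
    captcha_image = download_captcha(captcha_url)
    if captcha_image is not None:
        captcha_image.show()  # Display the CAPTCHA image
        captcha_text = recognize_captcha(captcha_image)
        print(f'Recognized CAPTCHA: {captcha_text}')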
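The script above passes the raw image straight to Tesseract, which tends to struggle with noisy or colored CAPTCHAs. Since Pillow is listed for image processing, here is a minimal preprocessing sketch that converts the image to grayscale and binarizes it before OCR. The function names, the threshold of 140, and the uppercase-plus-digits character set are all my assumptions and need tuning for each CAPTCHA style. `--psm 7` asks Tesseract to treat the image as a single line of text.

```python
from PIL import Image

def preprocess_captcha(image, threshold=140):
    """Grayscale the image, then binarize it.

    Pixels brighter than `threshold` become white (255), the rest black (0).
    The default threshold is an assumed starting point, not a tuned value.
    """
    gray = image.convert("L")
    return gray.point(lambda p: 255 if p > threshold else 0)

def recognize_preprocessed(image):
    """Run OCR on the binarized image, restricted to an assumed character set."""
    import pytesseract  # imported lazily; requires the Tesseract executable
    config = (
        "--psm 7 "  # treat the image as a single text line
        "-c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
    )
    cleaned = preprocess_captcha(image)
    return pytesseract.image_to_string(cleaned, config=config).strip()
```

To try it, call `recognize_preprocessed(captcha_image)` in place of `recognize_captcha(captcha_image)` in the main block and compare the two results on the same image.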

4. Run the script. Save the code above in a Python file (for example, captcha_recognizer.py), then run it in a terminal: