DAY 20: Basic Web Scraping and Data Automation in Practice

2024 iThome Ironman Contest

Python Exploration Journey: From Basics to Practice, part 20 of the series

16th Ironman Contest

Team: 資工之花

2024-10-04 15:48:18

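In a data-driven era, web scraping and automated data processing have become essential skills for developers. Today we look at how to use BeautifulSoup for simple web data scraping, how to fetch and process data through APIs, and how to write automation scripts for common file tasks such as parsing JSON and CSV.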

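Simple web data scraping with BeautifulSoup

BeautifulSoup is a powerful HTML parsing library, well suited to extracting structured data from web pages. The following example shows how to use it for simple scraping:

Install the required packages: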

pip install beautifulsoup4 requests

Send an HTTP request and parse the page content:

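import requests
from bs4 import BeautifulSoup

url = 'https://example.com'
# Set a timeout so the script cannot hang on an unresponsive server
response = requests.get(url, timeout=10)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    # Collect every <h2> element and print its text
    titles = soup.find_all('h2')
    for title in titles:
        print(title.get_text(strip=True))
else:
    print(f'Request failed with status code {response.status_code}')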

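This code fetches the text of every <h2> tag on the page and prints it. The approach works for static pages, where the data is already present in the HTML the server returns; content that is rendered later by JavaScript will not appear in response.text.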
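
To make the idea of "collect the text inside each <h2>" concrete without a network call or a third-party package, here is a minimal sketch built on the standard library's html.parser. It is not how BeautifulSoup works internally; the H2Collector class, the extract_h2_titles helper, and the sample HTML are all hypothetical names and data made up for illustration.

```python
from html.parser import HTMLParser


class H2Collector(HTMLParser):
    """Collect the text inside every <h2> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_h2 = False   # True while the parser is inside an <h2>
        self._buffer = []     # text fragments of the current heading
        self.titles = []      # finished headings, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "h2" and self._in_h2:
            # Join the fragments so nested tags such as <em> do not split the title
            self.titles.append("".join(self._buffer).strip())
            self._in_h2 = False

    def handle_data(self, data):
        if self._in_h2:
            self._buffer.append(data)


def extract_h2_titles(html):
    """Return the text of all <h2> elements found in an HTML string."""
    parser = H2Collector()
    parser.feed(html)
    parser.close()
    return parser.titles


# Made-up sample page, used instead of a live request
SAMPLE_HTML = """
<html><body>
  <h1>Site name</h1>
  <h2>First <em>article</em></h2>
  <p>Some text</p>
  <h2> Second article </h2>
</body></html>
"""

titles = extract_h2_titles(SAMPLE_HTML)
print(titles)  # ['First article', 'Second article']
```

Testing the extraction logic against a fixed HTML string like this is a cheap way to check a scraper before pointing it at a real site.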

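API integration

Sometimes the data we can scrape from a web page is not enough, and this is where APIs come in. An API offers a structured way to exchange data with a server, so we can request exactly the data we need.

Understanding APIs:

An API (application programming interface) is an entry point through which programs request data. Many websites offer APIs for developers, for example for weather forecasts or stock prices.

Making API requests with Python: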

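import requests

api_url = 'https://api.example.com/data'
params = {'key': 'YOUR_API_KEY'}  # sent as the query string: ?key=YOUR_API_KEY
response = requests.get(api_url, params=params, timeout=10)

if response.status_code == 200:
    data = response.json()  # parse the JSON response body
    print(data)
else:
    print(f'Request failed with status code {response.status_code}')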

This code sends a request to the API and parses the returned JSON into Python objects (usually a dictionary or a list) for further processing.
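
The two steps in the request above, encoding parameters into the URL and decoding a JSON body, can be tried offline with the standard library. This is only a sketch: build_url is a hypothetical helper, the extra city parameter is made up, and SAMPLE_BODY is invented data rather than output from a real API.

```python
import json
from urllib.parse import urlencode


def build_url(base_url, params):
    """Append URL-encoded query parameters to a base URL (hypothetical helper)."""
    return f"{base_url}?{urlencode(params)}"


# The 'city' parameter is made up to show how spaces are encoded
url = build_url(
    "https://api.example.com/data",
    {"key": "YOUR_API_KEY", "city": "Taipei City"},
)
print(url)  # https://api.example.com/data?key=YOUR_API_KEY&city=Taipei+City

# Invented response body standing in for what an API might return
SAMPLE_BODY = '{"city": "Taipei", "temperature": 27.5, "tags": ["humid", "cloudy"]}'

data = json.loads(SAMPLE_BODY)  # JSON object -> dict, JSON array -> list
print(type(data).__name__)      # dict
print(data["temperature"])      # 27.5
print(data["tags"][0])          # humid
```

Once the body is decoded, nested values are reached with ordinary dictionary and list indexing, which is all the later processing steps need.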

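Automation scripts

Automation scripts save time and effort by letting routine tasks run on their own. The examples below show simple scripts that send email and process files automatically:

Sending email automatically: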

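import smtplib
from email.mime.text import MIMEText

def send_email(subject, body, to_email):
    # Placeholder settings: replace with your own SMTP server and account
    smtp_server = "smtp.example.com"
    smtp_port = 587
    username = "your_email@example.com"
    password = "your_password"  # avoid hard-coding real passwords in scripts

    # Build a plain-text message with the usual headers
    msg = MIMEText(body)
    msg['Subject'] = subject
    msg['From'] = username
    msg['To'] = to_email

    # The with block closes the connection even if sending fails
    with smtplib.SMTP(smtp_server, smtp_port) as server:
        server.starttls()  # upgrade the connection to TLS before logging in
        server.login(username, password)
        server.sendmail(username, to_email, msg.as_string())

send_email("Test subject", "This is the email body", "recipient@example.com")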

This example shows how an automation script can send an email, which is enough to build a simple automated workflow.
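
When developing a mail script it helps to build and inspect the message before any SMTP connection is opened, and to keep credentials out of the source file. The sketch below does both using the standard library's EmailMessage class. The helper names and the environment variable names SMTP_USERNAME and SMTP_PASSWORD are assumptions chosen for illustration, not a convention required by smtplib.

```python
import os
from email.message import EmailMessage


def build_message(subject, body, from_email, to_email):
    """Create a plain-text email without sending it (hypothetical helper)."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = from_email
    msg["To"] = to_email
    msg.set_content(body)
    return msg


def load_credentials(env=os.environ):
    """Read SMTP credentials from assumed environment variable names.

    Returns (None, None) when the variables are not set.
    """
    return env.get("SMTP_USERNAME"), env.get("SMTP_PASSWORD")


msg = build_message(
    "Test subject",
    "This is the email body",
    "your_email@example.com",
    "recipient@example.com",
)

# Inspect the message locally; nothing is sent here
print(msg["Subject"])               # Test subject
print(msg["To"])                    # recipient@example.com
print(msg.get_content().strip())    # This is the email body
```

A message built this way can later be passed to smtplib.SMTP.send_message inside the same starttls and login sequence used in the example above.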

Processing files automatically:

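import os

def rename_files_in_directory(directory):
    for filename in os.listdir(directory):
        new_filename = filename.replace(" ", "_")
        old_path = os.path.join(directory, filename)
        new_path = os.path.join(directory, new_filename)
        # Skip unchanged names and never overwrite an existing entry
        if new_filename == filename or os.path.exists(new_path):
            continue
        os.rename(old_path, new_path)

rename_files_in_directory("/path/to/directory")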

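This simple script iterates over every entry in the specified directory (os.listdir returns subdirectories as well as files) and replaces the spaces in each name with underscores, automating the renaming.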
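
Bulk renames are hard to undo, and on POSIX systems os.rename silently replaces an existing file that has the target name. One cautious variation is to compute a rename plan first, review it, and only then apply it. The sketch below uses pathlib and a temporary directory so it can be run safely; plan_renames and apply_renames are hypothetical helper names, and the file names are made up.

```python
import tempfile
from pathlib import Path


def plan_renames(directory):
    """Return (old_path, new_path) pairs without touching the disk."""
    plan = []
    claimed = set()  # target names already promised to another file
    for path in sorted(Path(directory).iterdir()):
        new_name = path.name.replace(" ", "_")
        if not path.is_file() or new_name == path.name:
            continue  # ignore subdirectories and names without spaces
        target = path.with_name(new_name)
        if target.exists() or new_name in claimed:
            continue  # skip rather than overwrite anything
        claimed.add(new_name)
        plan.append((path, target))
    return plan


def apply_renames(plan):
    """Carry out a plan produced by plan_renames."""
    for old_path, new_path in plan:
        old_path.rename(new_path)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Made-up files; 'a b.csv' would collide with the existing 'a_b.csv'
    for name in ("my report.txt", "notes.txt", "a b.csv", "a_b.csv"):
        (root / name).write_text("demo")

    plan = plan_renames(root)
    for old_path, new_path in plan:
        print(f"{old_path.name} -> {new_path.name}")  # my report.txt -> my_report.txt

    apply_renames(plan)
    result = sorted(p.name for p in root.iterdir())
    print(result)  # ['a b.csv', 'a_b.csv', 'my_report.txt', 'notes.txt']
```

Separating planning from execution also makes it easy to add a dry-run mode that prints the plan and stops.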

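Working with files: JSON and CSV

Besides automated data fetching and processing, parsing file formats is an important part of any automation workflow. The following examples show how to work with JSON and CSV files in Python:

Reading and writing JSON files: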

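import json

# Read the JSON file
with open('data.json', 'r', encoding='utf-8') as file:
    data = json.load(file)

# Modify the data (assumes the top-level JSON value is an object)
data['new_key'] = 'new value'

# Write the JSON file back
with open('data.json', 'w', encoding='utf-8') as file:
    json.dump(data, file, indent=4, ensure_ascii=False)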

Reading and writing CSV files:

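import csv

# Read a CSV file (newline='' lets the csv module handle line endings)
with open('data.csv', 'r', newline='', encoding='utf-8') as file:
    reader = csv.reader(file)
    for row in reader:
        print(row)  # each row is a list of strings

# Write a CSV file
with open('output.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Name', 'Value'])
    writer.writerow(['Item 1', 123])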

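These examples show how to work with two common data formats. Combined with an automation workflow, they cover the parsing and processing steps.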
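
A common way to combine the two formats is to convert a JSON list of records into CSV. The sketch below does this in memory with csv.DictWriter and reads the result back with csv.DictReader. The function names and the sample records are made up for illustration, and it assumes the JSON text is an array of flat objects.

```python
import csv
import io
import json


def json_records_to_csv(json_text, fieldnames):
    """Convert a JSON array of flat objects to CSV text (hypothetical helper)."""
    records = json.loads(json_text)
    buffer = io.StringIO(newline="")
    # extrasaction='ignore' drops keys that are not listed in fieldnames
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def csv_to_records(csv_text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(csv_text, newline="")))


# Invented sample data
SAMPLE_JSON = '[{"name": "item1", "value": 123, "note": "extra"}, {"name": "item2", "value": 456}]'

csv_text = json_records_to_csv(SAMPLE_JSON, ["name", "value"])
print(csv_text)

rows = csv_to_records(csv_text)
print(rows[0])  # {'name': 'item1', 'value': '123'}
```

Note that values read back from CSV are always strings, so numeric columns such as value need an explicit int() or float() conversion before any arithmetic.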

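Mini project: automating IP address lookups

Putting today's techniques together, this section shows how to automate IP address lookups: an API is queried for details about an IP address, and the result is saved to a CSV file.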

import requests
import csv

def fetch_ip_info(ip):
    api_url = f'https://ipinfo.io/{ip}/json'
    try:
        response = requests.get(api_url, timeout=10)
    except requests.RequestException:
        # Network errors and timeouts count as a failed lookup
        return None

    if response.status_code == 200:
        return response.json()
    return None

def save_ip_info_to_csv(ip_info):
    fields = ['ip', 'city', 'region', 'country', 'org']
    with open('ip_info.csv', 'w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(['IP', 'City', 'Region', 'Country', 'ISP'])
        # .get() leaves a cell empty when the response lacks a field
        writer.writerow([ip_info.get(field, '') for field in fields])

# Example run
ip_address = '8.8.8.8'
ip_info = fetch_ip_info(ip_address)
if ip_info:
    save_ip_info_to_csv(ip_info)
    print("IP information written to CSV")
else:
    print("Could not retrieve IP information")

This program sends an API request to obtain the geographic location and related details of an IP address, then stores them in a CSV file for later analysis.
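
The same pattern extends naturally to several addresses written to one CSV. The sketch below takes the lookup function as a parameter, so the writing logic can be exercised without network access. Here fake_lookup is a stand-in that returns made-up placeholder data for an address from the documentation range 192.0.2.0/24; in real use, a function like fetch_ip_info above could be passed instead. All names in this block are hypothetical.

```python
import csv
import io

FIELDS = ["ip", "city", "region", "country", "org"]


def write_ip_rows(lookup, ip_addresses, file):
    """Look up each address and write one CSV row per successful result.

    Returns the number of rows written (header excluded).
    """
    writer = csv.writer(file)
    writer.writerow(FIELDS)
    written = 0
    for ip in ip_addresses:
        info = lookup(ip)
        if not info:
            continue  # skip addresses whose lookup failed
        # Missing fields become empty cells instead of raising KeyError
        writer.writerow([info.get(field, "") for field in FIELDS])
        written += 1
    return written


def fake_lookup(ip):
    """Stand-in for a real API call; the data is a made-up placeholder."""
    sample = {
        "192.0.2.1": {
            "ip": "192.0.2.1",
            "city": "Example City",
            "region": "Example Region",
            "country": "XX",
        }
    }
    return sample.get(ip)


buffer = io.StringIO(newline="")
count = write_ip_rows(fake_lookup, ["192.0.2.1", "192.0.2.2"], buffer)
print(count)  # 1
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with open('ip_info.csv', 'w', newline='', encoding='utf-8') in place of the in-memory buffer.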

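Today's article covered basic web scraping, API integration, writing automation scripts, and handling JSON and CSV files. These skills lay a solid foundation for more advanced data scraping and processing later on.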