In the previous article, we trained a CNN model that can recognize the corresponding sign-language letter from a static hand-gesture image. Real-world interaction, however, is dynamic: a user signs in front of the camera and expects an immediate result.
So today we will integrate MediaPipe Hands hand-landmark detection, letting the model crop the gesture region (ROI) from a live camera feed and classify it, for truly interactive sign-language recognition.
Goals for this article:
- Detect the hand in each live webcam frame with MediaPipe Hands and crop the gesture region (ROI)
- Feed the cropped ROI into the previously trained CNN for real-time letter prediction
First, define the same CNN architecture used during training and load the trained .pth weight file:
import cv2
import mediapipe as mp
import torch
import torchvision.transforms as transforms
from PIL import Image
import numpy as np
import torch.nn as nn
# CNN architecture (must match the one used during training)
class SimpleCNN(nn.Module):
    def __init__(self, num_classes=24):
        super(SimpleCNN, self).__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 3 * 3, 256), nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, num_classes)
        )

    def forward(self, x):
        return self.fc(self.cnn(x))
Load the model (note that the model itself must also be moved to the same device as the input tensors):
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SimpleCNN().to(device)
model.load_state_dict(torch.load("model.pth", map_location=device))
model.eval()
Initialize MediaPipe Hands:
mp_hands = mp.solutions.hands
hands = mp_hands.Hands(static_image_mode=False, max_num_hands=1)
mp_draw = mp.solutions.drawing_utils
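If detection is unstable under your lighting conditions, the detector can optionally be tuned via its confidence thresholds. This is a sketch of the same initialization with the extra parameters spelled out; 0.5 is MediaPipe's default for both, so the values below are only a starting point.

hands = mp_hands.Hands(
    static_image_mode=False,      # video mode: track across frames instead of re-detecting each one
    max_num_hands=1,
    min_detection_confidence=0.5, # raise to reduce false hand detections
    min_tracking_confidence=0.5,  # raise to drop tracking sooner when the hand is lost
)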
Define the image transform pipeline (must match the one used during training):
transform = transforms.Compose([
    transforms.Resize((28, 28)),
    transforms.Grayscale(),
    transforms.ToTensor(),
])
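Before starting the webcam loop, it is worth sanity-checking the model and transform offline on a single image. This is a minimal sketch; "sample_hand.png" is a hypothetical path to any hand image you have on disk, and the index-to-letter mapping is the same one used in the live loop below.

# Offline sanity check (hypothetical sample image path)
test_img = Image.open("sample_hand.png")
x = transform(test_img).unsqueeze(0).to(device)   # expected shape: (1, 1, 28, 28)
with torch.no_grad():
    pred = torch.argmax(model(x), dim=1).item()
print("input shape:", tuple(x.shape))
print("predicted letter:", chr(pred + 65 if pred < 9 else pred + 66))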
Now use MediaPipe to detect the hand position in each frame, automatically crop the ROI, and feed it to the CNN for inference.
cap = cv2.VideoCapture(0)

while True:
    ret, frame = cap.read()
    if not ret:
        break

    # Convert to RGB for MediaPipe
    img_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    results = hands.process(img_rgb)

    if results.multi_hand_landmarks:
        for hand_landmarks in results.multi_hand_landmarks:
            h, w, _ = frame.shape
            x_list = [lm.x for lm in hand_landmarks.landmark]
            y_list = [lm.y for lm in hand_landmarks.landmark]
            xmin = int(min(x_list) * w)
            xmax = int(max(x_list) * w)
            ymin = int(min(y_list) * h)
            ymax = int(max(y_list) * h)

            # Add a margin so the hand is not cropped too tightly
            margin = 20
            xmin = max(xmin - margin, 0)
            ymin = max(ymin - margin, 0)
            xmax = min(xmax + margin, w)
            ymax = min(ymax + margin, h)

            # Crop the ROI and convert it to a PIL image
            hand_roi = frame[ymin:ymax, xmin:xmax]
            if hand_roi.size == 0:  # skip degenerate boxes at the frame edge
                continue
            hand_pil = Image.fromarray(cv2.cvtColor(hand_roi, cv2.COLOR_BGR2RGB))
            input_tensor = transform(hand_pil).unsqueeze(0).to(device)

            # Model inference
            with torch.no_grad():
                output = model(input_tensor)
            pred = torch.argmax(output, dim=1).item()

            # Map the class index back to a letter: A-Y, skipping J (label 9 is unused)
            label_char = chr(pred + 65 if pred < 9 else pred + 66)

            # Draw the prediction and a green bounding box
            cv2.rectangle(frame, (xmin, ymin), (xmax, ymax), (0, 255, 0), 2)
            cv2.putText(frame, f"{label_char}", (xmin, ymin - 10),
                        cv2.FONT_HERSHEY_SIMPLEX, 1.2, (0, 255, 0), 2)
            mp_draw.draw_landmarks(frame, hand_landmarks, mp_hands.HAND_CONNECTIONS)

    # Show the annotated frame
    cv2.imshow("Hand Sign Prediction", frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()
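In practice the per-frame prediction can flicker as the hand moves. One simple stabilization trick, not part of the code above, is to keep the last few predictions in a deque and display the majority label. A minimal sketch:

from collections import deque, Counter

history = deque(maxlen=15)  # last 15 per-frame predictions

def smoothed_label(pred_char):
    """Return the most common letter seen in the recent history."""
    history.append(pred_char)
    return Counter(history).most_common(1)[0][0]

# Inside the loop, display smoothed_label(label_char) instead of label_char.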
We have successfully integrated the static sign-language recognition model with a live webcam feed, using MediaPipe to automatically locate the hand region, achieving interactive sign-language recognition.
Possible extensions from here:
Composing complete words or sentences from the recognized letters, or combining motion trajectories to recognize "dynamic gestures" such as J and Z (a sketch of trajectory collection follows below).
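Dynamic letters like J and Z cannot be classified from a single frame, so the first step is collecting a short trajectory of landmark positions over time. The sketch below only shows that buffering step; the sequence model that would consume the trajectory (e.g. an LSTM) is hypothetical and not covered in this article.

from collections import deque
import numpy as np

TRAJ_LEN = 30                       # roughly one second of frames at 30 FPS
trajectory = deque(maxlen=TRAJ_LEN)  # rolling buffer of fingertip positions

# Inside the MediaPipe loop, after obtaining hand_landmarks:
#   tip = hand_landmarks.landmark[mp_hands.HandLandmark.INDEX_FINGER_TIP]
#   trajectory.append((tip.x, tip.y))          # normalized [0, 1] coordinates
#   if len(trajectory) == TRAJ_LEN:
#       seq = np.array(trajectory, dtype=np.float32)  # shape (30, 2), input for a future sequence model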