In the previous articles, we finished training on the Sign Language MNIST dataset and built a CNN model that classifies hand signs. The next step is to combine that model with MediaPipe Hands and OpenCV so the computer can recognize signs in a live video stream. To bring the system closer to a real application, emitting a single inference result is not enough: we also need real-time feedback mechanisms, such as drawing the predicted letter and its confidence on screen and filtering out unreliable detections. The complete program below puts these pieces together:
import cv2
import mediapipe as mp
import torch
import torchvision.transforms as transforms
from PIL import Image
import torch.nn as nn
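### CNN architecture; it must match the model trained in the earlier articles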
class SimpleCNN(nn.Module):
    def __init__(self, num_classes=24):
        super(SimpleCNN, self).__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 3 * 3, 256), nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, num_classes)
        )
    def forward(self, x):
        return self.fc(self.cnn(x))
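### load the trained weights and switch to inference mode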
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SimpleCNN().to(device)
model.load_state_dict(torch.load("model.pth", map_location=device))
model.eval()
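### set up MediaPipe Hands in video mode, tracking at most one hand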
mp_hands = mp.solutions.hands
hands = mp_hands.Hands(static_image_mode=False, max_num_hands=1)
mp_draw = mp.solutions.drawing_utils
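### open the default webcam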
cap = cv2.VideoCapture(0)
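### preprocess the ROI into a 28x28 grayscale tensor, matching the Sign Language MNIST format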
transform = transforms.Compose([
    transforms.Resize((28, 28)),
    transforms.Grayscale(), 
    transforms.ToTensor(),
])
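### main loop: grab a frame, detect the hand, classify the cropped region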
while True:
    ret, frame = cap.read()
    if not ret:
        break
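    ### MediaPipe expects RGB input, while OpenCV captures BGR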
    img_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    results = hands.process(img_rgb)
    if results.multi_hand_landmarks:
        for hand_landmarks in results.multi_hand_landmarks:
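            ### convert normalized landmark coordinates to a pixel bounding box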
            h, w, _ = frame.shape
            x_list = [lm.x for lm in hand_landmarks.landmark]
            y_list = [lm.y for lm in hand_landmarks.landmark]
            xmin, xmax = int(min(x_list)*w), int(max(x_list)*w)
            ymin, ymax = int(min(y_list)*h), int(max(y_list)*h)
            ### expand the box by a small margin so the whole hand is included
            margin = 20
            xmin, ymin = max(xmin - margin, 0), max(ymin - margin, 0)
            xmax, ymax = min(xmax + margin, w), min(ymax + margin, h)
            ### ===== skip the ROI if it is too small =====
            if xmax - xmin < 50 or ymax - ymin < 50:
                continue
            ### crop the hand ROI
            hand_roi = frame[ymin:ymax, xmin:xmax]
            hand_pil = Image.fromarray(cv2.cvtColor(hand_roi, cv2.COLOR_BGR2RGB))
            input_tensor = transform(hand_pil).unsqueeze(0).to(device)
            ### CNN inference
            with torch.no_grad():
                output = model(input_tensor)
                probs = torch.softmax(output, dim=1)
                conf, pred = torch.max(probs, dim=1)
                pred, conf = pred.item(), conf.item()
                ### ===== skip low-confidence predictions =====
                if conf < 0.7:
                    continue
                ### map the label index to a letter (J is skipped in Sign Language MNIST)
                label_char = chr(pred + 65 if pred < 9 else pred + 66)
            ### ===== draw the result =====
            cv2.rectangle(frame, (xmin, ymin), (xmax, ymax), (0, 255, 0), 2)
            text = f"{label_char} ({conf*100:.1f}%)"
            cv2.putText(frame, text, (xmin, ymin - 10),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 255), 2)
            mp_draw.draw_landmarks(frame, hand_landmarks, mp_hands.HAND_CONNECTIONS)
    cv2.imshow("Hand Sign Prediction", frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):  ### press q to quit
        break
### release resources
hands.close()
cap.release()
cv2.destroyAllWindows()
Together, these pieces give the system:

Real-time recognition: MediaPipe tracks the hand precisely, while the CNN model classifies the sign.
Real-time feedback: the predicted letter and its confidence are drawn directly on the frame.
Filtering: hand regions that are too small, and predictions below the 0.7 confidence threshold, are skipped.
User-friendliness: the filters keep the whole system stable and stop the predicted letter from jumping from one guess to another between frames (a further refinement is sketched below).
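Beyond the confidence threshold used above, one more way to keep the displayed letter from flickering is temporal smoothing. The sketch below is a minimal illustration rather than part of the program above; PredictionSmoother is a helper name invented here. It keeps the last few confident predictions in a sliding window and reports the majority vote, so a single misclassified frame cannot change the on-screen letter.

from collections import Counter, deque

class PredictionSmoother:
    # Hypothetical helper (not in the program above): majority vote
    # over the last `window` confident predictions.
    def __init__(self, window=10):
        self.history = deque(maxlen=window)

    def update(self, letter):
        self.history.append(letter)
        # return the most common letter in the recent window
        return Counter(self.history).most_common(1)[0][0]

In the main loop, you would create one smoother before the while loop and display smoother.update(label_char) instead of label_char; predictions already rejected by the confidence filter never enter the window.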
The hand-sign recognition system can now not only recognize signs in real time but also use real-time feedback to improve its stability and reliability. With these designs in place, we are one step closer to a truly practical real-time gesture-interaction system.