[Day 26] Review and Improvement: Adding Training Samples (Part 1)

2021 iThome Ironman Contest

DAY 26

AI & Data

Handwritten Chinese Character Image Recognition series, part 26

13th Ironman Contest

Ethan Chen

2021-10-11 23:53:56

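Outline

Introduction
Workflow
Open-source handwritten Chinese character datasets
Blank background images
Filtering to the official 800 characters

Content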

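Introduction

1.1 From the discussions after the competition, we learned that the winning teams focused on the dataset itself rather than on designing or adopting newer, more powerful model architectures. Their talks mentioned "correcting mislabeled samples, adding a large number of realistic training samples, mitigating class imbalance, and so on."

1.2 Looking back, we really had not put much effort into these areas. Over the next few days we will therefore try these techniques hands-on.

Workflow (today covers 2.1 to 2.3)

2.1 Open-source handwritten Chinese character datasets

2.2 Blank background images

2.3 Filtering to the official 800 characters

2.4 Compositing a new training set with OpenCV

Open-source handwritten Chinese character datasets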

3.1 AI-FREE-Team/Traditional-Chinese-Handwriting-Dataset

3.2 kirosc/chinese-calligraphy-dataset

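Blank background images

4.1 The official dataset contains quite a few images that show only a blank background, as in the figure below.

4.2 In [Day 4] Data Preprocessing: Classifying and Cropping Images, we used a trained YOLOv4 model to draw bounding boxes around Chinese characters, which tells us how many characters were detected in each image.

4.3 We sort the images by the number of detected characters and pull out the ones with no characters (no_word).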

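import shutil

# Copy an image into a folder chosen by the number of detected characters.
# l, m, n are the counts of images with 0, 1, and 2+ characters; this function
# only prints them, so they are assumed to be updated by the caller.
def copyClassify(file, input, boxes, file_name, l, m, n):
    box_num = len(boxes)
    if box_num == 0:
        shutil.copy2(input, './02_yolo_classify3/03_no_word/{}'.format(file_name))
        print('※{} copied to no_word'.format(file))
    elif box_num == 1:
        shutil.copy2(input, './02_yolo_classify3/01_word/{}'.format(file_name))
        print('※{} copied to word'.format(file))
    else:
        shutil.copy2(input, './02_yolo_classify3/02_words/{}'.format(file_name))
        print('※{} copied to words'.format(file))
    print('  No characters: {} images'.format(l))
    print('  1 character: {} images'.format(m))
    print('  2 or more characters: {} images'.format(n))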
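
As a minimal sketch of the rule in 4.3, the folder choice can be separated from the file copying so that the counts are tallied in one place. The helper names `folder_for` and `tally` and the sample detections are hypothetical; only the folder names come from the code above, and the real YOLOv4 output format may differ.

```python
from collections import Counter

# Folder names taken from copyClassify above.
NO_WORD, ONE_WORD, MULTI_WORDS = '03_no_word', '01_word', '02_words'

def folder_for(box_num):
    """Map the number of detected characters to a destination folder."""
    if box_num == 0:
        return NO_WORD
    if box_num == 1:
        return ONE_WORD
    return MULTI_WORDS

def tally(boxes_per_image):
    """Count how many images fall into each folder.

    boxes_per_image maps a file name to its list of detected boxes
    (a hypothetical structure used only for this illustration).
    """
    return Counter(folder_for(len(boxes)) for boxes in boxes_per_image.values())

# Hypothetical detections for three images.
sample = {
    'a.jpg': [],
    'b.jpg': [(0, 0, 10, 10)],
    'c.jpg': [(0, 0, 5, 5), (6, 0, 5, 5)],
}
counts = tally(sample)
for folder, count in sorted(counts.items()):
    print(folder, count)
```

Keeping the decision in a small pure function makes it easy to check the split before any files are copied.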

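4.4 In the end we obtained about 400 no_word images, as in the figure below.

Filtering to the official 800 characters

5.1 The two open-source handwritten Chinese character datasets contain 265,249 images in total. They include characters outside the official 800, so they must be filtered first.

5.2 Code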

import os
import shutil

# Read the dictionary txt file (one character per line)
def read_dicts(path):
    with open(path, 'rt', encoding='utf-8') as file1:
        words = file1.read().splitlines()
    return words

# Check whether each file belongs to the dictionary,
# based on the first character of its file name
def check_in_dicts(source, words):
    files = os.listdir(source)
    move_record = []
    print('※Checking which files are in the dictionary...')
    for file in files:
        if file[0] in words:
            print('{} is in the dictionary'.format(file))
        else:
            print('{} is not in the dictionary'.format(file))
            move_record.append(file)
    print('=' * 50)
    print('※Check finished')
    print('=' * 50)
    return move_record

# Move the recorded files to the destination folder
def move_to_des(move_record, source, destination):
    # Create the destination first so shutil.move places files inside it
    os.makedirs(destination, exist_ok=True)
    print('※Moving files to the destination folder')
    for move_it in move_record:
        shutil.move(os.path.join(source, move_it), destination)
        print('{} moved to folder: other characters'.format(move_it))
    print('=' * 50)
    print('※Move finished')

if __name__ == '__main__':
    # training data dic.txt
    dics = './data/training data dic.txt'
    # Folder to be checked
    source = './base/'
    # Destination folder
    destination = './out800/'

    # Run the task
    words = read_dicts(dics)
    move_record = check_in_dicts(source, words)
    move_to_des(move_record, source, destination)
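
Because `shutil.move` changes the dataset on disk, it can help to preview the split first. The sketch below is a hypothetical dry run: `split_by_dictionary` is not part of the original script, the file names and the three-character dictionary are made up, and it assumes, as the script above does, that each file name begins with the character it depicts.

```python
def split_by_dictionary(file_names, words):
    """Partition file names by whether their first character is in the dictionary.

    Nothing is moved on disk; the function only returns two lists.
    """
    # Ignore empty entries such as a trailing blank line in the txt file.
    allowed = {w for w in words if w}
    keep, move = [], []
    for name in file_names:
        if name[:1] in allowed:
            keep.append(name)
        else:
            move.append(name)
    return keep, move

# Hypothetical file names and dictionary, for illustration only.
keep, move = split_by_dictionary(
    ['人_1.png', '龘_1.png', '大_2.png'],
    ['人', '大', '小', ''],
)
print(len(keep), 'kept;', len(move), 'to move')
```

If the previewed counts look wrong (for example, nearly every file lands in `move`), the file naming or the dictionary encoding is worth checking before running the real move.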

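5.3 Results

Before filtering
After filtering

Summary

With the handwritten characters within the official 800 and the blank backgrounds in hand, the goal for the next chapter is to composite new training samples with OpenCV.

Let's keep going...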

References

AI-FREE-Team/Traditional-Chinese-Handwriting-Dataset
- This dataset was developed by AI . FREE Team as a derivative of [STUST EECS_Chinese MNIST(總集)]. If you use, modify, or share it, please cite the source and this notice.
- (source: https://github.com/AI-FREE-Team/Traditional-Chinese-Handwriting-Dataset )
kirosc/chinese-calligraphy-dataset