[Python OCR Handbook] Converting Images to Text Made Easy

python python3 tesseract pytesseract ocr

Enoxs 2022-01-22 22:18:03 ‧ 28794 views


OCR x Pytesseract

Preface

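In Python, using OCR (Optical Character Recognition) to convert the contents of an image into plain text is very simple.

Once the required software and Python packages are installed, the program is ready to run.

This document records the pitfalls I ran into along the way, so that developers who want to look into this later can get started quickly.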

Installation guide and sample code

[Installation document]

https://gitlab.com/GammaRayStudio/DevDoc/-/blob/master/Python/004.PythonOCR.md

[Sample code]

https://gitlab.com/GammaRayStudio/Program/PythonStudio/SE/PythonOCR

Sample images

Conversion targets

English

Image

001

Text

English
Gamma Ray Studio
English Text
Text Text Text ~ !!!

Traditional Chinese

Image

002

Text

繁體中文
Gamma Ray 軟體工作室
中文 文字
文字 文字 文字 ~ !!!

Simplified Chinese

Image

003

Text

简体中文
Gamma Ray 软体工作室
中文 文字
文字 文字 文字 ~ !!!

Install Tesseract

Win

https://github.com/UB-Mannheim/tesseract/wiki

Environment variable

win-path

Mac

brew install tesseract

Linux

apt-get install tesseract-ocr

Verify

tesseract -v

004
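The same check can be run from Python, which is handy when a script needs to fail early with a clear message. The following is a minimal sketch using only the standard library; the function name `tesseract_version` is an illustrative assumption and not part of Tesseract or pytesseract.

```python
import shutil
import subprocess


def tesseract_version(executable="tesseract"):
    """Return the first line of `tesseract -v`, or None if it is not on PATH."""
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run([path, "-v"], capture_output=True, text=True, check=False)
    # Some builds print the version to stderr rather than stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    version = tesseract_version()
    print(version or "tesseract not found on PATH; check the environment variable")
```

If this prints the "not found" message on Windows, the install folder is most likely missing from the PATH environment variable.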

Python environment

Python version

python -V

Python 3.8.5

PyPI

Pillow
pytesseract

pip3 install Pillow
pip3 install pytesseract

Python example

from PIL import Image
import pytesseract
img_name = './001.en-us.png'
img = Image.open(img_name)
text = pytesseract.image_to_string(img, lang='eng')
print(text)

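PIL : image handling, provided by Pillow
pytesseract : the OCR module, Pytesseract
img_name = './001.en-us.png' : path to the image
img = Image.open(img_name) : load the image
text = pytesseract.image_to_string(img, lang='eng') : convert the image to text using the English language data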

Output

005

English

Gamma Ray Studio
English Text

Text Text Text ~ 11!

The exclamation marks were recognized as the digit 1, but most of the text was recognized correctly.

Chinese recognition

from PIL import Image
import pytesseract
img_name = './002.zh-cht.png'
img = Image.open(img_name)
text = pytesseract.image_to_string(img, lang='eng')
print(text)

img_name = './002.zh-cht.png' : switch the loaded image to the Chinese one

Output

006

SRE
Gamma Ray BREA TER

FX XF

XF XF XF~

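At this stage the Chinese text comes out as unrecognizable characters, because the call still uses lang='eng'.
Adding language data files enables recognition of more languages.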

Download language data

GitHub - Tessdata

https://github.com/tesseract-ocr/tessdata_best

English

eng.traineddata

https://github.com/tesseract-ocr/tessdata_best/blob/main/eng.traineddata

Traditional Chinese

chi_tra.traineddata

https://github.com/tesseract-ocr/tessdata_best/blob/main/chi_tra.traineddata

Simplified Chinese

chi_sim.traineddata

https://github.com/tesseract-ocr/tessdata_best/blob/main/chi_sim.traineddata
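The three files can also be fetched from a script instead of through the browser. The sketch below is an assumption-laden illustration: it assumes GitHub serves the raw file when `blob` in the links above is replaced by `raw`, and the helper names `traineddata_url` and `download_traineddata` are hypothetical.

```python
import urllib.request
from pathlib import Path

# Assumption: the raw file is served at /raw/main/<lang>.traineddata for this repo.
BASE_URL = "https://github.com/tesseract-ocr/tessdata_best/raw/main"


def traineddata_url(lang):
    """Build the download URL for one language code, e.g. 'chi_tra'."""
    return f"{BASE_URL}/{lang}.traineddata"


def download_traineddata(lang, dest_dir):
    """Download <lang>.traineddata into dest_dir unless it is already there."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / f"{lang}.traineddata"
    if not target.exists():  # skip files that are already present
        urllib.request.urlretrieve(traineddata_url(lang), target)
    return target
```

For example, `download_traineddata("chi_tra", "./tessdata")` would place the Traditional Chinese file in a local `tessdata` folder. The files are fairly large, so the existence check avoids downloading them twice.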

Default path

Using Mac as an example

Tesseract program path

/usr/local/Cellar/tesseract

Language pack path

/usr/local/Cellar/tesseract/4.1.3/share/tessdata

![007](https://lh3.googleusercontent.com/-Okelzx901cs/YewQOZ-9rJI/AAAAAAAADzs/WtImA_gIQzklJLe8lHKAkMthtIsYyIFcgCNcBGAsYHQ/s16000/007.language-package.png)

After downloading a language pack, place it in the /share/tessdata path inside the program folder and it takes effect.

Configure the environment variable

Standalone folder
008

Win

C:\DevTools\tessdata

Environment variable

009-2

Mac

/Users/Enoxs/DevTools/tessdata

.zprofile

# TESSDATA
export TESSDATA_PREFIX=/Users/Enoxs/DevTools/tessdata

Linux

.bash_profile

# TESSDATA
export TESSDATA_PREFIX=/Users/enoxs/DevTools/tessdata
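To confirm that the variable is visible to Python and that the folder really contains the language files, a quick check like the one below may help. It is a minimal sketch using only the standard library; `installed_languages` is a hypothetical helper name, and it simply lists the `.traineddata` files in the folder that `TESSDATA_PREFIX` points to.

```python
import os
from pathlib import Path


def installed_languages(tessdata_dir=None):
    """List language codes that have a .traineddata file in the tessdata folder."""
    # Fall back to the TESSDATA_PREFIX environment variable when no folder is given.
    tessdata_dir = tessdata_dir or os.environ.get("TESSDATA_PREFIX")
    if not tessdata_dir:
        return []
    folder = Path(tessdata_dir)
    if not folder.is_dir():
        return []
    return sorted(p.stem for p in folder.glob("*.traineddata"))


if __name__ == "__main__":
    print("TESSDATA_PREFIX =", os.environ.get("TESSDATA_PREFIX"))
    print("languages:", installed_languages())
```

An empty list usually means either the variable is not set in the current shell session or the files were saved somewhere else.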

Adjusting the language parameter

from PIL import Image
import pytesseract
img_name = './002.zh-cht.png'
img = Image.open(img_name)
text = pytesseract.image_to_string(img, lang='chi_tra+eng')
print(text)

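img_name = './002.zh-cht.png' : switch the loaded image to the Traditional Chinese one
text = pytesseract.image_to_string(img, lang='chi_tra+eng') : convert the image to text using Traditional Chinese and English
- English : eng
- Traditional Chinese : chi_tra
- Simplified Chinese : chi_sim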
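Since the `lang` value is just language codes joined with `+`, it can be assembled from a list rather than typed by hand. The helper below is a small illustrative sketch; `build_lang` is a hypothetical name, not a pytesseract function.

```python
def build_lang(*codes):
    """Join Tesseract language codes with '+', dropping blanks and duplicates."""
    seen = []
    for code in codes:
        code = code.strip()
        if code and code not in seen:
            seen.append(code)  # keep the order in which codes were given
    if not seen:
        raise ValueError("at least one language code is required")
    return "+".join(seen)


if __name__ == "__main__":
    print(build_lang("chi_tra", "eng"))
```

The result can be passed directly as the `lang` argument, for example `pytesseract.image_to_string(img, lang=build_lang("chi_tra", "eng"))`.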

The code I use day to day

No code review, free style

Running ocr.py
converts the images in the image folder to text
and saves the results in the text folder, named after the original files.

010

import os
from os import listdir
from os.path import isfile, join, splitext

from PIL import Image
import pytesseract


def ocrText(fileName):
    # Load the image and run OCR with English plus both Chinese variants
    with Image.open(fileName) as img:
        # text = pytesseract.image_to_string(img, lang='eng')
        # text = pytesseract.image_to_string(img, lang='eng+chi_tra')
        text = pytesseract.image_to_string(img, lang='eng+chi_tra+chi_sim')
    return text


def replaceText(text):
    # Swap ASCII commas for full-width commas, then strip all spaces
    text = text.replace(",", "，")
    return text.replace(" ", "")


def save(fileName, text):
    print("text.length => ", len(text))
    # The with block closes the file automatically
    with open(fileName, 'w', encoding='UTF-8') as f:
        f.write(text)


def main():
    image_dir = join('.', 'image')
    text_dir = join('.', 'text')
    # Make sure the output folder exists before writing into it
    os.makedirs(text_dir, exist_ok=True)
    lstFile = [f for f in listdir(image_dir) if isfile(join(image_dir, f))]

    for f in lstFile:
        if f.endswith('.png'):
            # Drop the .png extension to get the output file name
            out_name = splitext(f)[0]
            print(out_name)
            text = ocrText(join(image_dir, f))
            # text = replaceText(text)
            save(join(text_dir, out_name + '.txt'), text)


if __name__ == "__main__":
    main()

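ocrText() : convert the target image to text
replaceText() : replace specific text as needed
save() : save the text to the given path
main() : the overall flow
1. Get the names of all files in the image folder
2. Use a for-in loop to convert each image and derive its output file name
3. After conversion, save the text to the text folder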
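The file-name handling in steps 1 and 2 can be sketched on its own with `pathlib`, separate from the OCR call. This is an illustrative variation rather than the script above; the function name `plan_outputs` and its parameters are assumptions made for the example.

```python
from pathlib import Path


def plan_outputs(image_dir, text_dir, suffix=".png"):
    """Pair every image in image_dir with the .txt path it should be saved to."""
    image_dir, text_dir = Path(image_dir), Path(text_dir)
    pairs = []
    for image_path in sorted(image_dir.iterdir()):
        # Only regular files with the wanted extension are converted.
        if image_path.is_file() and image_path.suffix.lower() == suffix:
            # stem strips only the final extension: 001.en-us.png -> 001.en-us
            pairs.append((image_path, text_dir / (image_path.stem + ".txt")))
    return pairs
```

Each pair can then be fed to the OCR and save steps, which keeps the naming logic easy to check without running Tesseract.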


1 comment

arguskao

iT邦新手 Level 3 ‧ 2023-06-15 13:41:22

The results are not very good. I wonder whether that is because it is a free module.
