Trying to compress a file myself: how much does it actually help? (3) Building a dictionary to compress the file

csv, json, text encoding, data compression

rex1206 2023-10-20 01:05:58 ‧ 1028 views

import csv
with open('台灣郵遞區號.csv', newline='', encoding = "UTF-8", errors='ignore') as csvfile:
    lst = [i for i in csv.reader(csvfile)]

# Join the first row (a list of fields) into a single string
string = ','.join(lst[0])

# Encode the string as a byte sequence
en = string.encode('utf-8')
# A byte has 256 possible values; 170 of them never appear in the data
not_use_bytes = [i for i in range(256) if i not in en]

# Replace the three most frequent patterns first
en = en.replace(',=,=,=,=,'.encode('utf-8'), bytes([not_use_bytes[10]]))
en = en.replace(',=,=,=,'.encode('utf-8')  , bytes([not_use_bytes[11]]))
en = en.replace(',=,=,'.encode('utf-8')    , bytes([not_use_bytes[12]]))

# Find 2,717 repeated substrings, sorted by bytes saved (longestDupSubstring is explained in the next part)
lst = longestDupSubstring(en)

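# Dictionary substitution
for i in range(10*256+157):
    if i < 157: # use 157 one-byte codes for the most valuable repeated strings
        en = en.replace(lst[i], bytes([not_use_bytes[i+13]]))
    else: # use 2,560 two-byte codes for the rest (b'\x00\x00' through b'\x09\xff')
        # If the two-byte code already occurs in en, skip the replacement.
        # This small detail cost me more than two hours of painful debugging.
        # The cause: when b'\x04\x00' was used as a code, en already contained b'\x04\x00',
        # so decoding would also replace the pre-existing b'\x04\x00' and corrupt the output.
        # My guess is that an earlier replacement happened to end in \x04
        # and the replacement right after it began with \x00.
        # Does anyone know an approach other than skipping the replacement? Comments are welcome.
        code = bytes([(i-157) // 256, (i-157) % 256])
        if code in en:
            lst[i] = code # skipped: store the code itself as this dictionary entry
        else:
            en = en.replace(lst[i], code)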

# Save the compressed data as a .bin file
with open('data.bin', 'wb') as file:
    file.write(en)

# lst has 2,717 entries; with the 3 patterns replaced up front, the dictionary holds 2,720 entries
lst = b'\xff\xff'.join([',=,=,=,=,'.encode('utf-8'),',=,=,=,'.encode('utf-8'),',=,=,'.encode('utf-8')] + lst)
with open('dictionary.bin', 'wb') as file:
    file.write(lst)

# Save the list of unused byte values
with open("not_use_bytes.txt", "w", encoding = "UTF-8") as f:
    f.write(str(not_use_bytes)[1:-1])
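
The collision described in the comments above can be reproduced on a toy input. This is only an illustration, not the original code: the sample bytes and the helper name `safe_replace` are made up, and it follows the guess stated in the comments (one code ending in \x04 sitting right before a code starting with \x00). Whether that is exactly what happened on the real file is not confirmed.

```python
def safe_replace(data, word, code):
    """Replace word with code only if code does not already occur in data.

    Hypothetical helper; it performs the same check as the loop above.
    Returns the (possibly unchanged) data and whether the code was used.
    """
    if code in data:
        return data, False
    return data.replace(word, code), True


data = b"AAABBBCCC"
data = data.replace(b"AAA", b"\x03\x04")  # this code ends in \x04
data = data.replace(b"BBB", b"\x00\x07")  # this code starts with \x00

# The two adjacent codes now spell b'\x04\x00' across their boundary.
spurious = b"\x04\x00" in data

# Using b'\x04\x00' as a code anyway: undoing it also rewrites the spurious match.
unsafe = data.replace(b"CCC", b"\x04\x00")
decoded_wrong = unsafe.replace(b"\x04\x00", b"CCC")

# With the check, the replacement is skipped and the data stays decodable.
safe, used = safe_replace(data, b"CCC", b"\x04\x00")

print(spurious, decoded_wrong != data, used)
```

Skipping is the simplest fix. The cost is that the skipped word stays uncompressed and its two-byte code goes unused.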

data.bin (708,027 bytes)
dictionary.bin (20,127 bytes)

(708,027 bytes + 20,127 bytes) / 2,915,093 bytes ≈ 25%
The compressed output is about 25% of the original file size.
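
The 25% figure only counts if the two files can be turned back into the original. No decoder is shown here, so the following is a sketch under assumptions rather than the actual method. The function names (`code_for`, `encode`, `decode`) and the `n_single` parameter are hypothetical; the offsets 10 and 13, the 157 split, and the b'\xff\xff' separator are taken from the encoder. The idea is to rebuild the code for each dictionary slot and undo the replacements in the reverse of the order they were applied.

```python
# The three patterns the encoder replaces before the dictionary loop.
INITIAL = [b",=,=,=,=,", b",=,=,=,", b",=,=,"]


def code_for(i, unused, n_single=157):
    """Code assigned to the i-th dictionary word, mirroring the encoder's loop."""
    if i < n_single:
        return bytes([unused[i + 13]])
    j = i - n_single
    return bytes([j // 256, j % 256])


def encode(text, words, unused, n_single=157):
    """Small stand-in for the encoder, used here only to exercise decode()."""
    data = text
    for k, pattern in enumerate(INITIAL):
        data = data.replace(pattern, bytes([unused[10 + k]]))
    stored = list(words)
    for i, word in enumerate(words):
        code = code_for(i, unused, n_single)
        if i >= n_single and code in data:
            stored[i] = code  # skipped slot: the entry maps to itself
        else:
            data = data.replace(word, code)
    return data, b"\xff\xff".join(INITIAL + stored)


def decode(data, dictionary, unused, n_single=157):
    """Undo the substitutions in the reverse of the order they were applied."""
    entries = dictionary.split(b"\xff\xff")
    codes = [bytes([unused[10 + k]]) for k in range(3)]
    codes += [code_for(i, unused, n_single) for i in range(len(entries) - 3)]
    for code, word in reversed(list(zip(codes, entries))):
        if code != word:  # skipped slots need no work
            data = data.replace(code, word)
    return data


text = b"x,=,=,=,=,Taipei Road,=,=,Taipei Lane,=,=,=,Road"
unused = [b for b in range(256) if b not in text]
# n_single=1 pushes the second word onto a 2-byte code in this tiny example.
data, dictionary = encode(text, [b"Taipei ", b"Road"], unused, n_single=1)
restored = decode(data, dictionary, unused, n_single=1)
print(restored == text, len(text), len(data))
```

On the real files, one would presumably read data.bin and dictionary.bin in 'rb' mode and parse not_use_bytes.txt with something like `[int(x) for x in f.read().split(',')]`. Two caveats apply: splitting on b'\xff\xff' is ambiguous if a stored entry itself ends in \xff, and a two-byte code whose bytes are equal (such as b'\x04\x04') can still decode wrongly when it lands next to the same byte.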

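Next in the series: Trying to compress a file myself: how much does it actually help? (4) A side note: how to search for identical strings quickly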