Today I reorganized the pretraining data I had downloaded and added the Baidu Baike data. The Baidu file is very large, and my machine only has 64 GB of RAM, so the original process_baidu() function in data_process.py kept failing by running out of memory. I therefore rewrote this function:
import json
import numpy as np

def process_baidu(fpath, outpath, batch_size=10000):
    # Stream the huge Baidu Baike JSONL file line by line and flush the
    # tokenized IDs to disk every `batch_size` lines, so memory usage
    # stays bounded instead of holding the whole corpus at once.
    def write_doc_ids(doc_ids, out_file):
        arr = np.array(doc_ids, dtype=np.uint16)
        out_file.write(arr.tobytes())

    cnt = 0
    doc_ids = []
    with open(fpath, 'r') as f:
        out_file = open(outpath, 'wb')
        while True:
            line = f.readline()
            if not line:
                break
            line = json.loads(line)
            text = ''
            try:
                text += line['title'] + '：' + line['summary']
            except:
                # Some entries have no title/summary.
                pass
            for per in line['sections']:
                text += per['title'] + '：' + per['content'] + '。'
            # `tokenizer` is the module-level ChatGLMTokenizer created in __main__.
            text_id = tokenizer.encode(text, add_special_tokens=False)
            text_id.append(tokenizer.special_tokens['<eos>'])
            if len(text_id) > 5:
                doc_ids += text_id
            cnt += 1
            if cnt % batch_size == 0:
                write_doc_ids(doc_ids, out_file)
                doc_ids = []
                print(cnt)
        # Write any remaining doc_ids.
        if doc_ids:
            write_doc_ids(doc_ids, out_file)
        out_file.close()
    print("Total lines processed:", cnt)
For the other functions whose names start with process_, I basically only changed them so that the input and output file paths are passed in from outside instead of being hard-coded, which makes them less prone to bugs. I then reorganized the main function, which produces four files: wiki.bin, medical_book.bin, medical_encyclopedia.bin, and baidubaike_563w.bin:
if __name__ == "__main__":
    tokenizer = ChatGLMTokenizer(vocab_file='./chatglm_tokenizer/tokenizer.model')
    process_wiki_clean(
        './data/wikipedia-cn-20230720-filtered/wikipedia-cn-20230720-filtered.json',
        './data/wiki.bin')
    process_medical(
        './data/medical/pretrain/medical_book_zh.json',
        './data/medical_book.bin')
    process_medical(
        './data/medical/pretrain/train_encyclopedia.json',
        './data/medical_encyclopedia.bin')
    process_baidu(
        './data/563w_baidubaike/563w_baidubaike.json',
        './data/baidubaike_563w.bin')
Originally, the tail end of data_process.py merges all of the .bin files into a single pretrain_data.bin, so that pretrain.py only needs one data file for training. To make experimenting easier, however, I don't run that part at all; as a result, my later pretrain.py will not use pretrain_data.bin, but will instead read the four .bin files produced above.
if __name__ == "__main__":
    ......
    data_path_list = [
        './data/wiki.bin',
        './data/medical_book.bin',
        './data/medical_encyclopedia.bin',
        './data/baidubaike_563w.bin'
    ]
    data_lst = []
    for data_path in data_path_list:
        with open(data_path, 'rb') as f:
            data = np.fromfile(f, dtype=np.uint16)
            data_lst.append(data)
    arr = np.concatenate(data_lst)
    print(arr.shape)
    with open('./data/pretrain_data.bin', 'wb') as f:
        f.write(arr.tobytes())
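Since pretrain.py will read the four .bin files directly, the data-loading side only needs to accept a list of paths. Below is a minimal sketch of what that could look like, assuming a simple PyTorch Dataset that memory-maps each file and cuts it into fixed-length samples; the class name BinTokenDataset and the max_length parameter are illustrative, not the repo's actual dataset code:

import numpy as np
import torch
from torch.utils.data import Dataset

class BinTokenDataset(Dataset):
    # Hypothetical helper: memory-maps each .bin file of uint16 token IDs
    # and serves fixed-length samples without loading the files into RAM.
    def __init__(self, data_path_list, max_length=512):
        self.max_length = max_length
        self.arrays = [np.memmap(p, dtype=np.uint16, mode='r') for p in data_path_list]
        # Number of full max_length chunks available in each file.
        self.chunks_per_file = [len(a) // max_length for a in self.arrays]

    def __len__(self):
        return sum(self.chunks_per_file)

    def __getitem__(self, idx):
        # Map the flat index to a (file, chunk) pair.
        for arr, n_chunks in zip(self.arrays, self.chunks_per_file):
            if idx < n_chunks:
                start = idx * self.max_length
                chunk = np.asarray(arr[start:start + self.max_length], dtype=np.int64)
                return torch.from_numpy(chunk)
            idx -= n_chunks
        raise IndexError(idx)

dataset = BinTokenDataset([
    './data/wiki.bin',
    './data/medical_book.bin',
    './data/medical_encyclopedia.bin',
    './data/baidubaike_563w.bin',
])

A DataLoader over this dataset can then feed each max_length-long slice of token IDs to the training loop.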