It supports the following data formats:
Data format | Loading script | Example |
---|---|---|
CSV & TSV | csv | load_dataset("csv", data_files="my_file.csv") |
Text files | text | load_dataset("text", data_files="my_file.txt") |
JSON & JSON Lines | json | load_dataset("json", data_files="my_file.jsonl") |
Pickled DataFrames | pandas | load_dataset("pandas", data_files="my_dataframe.pkl") |
Table taken from the official Hugging Face documentation.
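The mapping in the table can be sketched as a small lookup from file extension to loading script name. This is only an illustration: `guess_loader` and `LOADER_BY_EXTENSION` are hypothetical names that are not part of the `datasets` library, and the list of extensions is an assumption based on the example file names in the table.

```python
from pathlib import Path

# Extension -> loading script name, following the table above.
# The extension list itself is an assumption made for illustration.
LOADER_BY_EXTENSION = {
    ".csv": "csv",
    ".tsv": "csv",
    ".txt": "text",
    ".json": "json",
    ".jsonl": "json",
    ".pkl": "pandas",
}


def guess_loader(path: str) -> str:
    """Return the loading script name for a file path (hypothetical helper)."""
    suffix = Path(path).suffix.lower()
    if suffix not in LOADER_BY_EXTENSION:
        raise ValueError(f"No loading script known for {suffix!r}")
    return LOADER_BY_EXTENSION[suffix]


print(guess_loader("my_file.jsonl"))
print(guess_loader("my_dataframe.pkl"))
```

The returned name is what would be passed as the first argument to `load_dataset`, with the path going to `data_files`.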
You only need to pass the format name and the file path or URL as arguments. One point worth clarifying here is how JSON and JSON Lines differ.
A regular JSON file holds a single document, which can be nested and span many lines:
```json
{
  "user": {
    "id": 1,
    "name": "John Doe",
    "email": "john.doe@example.com",
    "isStudent": true,
    "courses": [
      {
        "id": 101,
        "title": "Introduction to Programming",
        "instructor": "Jane Smith"
      },
      {
        "id": 102,
        "title": "Data Structures and Algorithms",
        "instructor": "Tom Brown"
      }
    ]
  }
}
```
A JSON Lines file holds one complete JSON object per line, so each line is a separate record:
```json
{"id": "2834", "tokens": ["星", "巴", "克", "小", "圓", "零", "錢", "包"], "ner_tags": ["B-BRAND", "I-BRAND", "I-BRAND", "O", "O", "B-ITEM", "I-ITEM", "I-ITEM"]}
{"id": "4516", "tokens": ["e", "x", "c", "e", "l", " ", "漸", "層", "魅", "色", "腮", "紅"], "ner_tags": ["B-BRAND", "I-BRAND", "I-BRAND", "I-BRAND", "I-BRAND", "O", "O", "O", "O", "O", "B-ITEM", "I-ITEM"]}
{"id": "8103", "tokens": ["m", "e", "k", "o", "魔", "翹", "美", "型", "纖", "長", "睫", "毛", "膏"], "ner_tags": ["B-BRAND", "I-BRAND", "I-BRAND", "I-BRAND", "O", "O", "O", "O", "O", "O", "B-ITEM", "I-ITEM", "I-ITEM"]}
```
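The practical difference shows up when parsing. The sketch below uses only Python's standard `json` module, with shortened stand-ins for the two samples above: the JSON text is parsed in one call, while the JSON Lines text is parsed line by line.

```python
import json

# A regular JSON document: the whole text is one value and may span many lines.
json_text = """
{
  "user": {
    "id": 1,
    "courses": [{"id": 101}, {"id": 102}]
  }
}
"""
document = json.loads(json_text)

# JSON Lines: every non-empty line is a complete JSON value on its own.
jsonl_text = (
    '{"id": "2834", "tokens": ["星", "巴", "克"]}\n'
    '{"id": "4516", "tokens": ["e", "x", "c", "e", "l"]}\n'
)
records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

print(len(document["user"]["courses"]))
print([record["id"] for record in records])
```

Passing the whole JSON Lines text to a single `json.loads` call fails, because the parser finds extra data after the first object.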
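```python
from datasets import load_dataset

# Use the raw-file URL: a github.com/.../blob/... address returns an HTML page, not the text file
dataset_url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
text_dataset = load_dataset("text", data_files=dataset_url)
print(text_dataset["train"][:5])
```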
Here the format is specified as `text`, and the remote file is then passed to `load_dataset` as a URL.
```
{
    'text': ['First Citizen:',
             'Before we proceed any further, hear me speak.',
             '',
             'All:',
             'Speak, speak.']
}
```
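When a file lives on GitHub, the URL passed to `load_dataset` needs to point at the raw file content. A page URL containing `/blob/` serves an HTML viewer instead. As a minimal sketch, the conversion can be done with the standard library. `github_blob_to_raw` is a hypothetical helper, and it assumes the branch or tag name contains no slash.

```python
from urllib.parse import urlparse


def github_blob_to_raw(url: str) -> str:
    """Convert a github.com '/blob/' page URL to its raw-content URL (hypothetical helper)."""
    parsed = urlparse(url)
    parts = parsed.path.strip("/").split("/")
    # Expected shape: owner / repo / "blob" / ref / path...
    if parsed.netloc != "github.com" or len(parts) < 5 or parts[2] != "blob":
        return url  # leave anything else unchanged
    owner, repo, _, ref, *file_parts = parts
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{ref}/{'/'.join(file_parts)}"


page_url = "https://github.com/karpathy/char-rnn/blob/master/data/tinyshakespeare/input.txt"
print(github_blob_to_raw(page_url))
```

URLs that do not match the `/blob/` shape are returned unchanged, so the helper is safe to apply to addresses that are already raw.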
```python
from datasets import load_dataset

# Remote, gzip-compressed JSON file; field="data" selects the key that holds the records
dataset_url = "https://github.com/crux82/squad-it/raw/master/SQuAD_it-train.json.gz"
squad_it_dataset = load_dataset("json", data_files=dataset_url, field="data")
print(squad_it_dataset)
```
```
DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
})
```
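The `field="data"` argument matters because a SQuAD-style file keeps its records under a top-level `data` key rather than at the root. The sketch below shows roughly what is being selected, using only the standard `json` module. The tiny document is made up for illustration (including the `version` key), and this is not how the library implements the argument internally.

```python
import json

# A tiny stand-in for a SQuAD-style file; the contents are made up for illustration.
raw = json.loads("""
{
  "version": "1.1",
  "data": [
    {"title": "Article A", "paragraphs": []},
    {"title": "Article B", "paragraphs": []}
  ]
}
""")

# Selecting the "data" field yields one record per article.
records = raw["data"]
print(len(records))
print(sorted(records[0].keys()))
```

Each record carries the `title` and `paragraphs` keys, which correspond to the `features` listed in the printed `DatasetDict`.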
```python
from datasets import load_dataset

# Load a local JSON file
squad_it_dataset = load_dataset("json", data_files="SQuAD_it-test.json", field="data")
print(squad_it_dataset)
```
```
DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 48
    })
})
```
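By default, loading a single file puts everything into a `train` split, even when the file is actually the test set. What we really want is one `DatasetDict` object containing both `train` and `test` splits, built from `SQuAD_it-train.json` and `SQuAD_it-test.json`, so that `Dataset.map()` functions can be applied to the training and test sets at the same time. To get this, we pass a dictionary as the `data_files` argument that maps each split name to the file associated with that split.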
```python
from datasets import load_dataset

# Map each split name to the file that holds that split
data_files = {"train": "SQuAD_it-train.json", "test": "SQuAD_it-test.json"}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
print(squad_it_dataset)
```
```
DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
    test: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 48
    })
})
```
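This is exactly the data we need. From here we can apply various preprocessing techniques to clean the data, tokenize the reviews, and so on.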
```python
# A validation split is mapped in the same way as train and test
data_files = {
    "train": "train.json",
    "test": "test.json",
    "validation": "validation.json",
}
```
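A split-to-file dictionary like this can also be assembled programmatically. The following is a minimal sketch, assuming the files are named `<split>.json` and sit in one directory. `build_data_files` is a hypothetical helper, not part of the `datasets` library.

```python
import tempfile
from pathlib import Path


def build_data_files(data_dir: str, splits=("train", "validation", "test")) -> dict:
    """Map each split name to '<split>.json' in data_dir, skipping missing files (hypothetical helper)."""
    data_files = {}
    for split in splits:
        candidate = Path(data_dir) / f"{split}.json"
        if candidate.is_file():
            data_files[split] = str(candidate)
    return data_files


# Usage with a throwaway directory holding only train and test files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("train.json", "test.json"):
        (Path(tmp) / name).write_text("[]", encoding="utf-8")
    found = build_data_files(tmp)
    print(sorted(found))
```

Skipping missing files means a project without a validation set gets a dictionary with only the splits that exist, which can then be passed as `data_files`.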