Today we finally get back to writing code, continuing with the poem_sentiment dataset from yesterday. Open the Jupyter Notebook in your Azure Machine Learning workspace!
First, install Hugging Face's datasets library (in a notebook cell, prefix the command with !):

pip install datasets
load_dataset_builder lets you inspect a dataset's metadata without actually downloading the data:

from datasets import load_dataset_builder

ds_builder = load_dataset_builder("poem_sentiment")
print(ds_builder.info.description)
print(ds_builder.info.features)
Next, download the dataset itself with load_dataset:

from datasets import load_dataset

sentiment = load_dataset("poem_sentiment")
sentiment
You will see the following output:
DatasetDict({
    train: Dataset({
        features: ['id', 'verse_text', 'label'],
        num_rows: 892
    })
    validation: Dataset({
        features: ['id', 'verse_text', 'label'],
        num_rows: 105
    })
    test: Dataset({
        features: ['id', 'verse_text', 'label'],
        num_rows: 104
    })
})
train_ds = sentiment["train"]
valid_ds = sentiment["validation"]
test_ds = sentiment["test"]
If you only need one split, you can pass the split argument to load_dataset directly; for example, with the rotten_tomatoes dataset:

dataset_train = load_dataset("rotten_tomatoes", split="train")
A Dataset can also hand you data as a pandas DataFrame. Switch the output format with set_format:

import pandas as pd

sentiment.set_format(type="pandas")
df = sentiment["train"][:]
df.head(10)
The label column is a ClassLabel feature, so we can use its int2str method to turn the integer labels back into text:

def label_int2str(row):
    return sentiment["train"].features["label"].int2str(row)

df["label_name"] = df["label"].apply(label_int2str)
df.head(10)
Now plot the class distribution with matplotlib:

import matplotlib.pyplot as plt
df["label_name"].value_counts().plot.barh()
plt.title("Poem Classes")
plt.show()
When you are done with pandas-style output, switch back to the default format:

sentiment.reset_format()
# You can also turn a DataFrame processed with pandas into a new dataset
from datasets import Dataset
label_name_dataset = Dataset.from_pandas(df)
label_name_dataset
You can also shuffle and sample. Here we shuffle with a fixed seed and keep the first 100 rows:

sentiment_train = sentiment["train"].shuffle(seed=5566).select(range(100))
filter keeps only the rows that satisfy a condition; here, verses longer than 30 characters:

sentiment_filtered = sentiment.filter(lambda x: len(x["verse_text"]) > 30)
sentiment_filtered
Finally there is map with batched=True, which we will meet again later. Here it replaces each verse_text with its length:

new_dataset = sentiment.map(
    lambda x: {"verse_text": [len(o) for o in x["verse_text"]]}, batched=True
)
new_dataset['test'][:3]
That wraps up today's basic operations for fetching a dataset from the Hub! Tomorrow we'll look at how to load your own dataset.