We used two models to generate summaries, so is there a way to evaluate how good those summaries are? There are two common metrics for evaluating summaries: one is BLEU, the other is ROUGE.
BLEU is a precision-based metric: when comparing two texts, we count the words in the generated summary that also appear in the reference text, then divide by the length of the generated summary. The problem is that a generated summary could simply repeat the same word over and over, as long as that word also appears in the reference; if it repeats the word as many times as the reference is long, we get perfect precision!
For example, if the reference is "I have a pen, I have an apple." and the generated summary is "have, have, have, have, have, have, have, have", the clipped precision is 2/8, because "have" only occurs twice in the reference. If the reference contained eight "have"s, the score would be 8/8. The weakness of BLEU is obvious here, which is why it is rarely used to evaluate generated summaries nowadays.
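To make that concrete, here is a minimal sketch of clipped unigram precision. The clipped_precision helper below is just for illustration and is not the official BLEU implementation, which also combines higher-order n-grams and a brevity penalty:

from collections import Counter

def clipped_precision(candidate, reference):
    # Count each candidate word, but clip its count at the number of
    # times that word appears in the reference.
    cand_counts = Counter(candidate.split())
    ref_counts = Counter(reference.split())
    clipped = sum(min(n, ref_counts[word]) for word, n in cand_counts.items())
    return clipped / sum(cand_counts.values())

print(clipped_precision("have have have have have have have have",
                        "I have a pen , I have an apple ."))  # 2/8 = 0.25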
ROUGE is the more popular metric these days. Its core idea is that, for summarization, recall matters more than precision. If you are not sure what recall is, see my earlier material explaining the confusion matrix.
ROUGE comes in several variants. Briefly: ROUGE-1 counts overlapping unigrams (single words), ROUGE-2 counts overlapping bigrams, ROUGE-L is based on the longest common subsequence between the two texts, and ROUGE-Lsum computes ROUGE-L sentence by sentence over the whole summary.
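To make the recall idea concrete, here is a tiny hand-rolled sketch of ROUGE-1 recall (overlapping unigrams divided by the reference length); this is a simplification for illustration, not the official implementation:

from collections import Counter

def rouge1_recall(candidate, reference):
    # Overlap = per-word minimum of the candidate and reference counts,
    # divided by the total number of words in the reference.
    cand_counts = Counter(candidate.split())
    ref_counts = Counter(reference.split())
    overlap = sum(min(cand_counts[word], n) for word, n in ref_counts.items())
    return overlap / sum(ref_counts.values())

print(rouge1_recall("the cat sat", "the cat sat on the mat"))  # 3/6 = 0.5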
For more detail you can read the original paper; for now, let's start writing some code!
If you have not installed it yet, first run:

pip install rouge_score

Then load the metric with the datasets library:

from datasets import load_metric
rouge_metric = load_metric("rouge")
scores = rouge_metric.compute(
predictions=[paragraph_result_T5], references=[input_text]
)
print(scores)
You will get (with a single prediction there is nothing to bootstrap over, so the low, mid, and high estimates are identical):
{
"rouge1":AggregateScore(low=Score(precision=0.918918918918919,
recall=0.1691542288557214,
fmeasure=0.2857142857142857),
mid=Score(precision=0.918918918918919,
recall=0.1691542288557214,
fmeasure=0.2857142857142857),
high=Score(precision=0.918918918918919,
recall=0.1691542288557214,
fmeasure=0.2857142857142857)),
"rouge2":AggregateScore(low=Score(precision=0.6666666666666666,
recall=0.12,
fmeasure=0.20338983050847456),
mid=Score(precision=0.6666666666666666,
recall=0.12,
fmeasure=0.20338983050847456),
high=Score(precision=0.6666666666666666,
recall=0.12,
fmeasure=0.20338983050847456)),
"rougeL":AggregateScore(low=Score(precision=0.8378378378378378,
recall=0.15422885572139303,
fmeasure=0.26050420168067223),
mid=Score(precision=0.8378378378378378,
recall=0.15422885572139303,
fmeasure=0.26050420168067223),
high=Score(precision=0.8378378378378378,
recall=0.15422885572139303,
fmeasure=0.26050420168067223)),
"rougeLsum":AggregateScore(low=Score(precision=0.918918918918919,
recall=0.1691542288557214,
fmeasure=0.2857142857142857),
mid=Score(precision=0.918918918918919,
recall=0.1691542288557214,
fmeasure=0.2857142857142857),
high=Score(precision=0.918918918918919,
recall=0.1691542288557214,
fmeasure=0.2857142857142857))
}
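Each entry is an AggregateScore, so you can pull out one number per variant for easier reading, e.g. the mid F-measures:

# Print the mid (point-estimate) F-measure of each ROUGE variant.
for metric_name, aggregate in scores.items():
    print(metric_name, round(aggregate.mid.fmeasure, 4))

Now compute the same scores for the Pegasus summary: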
scores = rouge_metric.compute(
predictions=[paragraph_result_pegasus], references=[input_text]
)
scores
You will get:
{
"rouge1":AggregateScore(low=Score(precision=1.0,
recall=0.1890547263681592,
fmeasure=0.3179916317991632),
mid=Score(precision=1.0,
recall=0.1890547263681592,
fmeasure=0.3179916317991632),
high=Score(precision=1.0,
recall=0.1890547263681592,
fmeasure=0.3179916317991632)),
"rouge2":AggregateScore(low=Score(precision=0.9459459459459459,
recall=0.175,
fmeasure=0.29535864978902954),
mid=Score(precision=0.9459459459459459,
recall=0.175,
fmeasure=0.29535864978902954),
high=Score(precision=0.9459459459459459,
recall=0.175,
fmeasure=0.29535864978902954)),
"rougeL":AggregateScore(low=Score(precision=1.0,
recall=0.1890547263681592,
fmeasure=0.3179916317991632),
mid=Score(precision=1.0,
recall=0.1890547263681592,
fmeasure=0.3179916317991632),
high=Score(precision=1.0,
recall=0.1890547263681592,
fmeasure=0.3179916317991632)),
"rougeLsum":AggregateScore(low=Score(precision=1.0,
recall=0.1890547263681592,
fmeasure=0.3179916317991632),
mid=Score(precision=1.0,
recall=0.1890547263681592,
fmeasure=0.3179916317991632),
high=Score(precision=1.0,
recall=0.1890547263681592,
fmeasure=0.3179916317991632))
}
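To compare the two models side by side, you could keep both results under separate names and print the mid F-measures together (a sketch; scores_T5 and scores_pegasus are hypothetical names for the two results above):

scores_T5 = rouge_metric.compute(predictions=[paragraph_result_T5], references=[input_text])
scores_pegasus = rouge_metric.compute(predictions=[paragraph_result_pegasus], references=[input_text])

for metric_name in ["rouge1", "rouge2", "rougeL", "rougeLsum"]:
    print(metric_name,
          "T5:", round(scores_T5[metric_name].mid.fmeasure, 4),
          "Pegasus:", round(scores_pegasus[metric_name].mid.fmeasure, 4))

On this example Pegasus scores higher across the board. Keep in mind that here references is the original article rather than a human-written summary, so precision mostly measures how much of the summary is copied verbatim from the article, and recall is low simply because a summary is much shorter than the article.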
Tomorrow we will look at how to fine-tune a transformer for the summarization task!