We used two models to generate summaries, so is there a way to evaluate how good those summaries are? There are two common metrics for evaluating summaries: one is BLEU, the other is ROUGE.
BLEU is a precision-based metric: when comparing two texts, we count the words in the generated summary that also appear in the reference text, then divide by the length of the generated summary. The problem is that a generated summary could simply repeat the same word over and over, as long as that word also appears in the reference; if it repeats the word as many times as the reference is long, we get perfect precision!
For example, if the reference is "I have a pen, I have an apple." and the generated summary is "have, have, have, have, have, have, have, have", the clipped precision is 2/8, because "have" only occurs twice in the reference. If the reference contained eight "have"s, the score would be 8/8. The weakness of BLEU is obvious here, which is why it is rarely used to evaluate generated summaries nowadays.
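To make that concrete, here is a minimal sketch of clipped unigram precision. The clipped_precision helper below is just for illustration and is not the official BLEU implementation, which also combines higher-order n-grams and a brevity penalty:

from collections import Counter

def clipped_precision(candidate, reference):
    # Count each candidate word, but clip its count at the number of
    # times that word appears in the reference.
    cand_counts = Counter(candidate.split())
    ref_counts = Counter(reference.split())
    clipped = sum(min(n, ref_counts[word]) for word, n in cand_counts.items())
    return clipped / sum(cand_counts.values())

print(clipped_precision("have have have have have have have have",
                        "I have a pen , I have an apple ."))  # 2/8 = 0.25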
ROUGE is the more popular metric these days. Its core idea is that, for summarization, recall matters more than precision. If you are not sure what recall is, see my earlier material explaining the confusion matrix.
ROUGE comes in several variants. Briefly: ROUGE-1 counts overlapping unigrams (single words), ROUGE-2 counts overlapping bigrams, ROUGE-L is based on the longest common subsequence between the two texts, and ROUGE-Lsum computes ROUGE-L sentence by sentence over the whole summary.
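To make the recall idea concrete, here is a tiny hand-rolled sketch of ROUGE-1 recall (overlapping unigrams divided by the reference length); this is a simplification for illustration, not the official implementation:

from collections import Counter

def rouge1_recall(candidate, reference):
    # Overlap = per-word minimum of the candidate and reference counts,
    # divided by the total number of words in the reference.
    cand_counts = Counter(candidate.split())
    ref_counts = Counter(reference.split())
    overlap = sum(min(cand_counts[word], n) for word, n in ref_counts.items())
    return overlap / sum(ref_counts.values())

print(rouge1_recall("the cat sat", "the cat sat on the mat"))  # 3/6 = 0.5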
For more detail you can read the original paper; for now, let's start writing some code!
If you have not installed it yet, first run:

pip install rouge_score

Then load the metric with the datasets library:

from datasets import load_metric
rouge_metric = load_metric("rouge")
scores = rouge_metric.compute(
predictions=[paragraph_result_T5], references=[input_text]
)
print(scores)
You will get (with a single prediction there is nothing to bootstrap over, so the low, mid, and high estimates are identical):
{
"rouge1":AggregateScore(low=Score(precision=0.918918918918919,
recall=0.1691542288557214,
fmeasure=0.2857142857142857),
mid=Score(precision=0.918918918918919,
recall=0.1691542288557214,
fmeasure=0.2857142857142857),
high=Score(precision=0.918918918918919,
recall=0.1691542288557214,
fmeasure=0.2857142857142857)),
"rouge2":AggregateScore(low=Score(precision=0.6666666666666666,
recall=0.12,
fmeasure=0.20338983050847456),
mid=Score(precision=0.6666666666666666,
recall=0.12,
fmeasure=0.20338983050847456),
high=Score(precision=0.6666666666666666,
recall=0.12,
fmeasure=0.20338983050847456)),
"rougeL":AggregateScore(low=Score(precision=0.8378378378378378,
recall=0.15422885572139303,
fmeasure=0.26050420168067223),
mid=Score(precision=0.8378378378378378,
recall=0.15422885572139303,
fmeasure=0.26050420168067223),
high=Score(precision=0.8378378378378378,
recall=0.15422885572139303,
fmeasure=0.26050420168067223)),
"rougeLsum":AggregateScore(low=Score(precision=0.918918918918919,
recall=0.1691542288557214,
fmeasure=0.2857142857142857),
mid=Score(precision=0.918918918918919,
recall=0.1691542288557214,
fmeasure=0.2857142857142857),
high=Score(precision=0.918918918918919,
recall=0.1691542288557214,
fmeasure=0.2857142857142857))
}
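Each entry is an AggregateScore, so you can pull out one number per variant for easier reading, e.g. the mid F-measures:

# Print the mid (point-estimate) F-measure of each ROUGE variant.
for metric_name, aggregate in scores.items():
    print(metric_name, round(aggregate.mid.fmeasure, 4))

Now compute the same scores for the Pegasus summary: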
scores = rouge_metric.compute(
predictions=[paragraph_result_pegasus], references=[input_text]
)
scores
You will get:
{
"rouge1":AggregateScore(low=Score(precision=1.0,
recall=0.1890547263681592,
fmeasure=0.3179916317991632),
mid=Score(precision=1.0,
recall=0.1890547263681592,
fmeasure=0.3179916317991632),
high=Score(precision=1.0,
recall=0.1890547263681592,
fmeasure=0.3179916317991632)),
"rouge2":AggregateScore(low=Score(precision=0.9459459459459459,
recall=0.175,
fmeasure=0.29535864978902954),
mid=Score(precision=0.9459459459459459,
recall=0.175,
fmeasure=0.29535864978902954),
high=Score(precision=0.9459459459459459,
recall=0.175,
fmeasure=0.29535864978902954)),
"rougeL":AggregateScore(low=Score(precision=1.0,
recall=0.1890547263681592,
fmeasure=0.3179916317991632),
mid=Score(precision=1.0,
recall=0.1890547263681592,
fmeasure=0.3179916317991632),
high=Score(precision=1.0,
recall=0.1890547263681592,
fmeasure=0.3179916317991632)),
"rougeLsum":AggregateScore(low=Score(precision=1.0,
recall=0.1890547263681592,
fmeasure=0.3179916317991632),
mid=Score(precision=1.0,
recall=0.1890547263681592,
fmeasure=0.3179916317991632),
high=Score(precision=1.0,
recall=0.1890547263681592,
fmeasure=0.3179916317991632))
}
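To compare the two models side by side, you could keep both results under separate names and print the mid F-measures together (a sketch; scores_T5 and scores_pegasus are hypothetical names for the two results above):

scores_T5 = rouge_metric.compute(predictions=[paragraph_result_T5], references=[input_text])
scores_pegasus = rouge_metric.compute(predictions=[paragraph_result_pegasus], references=[input_text])

for metric_name in ["rouge1", "rouge2", "rougeL", "rougeLsum"]:
    print(metric_name,
          "T5:", round(scores_T5[metric_name].mid.fmeasure, 4),
          "Pegasus:", round(scores_pegasus[metric_name].mid.fmeasure, 4))

On this example Pegasus scores higher across the board. Keep in mind that here references is the original article rather than a human-written summary, so precision mostly measures how much of the summary is copied verbatim from the article, and recall is low simply because a summary is much shorter than the article.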
Tomorrow we will look at how to fine-tune a transformer for the summarization task!