2022 iThome 鐵人賽

DAY 21
Summarization is one of the most common tasks in natural language processing. Today, let's see how Hugging Face can help us generate summaries!

Encoder-Decoder transformer

Back on Day 17, when we surveyed the types of Transformers, we mentioned that the encoder-decoder architecture is particularly well suited to summarization. Let's use one of its best-known representatives, the T5 transformer, to summarize some text!

1. First, load a passage of text. It is the output of the text-generation example from an earlier post in this series, so it contains generation artifacts (such as "yearr" and "tithin"); we keep it verbatim here.

```python
input_text="""
Alistair Darling has been forced to consider a second bailout for banks as the lending drought worsens.

The Cancellor will decide tithin weeks whether to pump billions more into the economy as evidence mounts that the 37 billion part-nationalisation last yearr has failed to keep credit flowing,

Mr Darling, the former Liberal Democrat chancellor, admitted that the situation had become critical but insisted that there was still time to turn things around.

He told the BBC that the crisis in the banking sector was the most serious problem facing the economy but also highlighted other issues, such as the falling value of sterling and the threat of inflation.

"The worst fears about the banking crisis seem not to be panning out," he said, adding that there had not been a single banker arrested or charged over the crash.

"The economy, the economy"

Mr Darling said "there's been a very, very strong recovery" since the autumn of 2008.

"There are very big problems ahead of us, not least of which is inflation. It is likely to be a very high inflation rate. "

The economy is expected to grow by 0.3% in the quarter to the end of this year.
"""
```
2. Now call the T5 model directly through a pipeline!

```python
from transformers import pipeline

pipe = pipeline("summarization", model="t5-large")
result = pipe(input_text)
result
```

You will get a result like this:

```
[{'summary_text': 'former lib dem chancellor forced to consider second bailout for banks . evidence mounts that 37 billion part-nationalisation last yearr has failed to keep credit flowing . darling insists that there is still time to turn things around .'}]
```
3. Amusingly, in the original text Mr Darling stresses that the economy is recovering strongly, yet the generated summary says there is still time to "turn things around".

4. Pegasus, another model with an encoder-decoder architecture, has become one of the most popular summarization models in recent years. Let's give it a try too!

```python
pipe_pegasus = pipeline("summarization", model="google/pegasus-cnn_dailymail")
result_pegasus = pipe_pegasus(input_text)
result_pegasus
```

You will get the following result, in which `<n>` stands for the newline character `\n`:

```
{'summary_text': 'Mr Darling admitted that the situation had become critical but insisted that there was still time to turn things around .<n>The economy is expected to grow by 0.3% in the quarter to the end of this year .'}
```
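If all you want is real line breaks, a plain string replacement is enough. A minimal sketch (`pegasus_to_lines` is a helper name of my own, not part of the transformers library):

```python
def pegasus_to_lines(summary: str) -> str:
    # The pegasus-cnn_dailymail checkpoint emits "<n>" between sentences;
    # swap it for a real newline so the summary prints one sentence per line.
    return summary.replace("<n>", "\n")

print(pegasus_to_lines("first sentence .<n>second sentence ."))
```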

NLTK

1. NLTK is a widely used toolkit for processing English text. Its sentence tokenizer avoids splitting on the periods inside abbreviations such as "U.S.", which would otherwise be mistaken for sentence boundaries.

```python
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")
```
2. Let's play with NLTK a bit.

```python
string = "The U.S. are a country. Mr. White vs. Heisenberg."

sent_tokenize(string)
```

Even though there are several periods inside the sentences, the text is still split into exactly two sentences:

```
['The U.S. are a country.', 'Mr. White vs. Heisenberg.']
```
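For comparison, a naive split on periods shreds those abbreviations. The sketch below (plain Python, no NLTK needed) shows why a trained sentence tokenizer is worth the trouble:

```python
string = "The U.S. are a country. Mr. White vs. Heisenberg."

# Naively splitting on ". " treats every abbreviation's period as a boundary.
naive = string.split(". ")
print(naive)  # → ['The U.S', 'are a country', 'Mr', 'White vs', 'Heisenberg.']
```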
3. Now let's bring NLTK in and tidy up the two summaries we obtained above.

```python
paragraph_result_T5 = "\n".join(sent_tokenize(result[0]["summary_text"]))
print(paragraph_result_T5)

paragraph_result_pegasus = "\n".join(sent_tokenize(result_pegasus[0]["summary_text"].replace(" .<n>", " .\n")))
print(paragraph_result_pegasus)
```

You will get the output below, which is much easier on the eyes:

```
former lib dem chancellor forced to consider second bailout for banks .
evidence mounts that 37 billion part-nationalisation last yearr has failed to keep credit flowing .
darling insists that there is still time to turn things around .

Mr Darling admitted that the situation had become critical but insisted that there was still time to turn things around .
The economy is expected to grow by 0.3% in the quarter to the end of this year .
```
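You may also notice the stray space these checkpoints leave before each period when they detokenize. Both clean-up steps fit in one small helper; a minimal sketch (`tidy_summary` is my own hypothetical name, and the space-stripping is an optional extra not shown above):

```python
import re

def tidy_summary(summary: str) -> str:
    # Turn Pegasus' "<n>" separators into real newlines (a no-op for T5 output),
    # then drop the stray space the model leaves before each period.
    text = summary.replace("<n>", "\n")
    return re.sub(r" +\.", ".", text)

print(tidy_summary("darling insists .<n>there is time ."))
```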

Tomorrow we'll look at algorithms for evaluating how good a summary is.


Previous post
# Day20 - Chinese Text Generation with Hugging Face
Next post
# Day22 - Algorithms for Evaluating Summary Quality
Series
Transformers and Hugging Face: 30 Days of Practical NLP Application Development