2022 iThome 鐵人賽

DAY 21
Summarization is one of the most common tasks in natural language processing. Today, let's see how Hugging Face can help us generate summaries!

Encoder-Decoder transformer

Back on Day 17, when we surveyed the types of Transformers, we mentioned that the encoder-decoder architecture is particularly well suited to summarization. Let's use one of its best-known representatives, the T5 transformer, to summarize some text!

1. First, load a passage of text. It is the output of the text-generation example from an earlier post in this series, so it contains generation artifacts (such as "yearr" and "tithin"); we keep it verbatim here.

```python
input_text="""
Alistair Darling has been forced to consider a second bailout for banks as the lending drought worsens.

The Cancellor will decide tithin weeks whether to pump billions more into the economy as evidence mounts that the 37 billion part-nationalisation last yearr has failed to keep credit flowing,

Mr Darling, the former Liberal Democrat chancellor, admitted that the situation had become critical but insisted that there was still time to turn things around.

He told the BBC that the crisis in the banking sector was the most serious problem facing the economy but also highlighted other issues, such as the falling value of sterling and the threat of inflation.

"The worst fears about the banking crisis seem not to be panning out," he said, adding that there had not been a single banker arrested or charged over the crash.

"The economy, the economy"

Mr Darling said "there's been a very, very strong recovery" since the autumn of 2008.

"There are very big problems ahead of us, not least of which is inflation. It is likely to be a very high inflation rate. "

The economy is expected to grow by 0.3% in the quarter to the end of this year.
"""
```
2. Now call the T5 model directly through a pipeline!

```python
from transformers import pipeline

pipe = pipeline("summarization", model="t5-large")
result = pipe(input_text)
result
```

You will get a result like this:

```
[{'summary_text': 'former lib dem chancellor forced to consider second bailout for banks . evidence mounts that 37 billion part-nationalisation last yearr has failed to keep credit flowing . darling insists that there is still time to turn things around .'}]
```
3. Amusingly, in the original text Mr Darling stresses that the economy is recovering strongly, yet the generated summary says there is still time to "turn things around".

4. Pegasus, another model with an encoder-decoder architecture, has become one of the most popular summarization models in recent years. Let's give it a try too!

```python
pipe_pegasus = pipeline("summarization", model="google/pegasus-cnn_dailymail")
result_pegasus = pipe_pegasus(input_text)
result_pegasus
```

You will get the following result, in which `<n>` stands for the newline character `\n`:

```
{'summary_text': 'Mr Darling admitted that the situation had become critical but insisted that there was still time to turn things around .<n>The economy is expected to grow by 0.3% in the quarter to the end of this year .'}
```
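If all you want is real line breaks, a plain string replacement is enough. A minimal sketch (`pegasus_to_lines` is a helper name of my own, not part of the transformers library):

```python
def pegasus_to_lines(summary: str) -> str:
    # The pegasus-cnn_dailymail checkpoint emits "<n>" between sentences;
    # swap it for a real newline so the summary prints one sentence per line.
    return summary.replace("<n>", "\n")

print(pegasus_to_lines("first sentence .<n>second sentence ."))
```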

NLTK

1. NLTK is a widely used toolkit for processing English text. Its sentence tokenizer avoids splitting on the periods inside abbreviations such as "U.S.", which would otherwise be mistaken for sentence boundaries.

```python
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")
```
2. Let's play with NLTK a bit.

```python
string = "The U.S. are a country. Mr. White vs. Heisenberg."

sent_tokenize(string)
```

Even though there are several periods inside the sentences, the text is still split into exactly two sentences:

```
['The U.S. are a country.', 'Mr. White vs. Heisenberg.']
```
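For comparison, a naive split on periods shreds those abbreviations. The sketch below (plain Python, no NLTK needed) shows why a trained sentence tokenizer is worth the trouble:

```python
string = "The U.S. are a country. Mr. White vs. Heisenberg."

# Naively splitting on ". " treats every abbreviation's period as a boundary.
naive = string.split(". ")
print(naive)  # → ['The U.S', 'are a country', 'Mr', 'White vs', 'Heisenberg.']
```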
3. Now let's bring NLTK in and tidy up the two summaries we obtained above.

```python
paragraph_result_T5 = "\n".join(sent_tokenize(result[0]["summary_text"]))
print(paragraph_result_T5)

paragraph_result_pegasus = "\n".join(sent_tokenize(result_pegasus[0]["summary_text"].replace(" .<n>", " .\n")))
print(paragraph_result_pegasus)
```

You will get the output below, which is much easier on the eyes:

```
former lib dem chancellor forced to consider second bailout for banks .
evidence mounts that 37 billion part-nationalisation last yearr has failed to keep credit flowing .
darling insists that there is still time to turn things around .

Mr Darling admitted that the situation had become critical but insisted that there was still time to turn things around .
The economy is expected to grow by 0.3% in the quarter to the end of this year .
```
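You may also notice the stray space these checkpoints leave before each period when they detokenize. Both clean-up steps fit in one small helper; a minimal sketch (`tidy_summary` is my own hypothetical name, and the space-stripping is an optional extra not shown above):

```python
import re

def tidy_summary(summary: str) -> str:
    # Turn Pegasus' "<n>" separators into real newlines (a no-op for T5 output),
    # then drop the stray space the model leaves before each period.
    text = summary.replace("<n>", "\n")
    return re.sub(r" +\.", ".", text)

print(tidy_summary("darling insists .<n>there is time ."))
```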

Tomorrow we'll look at algorithms for evaluating how good a summary is.


Previous post
# Day20 - Chinese Text Generation with Hugging Face
Next post
# Day22 - Algorithms for Evaluating Summary Quality
Series
Transformers and Hugging Face: 30 Days of Practical NLP Application Development