Summarization is another very common task in natural language processing, so today let's see how Hugging Face can help us generate summaries!
Back on Day 17, when we went over the different kinds of Transformers, we mentioned that the Encoder-Decoder architecture is particularly well suited to summarization. Let's use one of its representative models, the T5 transformer, to summarize some text!
input_text="""
Alistair Darling has been forced to consider a second bailout for banks as the lending drought worsens.
The Chancellor will decide within weeks whether to pump billions more into the economy as evidence mounts that the 37 billion part-nationalisation last yearr has failed to keep credit flowing.
Mr Darling, the former Liberal Democrat chancellor, admitted that the situation had become critical but insisted that there was still time to turn things around.
He told the BBC that the crisis in the banking sector was the most serious problem facing the economy but also highlighted other issues, such as the falling value of sterling and the threat of inflation.
"The worst fears about the banking crisis seem not to be panning out," he said, adding that there had not been a single banker arrested or charged over the crash.
"The economy, the economy"
Mr Darling said "there's been a very, very strong recovery" since the autumn of 2008.
"There are very big problems ahead of us, not least of which is inflation. It is likely to be a very high inflation rate. "
The economy is expected to grow by 0.3% in the quarter to the end of this year.
"""
from transformers import pipeline
pipe = pipeline("summarization", model="t5-large")
result = pipe(input_text)
result
We get the following result:
[{'summary_text': 'former lib dem chancellor forced to consider second bailout for banks . evidence mounts that 37 billion part-nationalisation last yearr has failed to keep credit flowing . darling insists that there is still time to turn things around .'}]
In the original text Mr. Darling stresses that the economy has been recovering strongly, yet the generated summary says he "insists that there is still time to turn things around". Quite interesting.
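By the way, the pipeline hides a few steps. Below is a minimal sketch of roughly what it does for T5 under the hood, assuming the same t5-large checkpoint; the "summarize: " prefix is how T5 is told which text-to-text task to perform, and the generation settings (num_beams, max_length) here are illustrative rather than the pipeline's exact defaults.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-large")

# T5 is a text-to-text model, so the task is given as a "summarize: " prefix
inputs = tokenizer("summarize: " + input_text, return_tensors="pt", truncation=True)
summary_ids = model.generate(**inputs, max_length=150, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))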
PEGASUS, another model built on the Encoder-Decoder architecture, has become a very popular summarization model in recent years, so let's play with it too!
pipe_pegasus = pipeline("summarization", model="google/pegasus-cnn_dailymail")
result_pegasus = pipe_pegasus(input_text)
result_pegasus
We get the following result, where <n> is the newline character \n:
{'summary_text': 'Mr Darling admitted that the situation had become critical but insisted that there was still time to turn things around .<n>The economy is expected to grow by 0.3% in the quarter to the end of this year .'}
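If the summary comes out too long or too short, the pipeline also forwards the usual generation arguments to the model; the values below are just an example, not tuned settings.
# Pass generation arguments through the pipeline to control summary length
# (these particular numbers are only for illustration)
result_pegasus_short = pipe_pegasus(input_text, max_length=40, min_length=10)
result_pegasus_short[0]["summary_text"]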
nltk is a widely used toolkit for English text processing; its sentence tokenizer avoids treating abbreviations such as U.S. as sentence endings just because they contain a period.
import nltk
from nltk.tokenize import sent_tokenize
nltk.download("punkt")
string = "The U.S. are a country. Mr. White vs. Heisenberg."
sent_tokenize(string)
There are plenty of periods inside these sentences, yet the string is still split into just two sentences:
['The U.S. are a country.', 'Mr. White vs. Heisenberg.']
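For comparison, a naive split on periods cuts right through those abbreviations, which is exactly what sent_tokenize saves us from:
# Splitting on ". " breaks "U.S.", "Mr." and "vs." into separate pieces
string.split(". ")
# ['The U.S', 'are a country', 'Mr', 'White vs', 'Heisenberg.']
Now let's apply sent_tokenize to our two summaries and put each sentence on its own line: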
paragraph_result_T5 = "\n".join(sent_tokenize(result[0]["summary_text"]))
print(paragraph_result_T5)
paragraph_result_pegasus = "\n".join(sent_tokenize(result_pegasus[0]["summary_text"].replace(" .<n>", " .\n")))
print(paragraph_result_pegasus)
We get the results below, which are much easier to read:
former lib dem chancellor forced to consider second bailout for banks .
evidence mounts that 37 billion part-nationalisation last yearr has failed to keep credit flowing .
darling insists that there is still time to turn things around .
Mr Darling admitted that the situation had become critical but insisted that there was still time to turn things around .
The economy is expected to grow by 0.3% in the quarter to the end of this year .
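To wrap things up, we could bundle these steps into a small helper. The summarize_to_lines function below is just a hypothetical convenience wrapper around what we did above, not part of any library:
# Hypothetical helper: summarize, normalize Pegasus's <n> markers, then
# put each sentence on its own line
def summarize_to_lines(pipe, text):
    summary = pipe(text)[0]["summary_text"]
    summary = summary.replace(" .<n>", " .\n")  # only Pegasus emits <n>
    return "\n".join(sent_tokenize(summary))

print(summarize_to_lines(pipe_pegasus, input_text))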
Tomorrow we will look at algorithms for evaluating how good a summary is.