2022 iThome Ironman Contest

DAY 19

Today let's talk about how to improve text generation.

Greedy Search

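  1. With Hugging Face you don't have to implement greedy search yourself. Just pass these parameters: num_beams=1, do_sample=False. The code looks like this:
# Assumes tokenizer, model, and device are already set up as in the Day 18 post.
max_length = 128

# Yesterday's sentence was too simple to be fun, so today's prompt is the
# news story cited in the Bitcoin genesis block.
# The typos in the prompt are kept as run, since the outputs below depend on them.
input_txt = """
Alistair Darling has been forced to consider a second bailout for banks as the lending drought worsens. \n
The Cancellor will decide tithin weeks whether to pump billions more into the economy as evidence mounts that \
the 37 billion part-nationalisation last yearr has failed to keep credit flowing,
"""
# Tokenize the prompt and move the token ids to the same device as the model.
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
# num_beams=1 with do_sample=False selects greedy search.
output = model.generate(input_ids, max_length=max_length, num_beams=1, do_sample=False)
print(tokenizer.decode(output[0]))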

This gives:

Alistair Darling has been forced to consider a second bailout for banks as the lending drought worsens. 

The Cancellor will decide tithin weeks whether to pump billions more into the economy as evidence mounts that the 37 billion part-nationalisation last yearr has failed to keep credit flowing,

The Chancellor will also consider whether to take a second look at the Bank of England's plans to raise interest rates, which have been held at a record low of 0.5 per cent since March 2009.

The Chancellor will also consider whether to take a second look at the Bank of England's plans to raise

My prompt contains the typo "Cancellor", yet the generated text spells it "Chancellor". Impressive!

  2. If we change max_length from 128 to 256 and run it again, we get the following result:
Alistair Darling has been forced to consider a second bailout for banks as the lending drought worsens. 

The Cancellor will decide tithin weeks whether to pump billions more into the economy as evidence mounts that the 37 billion part-nationalisation last yearr has failed to keep credit flowing,

The Chancellor will also consider whether to take a second look at the Bank of England's plans to raise interest rates, which have been held at a record low of 0.5 per cent since March 2009.

The Chancellor will also consider whether to take a second look at the Bank of England's plans to raise interest rates, which have been held at a record low of 0.5 per cent since March 2009.

The Chancellor will also consider whether to take a second look at the Bank of England's plans to raise interest rates, which have been held at a record low of 0.5 per cent since March 2009.

The Chancellor will also consider whether to take a second look at the Bank of England's plans to raise interest rates, which have been held at a record low of 0.5 per cent since March 2009.

The Chancellor will also consider whether to take a second look at the Bank of England's plans to
  3. The same sentence is repeated over and over. This is the main weakness of greedy search.
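
To see why greedy search falls into a loop, here is a minimal sketch on a toy next-token table. `TOY_MODEL` and `greedy_decode` are hypothetical names made up for illustration, and the probabilities are invented. A real language model conditions on the whole prefix, not just the last token, so this only mimics the effect.

```python
# Toy next-token table (hypothetical, for illustration only): each token maps
# to a probability distribution over possible next tokens.
TOY_MODEL = {
    "the": {"chancellor": 0.6, "bank": 0.4},
    "chancellor": {"will": 0.9, "said": 0.1},
    "will": {"consider": 0.7, "decide": 0.3},
    "consider": {"the": 0.8, "rates": 0.2},
    "bank": {"will": 1.0},
    "said": {"the": 1.0},
    "decide": {"the": 1.0},
    "rates": {"the": 1.0},
}


def greedy_decode(start, max_length):
    """Always append the single most probable next token."""
    tokens = [start]
    while len(tokens) < max_length:
        next_probs = TOY_MODEL[tokens[-1]]
        tokens.append(max(next_probs, key=next_probs.get))
    return tokens


greedy_output = greedy_decode("the", max_length=9)
print(" ".join(greedy_output))
```

Because the most probable choice at each step is always the same, the output cycles through "the chancellor will consider" as soon as it returns to a token it has already visited.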

Beam Search

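  1. One common alternative is beam search, controlled by the num_beams parameter we saw above. Instead of committing to the single most probable token at each step, it keeps the num_beams most probable partial sequences and extends all of them. Let's set num_beams=3, which gives the following result: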
Alistair Darling has been forced to consider a second bailout for banks as the lending drought worsens. 

The Cancellor will decide tithin weeks whether to pump billions more into the economy as evidence mounts that the 37 billion part-nationalisation last yearr has failed to keep credit flowing,

Mr Darling said: "We have to look at what we can do to help the banks. 
"We have to look at what we can do to help the economy. 
"We have to look at what we can do to help the banks. 
"We have to look at what we can do to help the economy. 
"We have to look at what we can do to help the banks. 
"We have to look at what we can do to help the banks. 
"We have to look at what we can do to help the banks.
"We have to look at what we can do to help the banks.
"We have to look at what we can do to help the banks.
"We have to look at what we can do to help the banks. 
"We have to look at what we can do to help the banks.
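  2. OK, there are still plenty of repeated sentences, but at least they are not completely identical: "banks" and "economy" alternate, unlike greedy search, which repeats exactly the same sentence. Next we can use no_repeat_ngram_size to improve things. Let's set no_repeat_ngram_size=5, which gives the following result: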
Alistair Darling has been forced to consider a second bailout for banks as the lending drought worsens. 

The Cancellor will decide tithin weeks whether to pump billions more into the economy as evidence mounts that the 37 billion part-nationalisation last yearr has failed to keep credit flowing,

Mr Darling said: "We have to look at what we can do to help the banks.

"We have to think about what we can do for the economy.

"It is not just about the banks, it is about the economy as a whole.

"If we don't do something to help the banks, the economy is going to suffer."

He added: "We have got to get the banks lending again.
  3. This reads much more smoothly, and the heavy repetition is gone.
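
The two ideas in this section can be sketched on a toy example: keep several candidate sequences alive at once, and refuse any token that would recreate an n-gram already present. This is not the Hugging Face implementation. `TOY_MODEL`, `beam_search`, `creates_repeated_ngram`, and `has_repeated_ngram` are hypothetical names, and the probabilities are invented.

```python
import math

# Toy next-token table (hypothetical): token -> {next token: probability}.
TOY_MODEL = {
    "we": {"help": 0.6, "back": 0.4},
    "help": {"banks": 0.55, "economy": 0.45},
    "back": {"banks": 0.5, "economy": 0.5},
    "banks": {"we": 0.9, "now": 0.1},
    "economy": {"we": 0.9, "now": 0.1},
    "now": {"we": 1.0},
}


def creates_repeated_ngram(tokens, candidate, n):
    """True if appending candidate would repeat an n-gram already in tokens."""
    if n <= 0 or len(tokens) + 1 < n:
        return False
    new_ngram = tuple(tokens[len(tokens) - (n - 1):] + [candidate])
    seen = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return new_ngram in seen


def has_repeated_ngram(tokens, n):
    """True if any n-gram occurs more than once in tokens."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(ngrams) != len(set(ngrams))


def beam_search(start, max_length, num_beams, no_repeat_ngram_size=0):
    """Keep the num_beams best partial sequences by cumulative log-probability."""
    beams = [([start], 0.0)]
    for _ in range(max_length - 1):
        candidates = []
        for tokens, score in beams:
            for token, prob in TOY_MODEL[tokens[-1]].items():
                # Skip tokens that would recreate an existing n-gram.
                if creates_repeated_ngram(tokens, token, no_repeat_ngram_size):
                    continue
                candidates.append((tokens + [token], score + math.log(prob)))
        if not candidates:  # every continuation was blocked
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0][0]


plain = beam_search("we", max_length=14, num_beams=3)
blocked = beam_search("we", max_length=14, num_beams=3, no_repeat_ngram_size=2)
print(" ".join(plain), "| repeats a bigram:", has_repeated_ngram(plain, 2))
print(" ".join(blocked), "| repeats a bigram:", has_repeated_ngram(blocked, 2))
```

In this toy table the blocked run stops early once every continuation would repeat a bigram. A real vocabulary is large enough that another token is almost always available.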

Sampling

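  1. Sampling, simply put, draws the next token at random from the probability distribution over the whole vocabulary. The parameters to set are num_beams=1, do_sample=True.
  2. Sometimes this sampling is too random, and once in a while the generated text drifts to a different topic from the prompt. Try sampling a few times and see. I once got a continuation about medicine, which had nothing to do with the economic news in the prompt.
  3. When sampling, we can also add temperature to increase the diversity of the output. Let's try temperature=1.5! We get the following result: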
Alistair Darling has been forced to consider a second bailout for banks as the lending drought worsens. 

The Cancellor will decide tithin weeks whether to pump billions more into the economy as evidence mounts that the 37 billion part-nationalisation last yearr has failed to keep credit flowing,

Darling told The Times there might be options including:

RBS to provide capital on the equivalent of 30pc to 40pc loans: £5billion for the bank: and in lieu of the cash the Treasury agreed to write off a loan of 10bn from Lloyds.

He added: "It's very critical we help the struggling British institutions, rather than just a very narrowly-tied solution." 

He said the government had only put up 10 billion in funds – the Bank's 40pc – so would "need another 20 or 30 billion" to bridge an already yawning £120bn hole to public coffers.<|endoftext|>
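  4. The content is more diverse, but the higher the temperature, the more likely the generated text goes off-topic. So there is a trade-off between coherence (low temperature) and diversity (high temperature).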
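
Here is a minimal sketch of what temperature does to the distribution before a token is drawn: the scores are divided by the temperature and then passed through softmax. The `logits` values and the function name `softmax_with_temperature` are made up for illustration.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Divide the scores by the temperature, then normalize with softmax."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for four candidate next tokens.
logits = [4.0, 2.0, 1.0, 0.5]

cold = softmax_with_temperature(logits, 0.5)     # sharper distribution
neutral = softmax_with_temperature(logits, 1.0)  # plain softmax
hot = softmax_with_temperature(logits, 1.5)      # flatter distribution

for name, probs in [("0.5", cold), ("1.0", neutral), ("1.5", hot)]:
    print("temperature", name, [round(p, 3) for p in probs])

# Sampling: draw one token index at random according to the probabilities.
rng = random.Random(0)
sampled_index = rng.choices(range(len(logits)), weights=hot, k=1)[0]
print("sampled token index:", sampled_index)
```

A low temperature concentrates probability on the top token, so sampling behaves almost like greedy search. A high temperature spreads probability toward unlikely tokens, which is where the off-topic continuations come from.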

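Top-k and Top-p

These two are further sampling techniques. They still sample from the probability distribution, but they exclude low-probability tokens first.

  1. Let's set do_sample=True, top_k=50 and try it. We get the following result: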
Alistair Darling has been forced to consider a second bailout for banks as the lending drought worsens. 

The Cancellor will decide tithin weeks whether to pump billions more into the economy as evidence mounts that the 37 billion part-nationalisation last yearr has failed to keep credit flowing,

Ministers last night confirmed they were trying to get a green light to extend the Emergency Liquidity Assistance (ELA) that banks were forced to take from last year.

(REUTERS) – Europe's second largest economy faces a major threat to its finances, with the country's banks due to borrow around 27 billion euros of fresh funds from the government to keep their operations afloat this year, a week before they are expected to be sold. If approved, the government would also be forced to spend around 10 billion euros to prop up the weak banking sector, which already sits below 3.5 billion euros.

"In the coming weeks we will have to consider whether it makes sense to go for another ELA," a senior government source said.

The announcement of a third "stress test" – a test conducted by the EU's banking watchdog the European Banking Authority (EBA) – came as a shock to eurozone investors, who

The generated content now reads more like a real news article.

  2. Let's set do_sample=True, top_p=0.95 and try it. We get the following text:
Alistair Darling has been forced to consider a second bailout for banks as the lending drought worsens. 

The Cancellor will decide tithin weeks whether to pump billions more into the economy as evidence mounts that the 37 billion part-nationalisation last yearr has failed to keep credit flowing,

Mr Darling, the former Liberal Democrat chancellor, admitted that the situation had become critical but insisted that there was still time to turn things around. 

He told the BBC that the crisis in the banking sector was the most serious problem facing the economy but also highlighted other issues, such as the falling value of sterling and the threat of inflation. 

"The worst fears about the banking crisis seem not to be panning out," he said, adding that there had not been a single banker arrested or charged over the crash. 

The economy, the economy

Mr Darling said "there's been a very, very strong recovery" since the autumn of 2008.

"There are very big problems ahead of us, not least of which is inflation. It is likely to be a very high inflation rate. "

The economy is expected to grow by 0.3% in the quarter to the end of this year,
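
The difference between the two filters can be sketched on a tiny distribution. Top-k keeps a fixed number of the most probable tokens, while top-p keeps the smallest set of tokens whose cumulative probability reaches p. The function names and the `probs` values are hypothetical, and this is a sketch of the idea rather than the library's code.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = order[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


# Hypothetical next-token probabilities for a five-token vocabulary.
probs = [0.5, 0.25, 0.15, 0.07, 0.03]

top_k_result = top_k_filter(probs, k=2)
top_p_result = top_p_filter(probs, p=0.8)
print("top-k keeps token indices:", sorted(top_k_result))
print("top-p keeps token indices:", sorted(top_p_result))
```

With top-p, the number of surviving tokens adapts to the shape of the distribution: a confident prediction keeps only a few tokens, and a flat one keeps many.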

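Those are some of the most commonly used techniques for text generation. Feel free to mix and match them to find the parameters that best suit your application. For precise tasks, or for answering specific questions, lowering the temperature and using greedy or beam search tends to work better. For longer text, or text with a bit of creativity, sampling is the better choice.

Actually, when I ran greedy search and "the Bank of England" showed up, I was already pleasantly surprised, because the next paragraph of the original article begins with "The Bank of England".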
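
For mixing and matching, one option is to collect the settings tried in this post as keyword arguments for `model.generate()`. The dictionary and helper names below are hypothetical. The beam preset assumes do_sample=False, which the post does not state explicitly.

```python
# Keyword arguments for model.generate(), collected from the settings tried
# in this post. The preset names are made up for illustration.
GENERATION_PRESETS = {
    "greedy": dict(num_beams=1, do_sample=False),
    # do_sample=False is an assumption here.
    "beam": dict(num_beams=3, do_sample=False, no_repeat_ngram_size=5),
    "sampling": dict(num_beams=1, do_sample=True, temperature=1.5),
    "top_k": dict(do_sample=True, top_k=50),
    "top_p": dict(do_sample=True, top_p=0.95),
}


def generation_kwargs(preset, max_length=128):
    """Merge max_length with the chosen preset into one kwargs dict."""
    return {"max_length": max_length, **GENERATION_PRESETS[preset]}


# Usage, assuming model, tokenizer, and input_ids from the code in this post:
# output = model.generate(input_ids, **generation_kwargs("beam"))
# print(tokenizer.decode(output[0]))
print(generation_kwargs("beam"))
```

As far as I know, `generate()` uses top_k=50 by default when sampling, so the "top_p" preset still applies top-k filtering as well. Passing top_k=0 should turn that off if you want pure top-p.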


Series: 變形金剛與抱臉怪---NLP 應用開發之實戰30