iT邦幫忙

2022 iThome 鐵人賽

DAY 25

Our Hugging Face journey has quickly arrived at its final task: question answering! Question answering (QA) has always been a difficult part of natural language processing. The most commonly used kind is extractive QA, which means identifying the answer to a question within a passage of text; the search engines we use every day are a familiar example of extractive QA.

Other kinds of QA, such as long-form QA, which tackles open-ended questions like "why do humans exist", and community QA, the paired questions and answers found on forums like Stack Overflow, are ones we will set aside for now; here we focus on extractive QA.

SQuAD stands for the Stanford Question Answering Dataset, one of the most classic datasets for question answering, now updated to version 2.0 (SQuAD2.0). The dataset pairs contexts and questions with answers so that a transformer can learn from them. Let's use Hugging Face to build a QA task right away!
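To make the format concrete, here is a minimal sketch of what a SQuAD-style example looks like. The field names follow the dataset's schema, but the text itself is illustrative and not taken from SQuAD:

```python
# A representative SQuAD-style example (illustrative text, not from the dataset)
example = {
    "question": "Where is the Eiffel Tower?",
    "context": "The Eiffel Tower is a wrought-iron lattice tower in Paris, France.",
    "answers": {"text": ["Paris"], "answer_start": [52]},
}

# Extractive QA means the answer is literally a span of the context:
start = example["answers"]["answer_start"][0]
answer = example["answers"]["text"][0]
span = example["context"][start:start + len(answer)]
print(span)  # → Paris
```

SQuAD2.0 additionally contains unanswerable questions, whose `answers` lists are simply empty, so a model must also learn when to abstain.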

  1. First, load the question and context. As before, the context is the new text we previously generated from the Bitcoin genesis block news article.
question = "who is Mr Darling"
context = """
Alistair Darling has been forced to consider a second bailout for banks as the lending drought worsens. 

The Chancellor will decide within weeks whether to pump billions more into the economy as evidence mounts that the 37 billion part-nationalisation last year has failed to keep credit flowing.

Mr Darling, the former Liberal Democrat chancellor, admitted that the situation had become critical but insisted that there was still time to turn things around. 

He told the BBC that the crisis in the banking sector was the most serious problem facing the economy but also highlighted other issues, such as the falling value of sterling and the threat of inflation. 

"The worst fears about the banking crisis seem not to be panning out," he said, adding that there had not been a single banker arrested or charged over the crash. 

"The economy, the economy"

Mr Darling said "there's been a very, very strong recovery" since the autumn of 2008.

"There are very big problems ahead of us, not least of which is inflation. It is likely to be a very high inflation rate. "

The economy is expected to grow by 0.3% in the quarter to the end of this year.
"""
  2. Next, load the model and tokenizer. Here we use the roberta-base-squad2 transformer fine-tuned by the Deepset team.
from transformers import AutoModelForQuestionAnswering
from transformers import AutoTokenizer

model_name = "deepset/roberta-base-squad2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
  3. As before, use a pipeline to load the model and complete the QA task quickly. Here we add an extra parameter, top_k=3, which returns the three answers with the highest probability.
from transformers import pipeline

pipe = pipeline("question-answering", model=model, tokenizer=tokenizer)
pipe(question=question, context=context, top_k=3)

We get:

[{'score': 0.2879013121128082,
  'start': 316,
  'end': 350,
  'answer': 'former Liberal Democrat chancellor'},
 {'score': 0.27082115411758423,
  'start': 312,
  'end': 350,
  'answer': 'the former Liberal Democrat chancellor'},
 {'score': 0.23835806548595428,
  'start': 323,
  'end': 350,
  'answer': 'Liberal Democrat chancellor'}]
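Note that `start` and `end` in these results are character offsets into the context string, so the answer can always be recovered by slicing. A tiny illustration with made-up values (shortened toy data, not the offsets above):

```python
# Toy context and a toy result dict in the same shape the pipeline returns
context = "The crisis in the banking sector was the most serious problem."
result = {"score": 0.42, "start": 4, "end": 32,
          "answer": "crisis in the banking sector"}

# The answer text is exactly the context slice between start and end
assert context[result["start"]:result["end"]] == result["answer"]
```

This is handy when you need to highlight the answer inside the original document rather than just display the extracted string.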
  4. Clearly that question was too easy, so let's try a harder one: question = "What is the problem Mr Darling told to BBC?"
    We get:
[{'score': 0.41947606205940247,
  'start': 485,
  'end': 517,
  'answer': 'the crisis in the banking sector'},
 {'score': 0.2781969904899597,
  'start': 489,
  'end': 517,
  'answer': 'crisis in the banking sector'},
 {'score': 0.05736855790019035,
  'start': 499,
  'end': 517,
  'answer': 'the banking sector'}]

Wow, that is really impressive!
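Under the hood, what the pipeline does with `top_k` is roughly this: the model produces a start logit and an end logit for every token, each candidate span is scored by multiplying its start and end probabilities, and the k best spans win. A simplified, self-contained sketch of that scoring step (toy logits over five token positions; the real pipeline also handles tokenization, masking, and decoding back to text):

```python
import math
from itertools import product

def softmax(xs):
    """Convert raw logits into probabilities."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_spans(start_logits, end_logits, k=3, max_len=8):
    """Score every valid span (start <= end, bounded length) and keep the k best."""
    p_start = softmax(start_logits)
    p_end = softmax(end_logits)
    spans = [
        (p_start[i] * p_end[j], i, j)
        for i, j in product(range(len(p_start)), range(len(p_end)))
        if i <= j < i + max_len
    ]
    return sorted(spans, reverse=True)[:k]

# Toy logits: position 1 is the likeliest start, position 3 the likeliest end
start_logits = [0.1, 3.0, 0.2, 0.1, 0.1]
end_logits = [0.1, 0.2, 0.1, 2.5, 0.3]
for score, start, end in top_k_spans(start_logits, end_logits):
    print(f"span [{start}, {end}] score={score:.3f}")
```

The multiple overlapping spans in the results above ("former Liberal Democrat chancellor" vs. "the former Liberal Democrat chancellor") come out of exactly this kind of span scoring: neighbouring start/end positions often get similar probabilities.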

That wraps up today's QA task. With Hugging Face, it really has become very simple!


Previous post
# Day24- Hugging Face Named Entity Recognition
Next post
# Day26- The Architecture of Modern QA Systems
Series
變形金剛與抱臉怪---NLP 應用開發之實戰30