Semantic search BM25 COVID-19 dataset: natural-language BM25 search of COVID-19 literature


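Following the approach of the previous post (link), we switch to a different dataset and see how well NLP BM25 search works on it.
Data source: the COVID-19 metadata.csv file, downloaded from Kaggle.
Dataset description: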
In response to the COVID-19 pandemic, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19). CORD-19 is a resource of over 500,000 scholarly articles, including over 200,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset is provided to the global research community to apply recent advances in natural language processing and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease. There is a growing urgency for these approaches because of the rapid acceleration in new coronavirus literature, making it difficult for the medical research community to keep up.
The full dataset is 54 GB; we use only one CSV file, metadata.csv, which contains 63,572 papers. Its column structure is as follows:
https://ithelp.ithome.com.tw/upload/images/20211010/201113734eVcMx5zna.jpg
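
As a rough illustration of the loading step, here is a minimal sketch using only the Python standard library. It is not the author's code (that is on GitHub). The column names `cord_uid`, `title`, and `abstract` are assumed from the public CORD-19 metadata schema, and the two sample rows are made-up placeholders, not real records.

```python
import csv
import io

# Hypothetical two-row stand-in for metadata.csv; the real file has many more columns.
SAMPLE = """cord_uid,title,abstract
uid001,First placeholder title,Placeholder abstract text about vaccines.
uid002,Second placeholder title,
"""


def load_abstracts(stream):
    """Return (cord_uid, title, abstract) tuples for rows with a non-empty abstract."""
    rows = []
    for row in csv.DictReader(stream):
        abstract = (row.get("abstract") or "").strip()
        if abstract:  # skip papers that have no abstract to search
            rows.append((row["cord_uid"], row["title"], abstract))
    return rows


records = load_abstracts(io.StringIO(SAMPLE))
print(len(records), "rows with an abstract")
```

For the real file, the same function could be called on `open("metadata.csv", newline="", encoding="utf-8")` instead of the in-memory sample.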
The code is nearly identical to the previous post's, so please download it from GitHub; it is not reposted here.

We search the abstract field using three keywords: Taiwan, COVID, vaccine.

Once the search finishes, we list the five papers with the highest BM25 scores and save them to a file.
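
The search-and-rank step can be sketched roughly as below. This is a plain-Python Okapi BM25 written for illustration, not the code from the GitHub repository, which may well use a library instead. The parameters `k1=1.5` and `b=0.75` are common defaults assumed here, the helper names are hypothetical, and the four abstracts are made-up placeholders.

```python
import csv
import io
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document for the query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # a missing query term simply adds nothing
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_k(query, docs, k=5):
    """Return (doc_index, score) pairs for the k best-scoring documents."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:k]]


def write_results(stream, results, docs):
    """Write the ranked results as CSV rows to an open text stream."""
    writer = csv.writer(stream)
    writer.writerow(["rank", "doc_index", "bm25_score", "abstract"])
    for rank, (idx, score) in enumerate(results, start=1):
        writer.writerow([rank, idx, f"{score:.4f}", docs[idx]])


# Made-up placeholder abstracts, not taken from CORD-19.
docs = [
    "Vaccine policy in Taiwan during an influenza season.",
    "COVID transmission dynamics in hospital wards.",
    "A COVID vaccine trial reports early safety data.",
    "Protein folding simulations on commodity hardware.",
]
results = top_k("Taiwan COVID vaccine", docs, k=3)
buffer = io.StringIO()
write_results(buffer, results, docs)
print(buffer.getvalue())
```

In this toy corpus the rarest keyword carries the largest IDF weight, so a document matching only two of the three keywords can still come out on top. To save to disk, `write_results` could be given a file opened with `open("top5.csv", "w", newline="")` (a hypothetical file name).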

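Results: the top-ranked paper contains the keywords Taiwan and vaccine, but not COVID. BM25 adds up a score for each query term a document contains, so a paper can rank first without matching every keyword.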
https://ithelp.ithome.com.tw/upload/images/20211010/20111373zjlhlUHtAb.jpg
https://ithelp.ithome.com.tw/upload/images/20211010/20111373CCWqVYbsUK.jpg
The other papers each contain COVID, Taiwan, or vaccine.
https://ithelp.ithome.com.tw/upload/images/20211010/20111373ZGoDRXMBAb.jpg
https://ithelp.ithome.com.tw/upload/images/20211010/201113736pBaMSx3m6.jpg

Take the abstract of the first paper --> word cloud --> save the image --> LSA summary in 3 sentences
https://ithelp.ithome.com.tw/upload/images/20211010/20111373xlHygKxD7L.jpg
https://ithelp.ithome.com.tw/upload/images/20211010/20111373QYYanHyYjI.jpg
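
A word cloud is essentially a picture of term frequencies, so the counting behind it can be sketched with the standard library alone. This is only that counting step, under stated assumptions: it does not render an image or compute the LSA summary, the stop-word list is a tiny illustrative one, and the abstract text is a made-up placeholder.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "for",
              "is", "was", "were", "on", "with"}


def term_frequencies(text, top_n=10):
    """Return the top_n (term, count) pairs after removing stop words."""
    tokens = re.findall(r"[a-z][a-z0-9-]*", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(top_n)


# Placeholder text standing in for the top-ranked paper's abstract.
abstract = ("Placeholder abstract: the vaccine program in the region expanded, "
            "and vaccine uptake was tracked for the vaccine rollout in each region.")
print(term_frequencies(abstract, top_n=3))
```

The larger a term's count, the larger it would be drawn in the cloud.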

