[Day 30] Huh? It's the last day, so let's see how to do web scraping with Julia

2019 iT 邦幫忙 Ironman Contest

DAY 30

Self-Challenge Division

When Bioinfo met Julia: A Bioinformatician's 30-Day Julia Learning Journey, series part 31

2019 Ironman Contest, julialang, bioinformatics, web-scraping

nostalgie1211

2018-10-31 21:22:57

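The last day

Collecting data is an important part of bioinformatics, and not every database website is thoughtful enough to provide an API that lets us fetch its data easily. So for this last day, let's take a look at how to scrape web pages with Julia.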

Tools needed

HTTP.jl: provides the HTTP methods (GET, POST, and so on)
Gumbo.jl: parses HTML into a tree
Cascadia.jl: provides CSS selectors for querying that tree
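
To make the division of labour between the parser and the selector library concrete, here is a minimal offline sketch. It assumes the three packages are already installed, and the HTML snippet, gene names, and paths are made up purely for illustration; no network request is involved.

```julia
# One-time install, if needed:
#   using Pkg; Pkg.add(["HTTP", "Gumbo", "Cascadia"])
using Gumbo
using Cascadia

# A tiny hand-written page standing in for a real response body (hypothetical content)
page = """
<html><body>
  <ul>
    <li class="gene"><a href="/gene/BRCA1">BRCA1</a></li>
    <li class="gene"><a href="/gene/TP53">TP53</a></li>
  </ul>
</body></html>
"""

doc = parsehtml(page)                               # Gumbo: string -> DOM tree
links = eachmatch(Selector("li.gene a"), doc.root)  # Cascadia: CSS selector -> matching nodes

names = [nodeText(a) for a in links]                # text inside each <a>
hrefs = [a.attributes["href"] for a in links]       # value of each href attribute
```

In a real scraper, HTTP.jl would supply the string passed to `parsehtml`, and the rest of the pipeline stays the same.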

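An example

Today we want to scrape every question tagged `julia` on StackOverflow, together with its link. I also want to see whether each question has been answered or discussed, and how many votes it has received. We can do it like this: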

using HTTP      # HTTP requests
using Gumbo     # HTML parsing
using Cascadia  # CSS selectors

keyword = "julia"
url = "https://stackoverflow.com/questions/tagged/$keyword"

# Download the listing page and parse it into a DOM tree
response = HTTP.get(url)
html = parsehtml(String(response.body))

# Each question sits in its own .question-summary block
# (the selectors reflect the page markup at the time of writing)
questionsummary = eachmatch(Selector(".question-summary"), html.root)
for qs in questionsummary
    votes = nodeText(eachmatch(Selector(".votes .vote-count-post"), qs)[1])
    answered = !isempty(eachmatch(Selector(".status.answered"), qs))
    link = eachmatch(Selector(".question-hyperlink"), qs)[1]  # title link, looked up once
    href = link.attributes["href"]
    title = nodeText(link)
    println("$votes  $answered  [$title](https://stackoverflow.com$href)")
end
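
If the results are needed for further analysis rather than just printed, the same loop can collect them into structured rows. The sketch below is one possible way to do that, not part of the original example: the `Question` struct, the `collect_questions` helper, and the sample HTML are all hypothetical, and it assumes the vote count is a plain integer and the page uses the same class names as above.

```julia
using Gumbo
using Cascadia

# Hypothetical record type for one scraped question
struct Question
    votes::Int
    answered::Bool
    title::String
    url::String
end

# Hypothetical helper: turn one parsed listing page into a vector of Question rows.
function collect_questions(root)
    rows = Question[]
    for qs in eachmatch(Selector(".question-summary"), root)
        votenodes = eachmatch(Selector(".votes .vote-count-post"), qs)
        linknodes = eachmatch(Selector(".question-hyperlink"), qs)
        # Skip blocks missing the pieces we need instead of hitting a BoundsError
        (isempty(votenodes) || isempty(linknodes)) && continue
        link = linknodes[1]
        votes = parse(Int, strip(nodeText(votenodes[1])))  # assumes a plain integer
        answered = !isempty(eachmatch(Selector(".status.answered"), qs))
        title = strip(nodeText(link))
        url = "https://stackoverflow.com" * link.attributes["href"]
        push!(rows, Question(votes, answered, title, url))
    end
    return rows
end

# Made-up markup imitating one question block, so the sketch runs without a network call
sample = """
<div class="question-summary">
  <div class="votes"><span class="vote-count-post">12</span></div>
  <div class="status answered"><strong>2</strong>answers</div>
  <a class="question-hyperlink" href="/questions/1/example">Example question</a>
</div>
"""
rows = collect_questions(parsehtml(sample).root)
```

With a live page, the argument would instead be the `html.root` obtained from the `HTTP.get` response above, and the resulting vector can then be sorted or filtered, for example with `sort(rows, by = q -> q.votes, rev = true)`.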