A small question about extracting data with a web crawler

Web crawler

Jerry08957 2019-12-10 17:32:12 ‧ 2029 views


There are two photos in the picture, and I want the link for each one.
How do I extract them and store them in a list?

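I know the regex is [a-zA-z]+://[^\s]*.jpeg
but I don't know how to do the extraction. Could someone point me in the right direction?
If there is a way other than regex, such as finding them by attribute, I'd appreciate guidance on that too. Thanks.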

米歐 iT邦 Novice, Level 3 ‧ 2019-12-10 17:40:31

Which language?

echochio iT邦 Expert, Level 1 ‧ 2019-12-10 18:29:52

Is it Python?
Other languages can do the extraction too...

Jerry08957 iT邦 Novice, Level 5 ‧ 2019-12-11 21:17:28

Yes, Python. Sorry for not mentioning it.


2 answers

froce

iT邦 Master, Level 1 ‧ 2019-12-11 00:35:31

You didn't say which page you want to scrape, so I just picked one.
The language is Python.

First, install requests-html (pip install requests-html):

from requests_html import HTMLSession

url = "https://www.google.com/search?q=google&client=ubuntu&hs=Jw5&channel=fs&sxsrf=ACYBGNQoBB1ys9AAPy5g3glvXTdn8PWs7Q:1575993930420&source=lnms&tbm=isch&sa=X&ved=2ahUKEwiT7Y_zuqvmAhUIq5QKHdSHBcoQ_AUoAnoECAwQBA&biw=1920&bih=951"
r = HTMLSession().get(url)
imgs = r.html.find("img")

3 replies

Jerry08957 iT邦 Novice, Level 5 ‧ 2019-12-11 21:34:18

Thank you.
After extracting, I get something like this:
<Element 'img' src='/images/branding/googlelogo/2x/googlelogo_color_92x30dp.png' alt='Google' height='30' width='92' onload="typeof google==='object'&&google.aft&&google.aft(this)">
That isn't clean. Is there a way to keep only URLs that end in png or jpeg?

Han iT邦 Graduate Student, Level 1 ‧ 2019-12-12 10:36:40

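Once you have that blob of text, use a regular expression to pull the URL out of it.
I haven't used Python so I'm not certain, but from a quick look it should be: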

import re
re.search(pattern, string)

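That should let you extract the URL you want.

Python probably also has packages, like other languages do,
that read the src directly from the element's attributes.
I'll leave that part to the other experts.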
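
To make the regex route concrete, here is a minimal sketch, assuming the page's HTML is already in a string. The sample markup and the example.com URLs are made up for illustration. Two things differ from the pattern in the question: `[a-zA-z]` also matches a few punctuation characters because the range `A-z` spans them in ASCII, and an unescaped `.` matches any character, so the dot is escaped here. It also uses `re.findall` rather than `re.search`, since `search` only returns the first match.

```python
import re

# Hypothetical sample markup; the URLs are invented for this sketch.
html = '''
<img src="https://example.com/photos/cat.jpeg" alt="cat">
<img src="https://example.com/photos/dog.png" alt="dog">
<img src="/images/logo.gif" alt="logo">
'''

# Capture the value of every src attribute that ends in .jpg, .jpeg or .png.
# The dot is escaped so it matches a literal "." and not any character.
IMG_SRC = re.compile(r'src=["\']([^"\']+\.(?:jpe?g|png))["\']', re.IGNORECASE)

# findall returns the captured group of every match as a list.
image_urls = IMG_SRC.findall(html)
print(image_urls)
```

Regex on HTML is fragile (attribute order, quoting and line breaks vary), so reading the attribute through a parser is usually the safer choice when one is available.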

froce iT邦 Master, Level 1 ‧ 2019-12-12 10:56:19

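requests-html is a high-level scraping library, and what it returns are its own Element objects.
With the example above, you can easily read the src of every element in imgs: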

for img in imgs:
    src = img.attrs.get("src")  # attrs is a dict; get() returns None if src is missing

See the examples and API documentation that requests-html provides; they are short.
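
To get from individual src values to the list the question asked for, one possible sketch follows. The helper name `collect_image_urls` and its parameters are made up for illustration and are not part of requests-html. It assumes relative paths (like the Google logo above) should be resolved against the page URL, and that any query string should be ignored when checking the extension.

```python
from urllib.parse import urljoin, urlparse


def collect_image_urls(srcs, base_url, extensions=(".png", ".jpg", ".jpeg")):
    """Return absolute URLs from srcs whose path ends with one of extensions."""
    urls = []
    for src in srcs:
        if not src:  # skip None or empty values from <img> tags without src
            continue
        absolute = urljoin(base_url, src)  # resolve relative paths such as /images/...
        path = urlparse(absolute).path.lower()  # check the path only, not ?query
        if path.endswith(tuple(extensions)):
            urls.append(absolute)
    return urls


# Sample values; the example.com entries are invented for this sketch.
sample_srcs = [
    "/images/branding/googlelogo/2x/googlelogo_color_92x30dp.png",
    None,
    "https://example.com/a.jpeg?w=100",
    "https://example.com/b.gif",
]
result = collect_image_urls(sample_srcs, "https://www.google.com/search")
print(result)
```

With the imgs list from the answer above, the call would presumably look like `collect_image_urls((img.attrs.get("src") for img in imgs), url)`.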


wesley41616

iT邦 Novice, Level 5 ‧ 2019-12-12 17:59:55

Taking ettoday's hot news page as an example:

import requests
from bs4 import BeautifulSoup

url = 'https://www.ettoday.net/news/hot-news.htm'
res = requests.get(url).text
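doc = BeautifulSoup(res, 'lxml')  # the 'lxml' parser needs the lxml package installed

# Each news entry is selected by the CSS class "piece".
for news in doc.select('.piece'):
    img = news.find('img')  # first <img> in this entry, or None if there is none
    if img and img.get('src'):
        print(img['src'])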
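
If installing extra packages is not an option, the same attribute-based idea can be sketched with only the standard library's `html.parser`. This is a rough illustration, not a replacement for the answers above: the class name `ImgSrcCollector` and the sample markup are made up, and the parser is far less forgiving than BeautifulSoup on messy pages.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.srcs.append(src)


# Hypothetical sample markup; the URLs are invented for this sketch.
sample_html = (
    '<div class="piece"><img src="https://example.com/a.jpg"></div>'
    '<div class="piece"><p>no image here</p></div>'
    '<div class="piece"><img src="https://example.com/b.png" alt="b"></div>'
)

collector = ImgSrcCollector()
collector.feed(sample_html)
print(collector.srcs)
```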