Python crawler: selenium + PhantomJS returns incomplete content

python3 phantomjs selenium python crawler

ivan4 2017-12-01 04:59:51 ‧ 17716 views

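I am new to Python crawlers.
I want to scrape the images on Shopee (蝦皮) but keep failing.
With selenium + PhantomJS the content I get back is incomplete.

With Chrome the images can be scraped,
but with PhantomJS they do not come out.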

from selenium import webdriver
from urllib import request                
from bs4 import BeautifulSoup             
from urllib.parse import urlparse         
from urllib.request import urlopen
driver = webdriver.PhantomJS(executable_path=r'C:\selenium_driver_chrome\phantomjs-2.1.1-windows\bin\phantomjs.exe')  # PhantomJs
driver.get('https://shopee.tw/search/?keyword=%E5%B7%A7%E9%BA%97%E8%85%AE%E7%B4%85&sortBy=ctime')  
ps = driver.page_source 
sp = BeautifulSoup(ps, "lxml")
spimgs=[]
spimgss =sp.find("div",{"class":"lazy-image__image"})
print(spimgss)

Here, much of what belongs to the "lazy-image__image" tag does not show up:

<div class="lazy-image__image"></div>

But with chromedriver it does show up:

<div class="lazy-image__image" style='background-image: url("https://cfshopeetw-a.akamaihd.net/file/f6a861e0d7c301c9223d4ac6f27c6203_tn");'></div>

Any advice from the experts here would be appreciated!

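fillano iT邦超人 Level 1 ‧ 2017-12-01 16:21:26

Using page_source is roughly the same as right-clicking in the browser (Chrome, for example) and choosing "View page source", isn't it? My guess is that the page only loads a skeleton, and the content is displayed after JavaScript modifies the elements. Why not try fetching the elements directly through webdriver?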

http://www.seleniumhq.org/docs/03_webdriver.jsp#locating-ui-elements-webelements

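froce iT邦大師 Level 1 ‧ 2017-12-01 21:09:59

webdriver is not something you use directly; it is the intermediary layer that controls the browser.
https://www.w3.org/TR/webdriver/
So in theory selenium can simulate a person browsing the web, because it drives the browser to select and trigger page elements.
selenium is a wicked thing. Hehe.
---
Oh, I misunderstood what fillano meant. Ignore me.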

2 answers

froce

iT邦大師 Level 1 ‧ 2017-12-01 21:18:56

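1. http://selenium-python.readthedocs.io/locating-elements.html
selenium can already select elements on its own, so there is no need to bring in BS4 for that.

2. I have not actually tested this and do not plan to, because the setup is a hassle, but my guess is that it comes down to how each browser interprets JavaScript.

3. If you are still on a desktop environment but do not want the browser window to pop up, you can use headless mode. It should work on Windows from Chrome 60 onward; I have not tried it.
https://intoli.com/blog/running-selenium-with-headless-chrome/
https://stackoverflow.com/questions/43880619/headless-chrome-and-selenium-on-windows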
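
Point 1 could be sketched in Python roughly as follows. This is only a sketch, assuming the thumbnails still sit in `div.lazy-image__image` elements with an inline `background-image` style, as in the question. `background_url` and `collect_image_urls` are hypothetical helper names, and the string "css selector" is the locator strategy that selenium's `By.CSS_SELECTOR` constant stands for. The style of each element is read through the driver instead of parsing `page_source` with BS4.

```python
import re

# Matches the target of a CSS url(...) value, with or without quotes.
_URL_RE = re.compile(r"""url\(\s*["']?(.*?)["']?\s*\)""")


def background_url(style):
    """Return the URL inside a `background-image: url(...)` style, or None."""
    match = _URL_RE.search(style or "")
    return match.group(1) if match else None


def collect_image_urls(driver):
    """Read the inline style of every lazy-image div through the driver.

    `driver` is any Selenium WebDriver instance. Assumption: the class name
    from the question is still what the page uses.
    """
    # "css selector" is the value of selenium's By.CSS_SELECTOR constant.
    elements = driver.find_elements("css selector", "div.lazy-image__image")
    urls = []
    for element in elements:
        url = background_url(element.get_attribute("style"))
        if url:  # divs whose image has not loaded yet have an empty style
            urls.append(url)
    return urls
```

With a real driver this would be called as `collect_image_urls(driver)` after `driver.get(...)`. An empty list would suggest that the images have not been loaded yet rather than that the selector is wrong.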


fillano

iT邦超人 Level 1 ‧ 2017-12-02 15:43:17

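A follow-up: I ran this with selenium-webdriver on node.js (my phantomjs driver is broken and I did not feel like setting it up again, so I used safari). Besides the problem I mentioned earlier, the more important point is that every image is loaded only once it is "seen", that is, the JavaScript that loads it fires only after you scroll down and the image enters the browser's visible area. page_source is even less able to capture that. So you not only need to find the elements through webdriver; before that, you also have to scroll the page all the way to the bottom, wait a moment so the JavaScript has time to finish, and only then look up the target elements.

My test has its own pitfalls, though:

When I tested with the safari webdriver, only the images I could see were captured and the rest were null, which is why I made the assumption above.
I do not know how phantomjs sets up its visible area, so there may be other pitfalls XD. It is safer to run it once with phantomjs.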
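
The scroll-then-wait step described above might look like this in Python. It is a sketch under assumptions, not a tested recipe for this site: `scroll_through_page`, `pause` and `max_steps` are hypothetical names, and the choice to scroll one viewport at a time (rather than jumping straight to the bottom) is mine, made so that each row of images gets a chance to enter the visible area.

```python
import time

# Scroll down by one viewport, then report whether the bottom has been
# reached (with 1px of tolerance for fractional scroll positions).
_SCROLL_STEP_JS = (
    "window.scrollBy(0, window.innerHeight);"
    "return window.innerHeight + window.pageYOffset"
    " >= document.documentElement.scrollHeight - 1;"
)


def scroll_through_page(driver, pause=0.5, max_steps=50):
    """Scroll one viewport at a time until the bottom; return the step count.

    `driver` is any Selenium WebDriver instance. `pause` is the time in
    seconds given to the page's JavaScript after each step, and `max_steps`
    guards against pages that keep growing forever.
    """
    steps = 0
    while steps < max_steps:
        at_bottom = driver.execute_script(_SCROLL_STEP_JS)
        steps += 1
        time.sleep(pause)  # let the lazy-loading script run
        if at_bottom:
            break
    return steps
```

The elements would be looked up only after this function returns. A suitable `pause` depends on the network and the page, so the default here is just a placeholder.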


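froce iT邦大師 Level 1 ‧ 2017-12-03 13:22:44

Actually, I took a look at Shopee's source code, and I am rather surprised that the .lazy-image__image elements can be found at all.

For scrolling you can refer to this thread, although it uses Java.
https://stackoverflow.com/questions/12293158/page-scroll-up-or-down-in-selenium-webdriver-selenium-2-using-java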

fillano iT邦超人 Level 1 ‧ 2017-12-04 17:55:54

Here is a node.js program that can get the image information:

// Uses the selenium-webdriver 3.x style API (forBrowser('phantomjs'), actions().mouseMove()).
const {Builder, By} = require('selenium-webdriver');
const fs = require('fs');

let driver = new Builder()
    .forBrowser('phantomjs')
    .build();

// Resolve after `ms` milliseconds.
function waitThenable(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
}

driver
    .then(d => d.get('https://shopee.tw/search/?keyword=%E5%B7%A7%E9%BA%97%E8%85%AE%E7%B4%85&sortBy=ctime'))
    // Give the window a size, then scroll to the bottom of the page.
    .then(() => driver.executeScript('window.resizeTo(1024, 768);window.scroll(0,document.documentElement.scrollHeight);'))
    //.then(() => waitThenable(10))
    .then(() => driver.findElements(By.className('lazy-image__image')))
    // Visit the elements one at a time and return the chain, so the
    // screenshot is only taken after every style has been printed.
    .then(elms => elms.reduce(
        (chain, elm) => chain
            .then(() => driver.actions().mouseMove(elm, {x: 10, y: 10}).perform())
            .then(() => waitThenable(300))
            .then(() => elm.getAttribute('style'))
            .then(style => console.log(style)),
        Promise.resolve()))
    .then(() => driver.takeScreenshot())
    .then(img => fs.writeFile('catch.png', img, 'base64', err => { if (err) console.log(err); }))
    // quit() ends the session and shuts the browser process down.
    .then(() => driver.quit())
    .catch(msg => console.log(msg));

At the end it takes a screenshot of the page...