Day12: Comparing Selenium WebDriver locator methods, XPath vs. CSS selector | Kearch 1.0 keyword-report crawler tool

2018 iT邦幫忙 Ironman Contest

DAY 10

Software Development

[Marketing needs automation too] Building a simple keyword-search report app with Python Selenium + NodeJS + Amazon EC2! Part 13 of the series

2018 Ironman Contest selenium xpath urllib marketing-tech geek

Kyle

2017-12-31 13:28:30

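Other crawler articles in this series:

Crawling the Y Combinator Blog with Python scrapy
Simulating a website login with Python requests
Crawling dynamically loaded pages with Python requests and an API
Crawling consecutive pages of a site with Python Selenium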

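In the previous post we selected elements with BeautifulSoup's CSS selectors, but the same job can be done with XPath. WebDriver itself also provides methods such as find_element_by_css_selector, which help us locate elements on the page or check that the right page has actually loaded :) (These find_element_by_* helpers were deprecated in Selenium 4 and removed in 4.3; the equivalent call there is find_element(By.CSS_SELECTOR, ...).)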

Suppose we want to collect every link on the first page of Google results for several keywords. First import what we need and start the virtual display:

from selenium import webdriver
from pyvirtualdisplay import Display

display = Display(visible=0, size=(1366, 768))
display.start()

Look at the URL of a Google results page: searching for "give me keyword" produces:

https://www.google.com.tw/search?q=give+me+keyword&oq=give+me+keyword&aqs=chrome..69i57j69i60l3.632j0j1&sourceid=chrome&ie=UTF-8

By the same pattern, each keyword becomes the value of the q and oq parameters. So we can first store every term we want to search for in a list:

...
import_searchs = [
    'HarvardBiz Ray Wang',
    'The Wall Street Journal Jason Zweig',
    'BankingUX Jim Bruene',
    'TMA agency James Ashton',
    'Gartner Stessa Cohen',
    # ...
]

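Why a list? Because we want to run the searches one after another. You have probably also noticed that the spaces in a search term are encoded as "+" once they reach the URL. How do we do that?
Simple: Python's urllib.parse.quote_plus encodes a string for use as a URL query parameter: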

import urllib.parse as up

...

browser = webdriver.Chrome()
i = 'give me keyword'  # example term; in the loop below, i takes each entry of import_searchs
browser.get('https://www.google.com.tw/search?q=' + up.quote_plus(i) + '&oq=' + up.quote_plus(i) + '&aqs=chrome..69i57j69i60l3.632j0j1&sourceid=chrome&ie=UTF-8')
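The same URL can also be assembled with urllib.parse.urlencode, which applies quote_plus to every value by default. The following is only a minimal sketch: the helper name build_search_url is hypothetical, and the fixed aqs, sourceid, and ie values are simply copied from the example URL above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.google.com.tw/search'


def build_search_url(term):
    """Return the search URL for one term (hypothetical helper)."""
    params = {
        'q': term,
        'oq': term,
        # Fixed values copied from the example URL in the article.
        'aqs': 'chrome..69i57j69i60l3.632j0j1',
        'sourceid': 'chrome',
        'ie': 'UTF-8',
    }
    # urlencode quotes each value with quote_plus, so spaces become "+".
    return BASE_URL + '?' + urlencode(params)


url = build_search_url('give me keyword')
print(url)
```

Keeping the parameters in a dict avoids the long string concatenation and makes it harder to forget to encode one of the values.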

Next, run the crawl once for each string:

for i in import_searchs:
    browser.get('https://www.google.com.tw/search?q=' + up.quote_plus(i) + '&oq=' + up.quote_plus(i) + '&aqs=chrome..69i57j69i60l3.632j0j1&sourceid=chrome&ie=UTF-8')

find_element_by_css_selector

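Because a search returns more than one result (find_elements returns a list), we store the results first and print them one by one later. To grab the URLs of the first-page results with a CSS selector: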

    links = browser.find_elements_by_css_selector("h3.r > a")

find_element_by_xpath

To grab the URLs of the first-page results with XPath instead:

    links = browser.find_elements_by_xpath('//div[@class="rc"]/h3[@class="r"]/a')

Note: XPath can also add a condition, so that a result is captured only when its title contains a given piece of text:

    links = browser.find_elements_by_xpath('//div[@class="rc"]/h3[@class="r"]/a[contains(text(), "LinkedIn")]')
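The two locators above are not strictly equivalent: the XPath also requires the h3 to sit directly under a div with class "rc", while "h3.r > a" does not look at the parent at all. The sketch below illustrates this offline with the standard library's xml.etree.ElementTree, with no browser involved. The sample markup and every URL in it are made up. ElementTree supports only a subset of XPath, so the CSS selector is approximated by a path without the div step, and the contains(text(), ...) filter is replaced by a plain Python substring test.

```python
import xml.etree.ElementTree as ET

# Made-up markup that imitates the structure the article's selectors target.
SAMPLE = """
<html><body>
  <div class="rc"><h3 class="r"><a href="https://www.linkedin.com/in/example">Example Person | LinkedIn</a></h3></div>
  <div class="rc"><h3 class="r"><a href="https://example.com/about">About Example</a></h3></div>
  <div class="other"><h3 class="r"><a href="https://example.org/ad">Sponsored</a></h3></div>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())

# Rough stand-in for the CSS selector "h3.r > a": any h3 with class r.
css_like = root.findall(".//h3[@class='r']/a")

# Same shape as the article's XPath: the h3 must be a child of div.rc.
xpath_like = root.findall(".//div[@class='rc']/h3[@class='r']/a")

css_hrefs = [a.get("href") for a in css_like]
xpath_hrefs = [a.get("href") for a in xpath_like]

# ElementTree has no contains(); filter on the link text in Python instead.
linkedin_hrefs = [a.get("href") for a in xpath_like if "LinkedIn" in (a.text or "")]

print(len(css_hrefs), len(xpath_hrefs), linkedin_hrefs)
```

One further difference to keep in mind with a real page: [@class="r"] compares the whole attribute string, whereas the CSS class selector .r matches any element whose class list contains r.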

Use get_attribute to print the URLs one by one:

    for link in links:
        print(link.get_attribute("href"))

And the URLs come pouring out. If you print only the URLs that belong to a particular hostname, the result is fairly clean.
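The article does not show how the hostname filter is done, so the following is one possible sketch using urllib.parse.urlparse. The function name filter_by_hostname and the sample URLs are hypothetical, and treating subdomains as a match is an assumption about what is wanted.

```python
from urllib.parse import urlparse


def filter_by_hostname(urls, hostname):
    """Keep URLs whose host is `hostname` or one of its subdomains."""
    kept = []
    for url in urls:
        if not url:
            # get_attribute("href") returns None when the attribute is missing.
            continue
        host = urlparse(url).hostname or ""
        if host == hostname or host.endswith("." + hostname):
            kept.append(url)
    return kept


# Made-up hrefs standing in for the values printed by the loop above.
sample = [
    "https://www.linkedin.com/in/example",
    "https://example.com/about",
    None,
    "https://tw.linkedin.com/in/another",
]
linkedin_only = filter_by_hostname(sample, "linkedin.com")
print(linkedin_only)
```

Comparing the parsed hostname, rather than searching the whole URL for a substring, avoids false matches on URLs that merely mention the name in their path or query string.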

Summary

Putting the pieces above together:

from selenium import webdriver
from pyvirtualdisplay import Display
import urllib.parse as up

import_searchs = [
    'HarvardBiz Ray Wang',
    'The Wall Street Journal Jason Zweig',
    'BankingUX Jim Bruene',
    'TMA agency James Ashton',
    'Gartner Stessa Cohen',
    # ...
]

display = Display(visible=0, size=(1366, 768))
display.start()
browser = webdriver.Chrome()

for i in import_searchs:
    browser.get('https://www.google.com.tw/search?q=' + up.quote_plus(i) + '&oq=' + up.quote_plus(i) + '&aqs=chrome..69i57j69i60l3.632j0j1&sourceid=chrome&ie=UTF-8')
    links = browser.find_elements_by_xpath('//div[@class="rc"]/h3[@class="r"]/a')
    
    for link in links:
        print(link.get_attribute("href"))


browser.quit()  # quit() ends the whole driver session; close() only closes the current window
display.stop()
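If an exception is raised partway through the loop, the script above never reaches its cleanup lines. One possible restructuring, sketched here under stated assumptions, moves the crawl into a function that receives the driver as an argument and wraps the call in try/finally. The names collect_links, StubLink, and StubBrowser are hypothetical; the stub classes only stand in for a real driver so that the sketch runs without Chrome, and the example.com URLs are placeholders. find_elements("xpath", ...) is the generic form of find_elements_by_xpath.

```python
RESULT_XPATH = '//div[@class="rc"]/h3[@class="r"]/a'


def collect_links(browser, urls, xpath=RESULT_XPATH):
    """Visit each URL and return a dict mapping url -> list of hrefs."""
    results = {}
    for url in urls:
        browser.get(url)
        links = browser.find_elements("xpath", xpath)
        results[url] = [link.get_attribute("href") for link in links]
    return results


class StubLink:
    """Minimal stand-in for a WebElement (illustration only)."""

    def __init__(self, href):
        self.href = href

    def get_attribute(self, name):
        return self.href if name == "href" else None


class StubBrowser:
    """Minimal stand-in for a WebDriver (illustration only)."""

    def __init__(self, pages):
        self.pages = pages      # url -> list of hrefs "found" on that page
        self.current = None
        self.closed = False

    def get(self, url):
        self.current = url

    def find_elements(self, by, value):
        return [StubLink(h) for h in self.pages.get(self.current, [])]

    def quit(self):
        self.closed = True


browser = StubBrowser({
    "https://example.com/search?q=a": ["https://example.com/1", "https://example.com/2"],
    "https://example.com/search?q=b": [],
})
try:
    found = collect_links(browser, list(browser.pages))
finally:
    browser.quit()  # runs even if collect_links raises

print(found)
```

With a real driver, webdriver.Chrome() would be passed in place of the stub, and display.stop() would go in the same finally block.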