Articles in this scraping series:
Python scrapy: scraping the Y combinator Blog
Python requests: scraping with a simulated site login
Python requests and APIs: cracking dynamically loaded pages
Python Selenium: scraping across consecutive pages
You may remember that in the previous post we filtered elements with BeautifulSoup's CSS selectors; in fact, we can do the same thing with XPath. The webdriver itself also offers methods such as find_element_by_css_selector, which help us do an initial pass at locating things on the page, or check that we've loaded the right place :)
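For example, here is a minimal sketch of using those webdriver locator methods as a quick load check (the target page and selector are just illustrative; Google's search box is an input named "q"):

from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://www.google.com.tw/')
# find_element_by_css_selector raises NoSuchElementException when nothing
# matches, so a successful call doubles as a "did the right page load?" check
search_box = browser.find_element_by_css_selector('input[name="q"]')
print(search_box.get_attribute('name'))  # expect: q
browser.quit()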
from selenium import webdriver
from pyvirtualdisplay import Display

# Run the browser inside a virtual display so the script also works
# on a server with no desktop environment
display = Display(visible=0, size=(1366, 768))
display.start()
In the same way, each different keyword ends up as the value of the q and oq parameters. So we can start by storing all the terms we want to search for in a list:
...
import_searchs = [
    'HarvardBiz Ray Wang',
    'The Wall Street Journal Jason Zweig',
    'BankingUX Jim Bruene',
    'TMA agency James Ashton',
    'Gartner Stessa Cohen',
    ...
]
Then it's just a matter of using urllib.parse.quote_plus to encode the strings into URL parameters:

import urllib.parse as up
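As a quick illustration of what quote_plus produces (spaces become '+' and unsafe characters are percent-encoded), which is exactly the form the q and oq parameters expect:

print(up.quote_plus('HarvardBiz Ray Wang'))
# HarvardBiz+Ray+Wang
print(up.quote_plus('The Wall Street Journal Jason Zweig'))
# The+Wall+Street+Journal+Jason+Zweig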
...
browser = webdriver.Chrome()
for i in import_searchs:
    browser.get('https://www.google.com.tw/search?q=' + up.quote_plus(i) + '&oq=' + up.quote_plus(i) + '&aqs=chrome..69i57j69i60l3.632j0j1&sourceid=chrome&ie=UTF-8')
    links = browser.find_elements_by_css_selector("h3.r > a")
If we use XPath instead to grab the URLs of the first page of search results:
links = browser.find_elements_by_xpath('//div[@class="rc"]/h3[@class="r"]/a')
Side note: XPath can also take a condition, so that a result is grabbed only if its title contains a given piece of text:
links = browser.find_elements_by_xpath('//div[@class="rc"]/h3[@class="r"]/a[contains(text(), "LinkedIn")]')
Then get_attribute lets us print out the URLs one by one:

for link in links:
    print(link.get_attribute("href"))
An endless stream of URLs pours out; and if you set it to print only URLs under specific hostnames, you'll get a pretty clean result.
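A minimal sketch of that hostname filter, assuming urllib.parse is imported as up as above (the allowed hostnames here are purely hypothetical examples):

allowed_hosts = {'www.linkedin.com', 'twitter.com'}  # hypothetical whitelist
for link in links:
    url = link.get_attribute("href")
    # urlparse().hostname gives the bare host, so the comparison is exact
    if up.urlparse(url).hostname in allowed_hosts:
        print(url)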
Putting the above pieces together:
from selenium import webdriver
from pyvirtualdisplay import Display
import urllib.parse as up

import_searchs = [
    'HarvardBiz Ray Wang',
    'The Wall Street Journal Jason Zweig',
    'BankingUX Jim Bruene',
    'TMA agency James Ashton',
    'Gartner Stessa Cohen',
    ...
]

# Virtual display so the browser can run headlessly on a server
display = Display(visible=0, size=(1366, 768))
display.start()
browser = webdriver.Chrome()

for i in import_searchs:
    # Search Google for the keyword, URL-encoding it for the q/oq parameters
    browser.get('https://www.google.com.tw/search?q=' + up.quote_plus(i) + '&oq=' + up.quote_plus(i) + '&aqs=chrome..69i57j69i60l3.632j0j1&sourceid=chrome&ie=UTF-8')
    # Grab every result link on the first page
    links = browser.find_elements_by_xpath('//div[@class="rc"]/h3[@class="r"]/a')
    for link in links:
        print(link.get_attribute("href"))

browser.close()
display.stop()
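One caveat if you are on a newer Selenium: version 4 replaces the find_element(s)_by_* helpers used above with the By locator API, so the equivalent calls would look like this:

from selenium.webdriver.common.by import By

links = browser.find_elements(By.XPATH, '//div[@class="rc"]/h3[@class="r"]/a')
search_box = browser.find_element(By.CSS_SELECTOR, 'input[name="q"]')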