This article shows how to scrape movie information from the movie rating site IMDb using Selenium with the Chrome browser. Some setup is needed first: check which version of Chrome you have [2] and download the matching ChromeDriver [3].
The code is adapted from a reference article.
# Import packages
from pyquery import PyQuery as pq
from selenium import webdriver
from random import randint
import time
import pandas as pd
def get_movie_info(movie_url):
    """
    Get movie info from a given IMDb URL
    """
    d = pq(url=movie_url)  # Download and parse the movie page
    movie_rating = float(d("strong span").text())
    # The .subtext links hold the genres followed by the release date
    movie_genre = [x.text() for x in d(".subtext a").items()]
    movie_released_date = movie_genre.pop()  # The last link is the release date
    movie_poster = d(".poster img").attr('src')
    movie_cast = [x.text() for x in d(".primary_photo+ td a").items()]
    # Return the movie info
    movie_info = {
        "Rating": movie_rating,
        "Released_Date": movie_released_date,
        "Genre": movie_genre,
        "Poster_Link": movie_poster,
        "Cast": movie_cast
    }
    return movie_info
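get_movie_info() assumes that the rating text is always a number and that the last .subtext link is always the release date. The sketch below shows one way those two steps could be made more defensive, assuming the same page layout. The helper names parse_rating and split_genres_and_date are hypothetical and not part of the original code, and the sample values are made up for illustration.

```python
def parse_rating(text):
    """Return the rating as a float, or None when the text is missing or not numeric."""
    try:
        return float(text.strip())
    except (AttributeError, ValueError):
        # AttributeError covers None; ValueError covers "" and non-numeric text
        return None


def split_genres_and_date(subtext_links):
    """Split the link texts into (genres, release_date).

    Assumes the release date is the last link, as get_movie_info() does.
    """
    if not subtext_links:
        return [], None
    return list(subtext_links[:-1]), subtext_links[-1]


# Made-up sample values, only to show the shapes involved
links = ["Genre A", "Genre B", "1 January 2000 (Somewhere)"]
genres, released = split_genres_and_date(links)
rating = parse_rating("7.1")
missing_rating = parse_rating("")
```

With helpers like these, a page without a rating would produce None in the "Rating" field instead of raising a ValueError partway through the scrape.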
def get_movies(*args):
    """
    Get info for multiple movies from their titles
    """
    # Note: executable_path and find_element_by_xpath are Selenium 3 APIs;
    # Selenium 4 replaced them with a Service object and find_element(By.XPATH, ...)
    imdb_home = "https://www.imdb.com/"
    driver = webdriver.Chrome(executable_path="YOURCHROMEDRIVERPATH\\chromedriver.exe")  # Use Chrome
    movies = dict()
    for movie_title in args:
        driver.get(imdb_home)  # Go to the IMDb home page
        search_elem = driver.find_element_by_xpath("//input[@id='navbar-query']")  # Find the search bar
        search_elem.send_keys(movie_title)  # Type in the movie title
        submit_elem = driver.find_element_by_xpath("//div[@class='magnifyingglass navbarSprite']")  # Find the search button
        submit_elem.click()  # Click the search button
        category_movie_elem = driver.find_element_by_xpath("//ul[@class='findTitleSubfilterList']/li[1]/a")  # Find the filter that limits results to the movie category
        category_movie_elem.click()  # Apply the movie filter
        first_result_elem = driver.find_element_by_xpath("//tr[@class='findResult odd'][1]/td[@class='result_text']/a")  # Find the first result link
        first_result_elem.click()  # Click the result link
        # Call get_movie_info() on the page we landed on
        current_url = driver.current_url
        movie_info = get_movie_info(current_url)
        movies[movie_title] = movie_info
        time.sleep(randint(3, 8))  # Pause for a few seconds between searches
    driver.quit()  # Close the browser and stop the ChromeDriver process
    return movies
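In get_movies(), a single title with no search result raises an exception and aborts the whole loop. The sketch below shows one possible way to keep going and record the failures, while keeping the random pause between searches. It is only an illustration: polite_pause, scrape_each and the fetch argument are hypothetical names, and fetch stands in for the search-and-parse steps rather than any Selenium or IMDb API.

```python
import time
from random import randint


def polite_pause(min_seconds=3, max_seconds=8, sleep=time.sleep):
    """Sleep for a random whole number of seconds and return the delay used."""
    delay = randint(min_seconds, max_seconds)
    sleep(delay)
    return delay


def scrape_each(titles, fetch, pause=polite_pause):
    """Call fetch(title) for every title, collecting results and failures.

    fetch is a hypothetical callable standing in for the search-and-parse
    steps of get_movies(); it is not a Selenium or IMDb API.
    """
    results = {}
    failed = []
    for title in titles:
        try:
            results[title] = fetch(title)
        except Exception:  # e.g. a lookup that finds no matching element
            failed.append(title)
        pause()  # wait between searches, as the original loop does
    return results, failed


# Usage with a stand-in fetch function and no real waiting
fake_pages = {"Movie A": {"Rating": 7.0}, "Movie B": {"Rating": 6.5}}
results, failed = scrape_each(
    ["Movie A", "No Such Movie", "Movie B"],
    fetch=fake_pages.__getitem__,  # raises KeyError for unknown titles
    pause=lambda: None,
)
```

Passing the pause function as an argument keeps the delay in normal use and lets it be switched off when trying the loop out.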
# Get the movie info for the titles in the list
mov = get_movies("The Dressmaker", "Mr. Nobody", "Fanxiao")
df = pd.DataFrame.from_dict(mov, orient='columns')
df.transpose()
The code is available on GitHub.
Please let me know if there are any mistakes in this article. Thanks for reading.
References:
[1] 透過操控瀏覽器擷取網站資料 (Scraping website data by controlling the browser)
[2] What version of Chrome do I have?
[3] ChromeDriver - WebDriver for Chrome
[4] IMDb
[5] Stack Overflow