11th iT 邦幫忙 Ironman Contest

DAY 30
AI & Data

Hands on Data Cleaning and Scraping series, Part 30

Day30 Scraping from IMDb with Selenium 2/2


This article introduces scraping movie info from IMDb using Selenium and Chrome. A few setup steps come first:

  1. Go to this site [2] to check which version of Chrome you are currently using.
  2. Download the version of ChromeDriver [3] that matches your Chrome version. Unzip the file and note its path. Paste the path of your ChromeDriver into 'YOURCHROMEDRIVERPATH' in the code. Remember to end the path with chromedriver.exe and to change each \ into \\ or /.
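
The reason for the backslash rule in step 2 is that Python treats a single backslash in an ordinary string as the start of an escape sequence. The short sketch below illustrates this with a made-up folder, C:\tools, which is only an assumption for the example. The raw-string form and the PureWindowsPath join are alternatives that the steps above do not mention.

```python
from pathlib import PureWindowsPath

# Hypothetical location, used only for illustration; substitute your own folder.
broken = "C:\tools"                      # "\t" is read as a tab character, so the path is wrong
escaped = "C:\\tools\\chromedriver.exe"  # each backslash doubled
raw = r"C:\tools\chromedriver.exe"       # raw string: backslashes kept as typed
forward = "C:/tools/chromedriver.exe"    # forward slashes instead of backslashes

print("\t" in broken)                                    # True: the path contains a tab
print(escaped == raw)                                    # True: both spell the same path
print(PureWindowsPath(forward) == PureWindowsPath(raw))  # True: same Windows path

# Joining the folder and the file name, so chromedriver.exe is not forgotten
driver_path = str(PureWindowsPath(r"C:\tools") / "chromedriver.exe")
print(driver_path == raw)                                # True
```

Any of the three working spellings can be pasted in place of the path string passed to webdriver.Chrome in the code.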

Code Reference.

# import packages
from pyquery import PyQuery as pq
from selenium import webdriver
from random import randint
import time
import pandas as pd

def get_movie_info(movie_url):
    """Get movie info from a given IMDb URL."""
    d = pq(movie_url) # fetch and parse the movie page
    movie_rating = float(d("strong span").text())
    # the links under .subtext are the genres followed by the release date
    movie_genre = [x.text() for x in d(".subtext a").items()]
    movie_released_date = movie_genre.pop() # take the last link as the release date
    movie_poster = d(".poster img").attr('src')
    movie_cast = [x.text() for x in d(".primary_photo+ td a").items()]

    # return the movie info
    movie_info = {
        "Rating": movie_rating,
        "Released_Date": movie_released_date,
        "Genre": movie_genre,
        "Poster_Link": movie_poster,
        "Cast": movie_cast
    }
    return movie_info

def get_movies(*args):
    """Get multiple movies' info from movie titles."""
    imdb_home = "" # URL of the IMDb home page
    driver = webdriver.Chrome(executable_path="YOURCHROMEDRIVERPATH\\chromedriver.exe") # Use Chrome
    movies = dict()
    for movie_title in args:
        driver.get(imdb_home) # go to the IMDb home page
        search_elem = driver.find_element_by_xpath("//input[@id='navbar-query']") # find the search bar
        search_elem.send_keys(movie_title) # type in the movie title
        submit_elem = driver.find_element_by_xpath("//div[@class='magnifyingglass navbarSprite']") # find the search button
        submit_elem.click() # click the search button
        category_movie_elem = driver.find_element_by_xpath("//ul[@class='findTitleSubfilterList']/li[1]/a") # only get search results under the movie category
        category_movie_elem.click() # click to narrow the search results
        first_result_elem = driver.find_element_by_xpath("//tr[@class='findResult odd'][1]/td[@class='result_text']/a") # find the result link
        first_result_elem.click() # click the result link
        # call get_movie_info()
        current_url = driver.current_url
        movie_info = get_movie_info(current_url)
        movies[movie_title] = movie_info
        time.sleep(randint(3, 8)) # pause briefly between searches
    return movies

# get the movie info for the titles in the list
mov = get_movies("The Dressmaker", "Mr. Nobody", "Fanxiao")

df = pd.DataFrame.from_dict(mov, orient='columns')
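
Because get_movies() returns a dictionary of dictionaries, the orient argument decides whether each movie becomes a column or a row. The sketch below compares the two on invented data. The titles, values and variable names by_column and by_row are assumptions for illustration, not output from IMDb.

```python
import pandas as pd

# Hypothetical data in the same nested shape that get_movies() returns;
# the titles and values are invented for illustration.
mov = {
    "Movie A": {"Genre": ["Comedy", "Drama"], "Rating": 7.1, "Released_Date": "1 January 2000"},
    "Movie B": {"Genre": ["Sci-Fi"], "Rating": 6.4, "Released_Date": "2 February 2001"},
}

by_column = pd.DataFrame.from_dict(mov, orient="columns")  # as in the code above: one column per movie
by_row = pd.DataFrame.from_dict(mov, orient="index")       # alternative: one row per movie

print(by_column.shape)                  # (3, 2): three fields, two movies
print(by_row.shape)                     # (2, 3): two movies, three fields
print(by_row.loc["Movie A", "Rating"])  # 7.1
```

With orient='columns' the table grows sideways as titles are added. orient='index' may be easier to read or to export when the list of movies is long.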

The code is available on GitHub.
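
One step in get_movie_info() is easy to miss: the ".subtext a" links are read into a single list, and the last item is then popped off and used as the release date. The sketch below isolates that step so it can be tried without a browser. The helper name split_subtext_links and the sample values are hypothetical, and the guard for an empty list is an addition that the code above does not have.

```python
def split_subtext_links(link_texts):
    """Split link texts into (genres, release_date).

    Assumes, as get_movie_info() does, that the last link is the release date.
    """
    genres = list(link_texts)    # copy, so the caller's list is left untouched
    if not genres:               # no links found, for example if the page layout changed
        return [], None
    release_date = genres.pop()  # remove the last item and keep it as the date
    return genres, release_date

# Hypothetical link texts, made up for illustration only.
sample = ["Comedy", "Drama", "1 January 2000 (Country)"]
genres, release_date = split_subtext_links(sample)
print(genres)        # ['Comedy', 'Drama']
print(release_date)  # 1 January 2000 (Country)
```

Without such a guard, calling pop() on an empty list raises IndexError, which is one way the scraper could fail if the selector stops matching.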

Please let me know if there are any mistakes in this article. Thanks for reading.

References:

[1] 透過操控瀏覽器擷取網站資料 (Scraping Website Data by Controlling the Browser)

[2] What version of Chrome do I have?

[3] ChromeDriver - WebDriver for Chrome

[4] IMDb

[5] Stack Overflow
