Day30 Scraping from IMDb with Selenium 2/2

11th iThome Ironman Contest

DAY 30

AI & Data

Hands on Data Cleaning and Scraping series, part 30

11th Ironman Contest beautifulsoup pandas imdb chromedriver

kyt

2019-10-01 06:59:33

This article introduces scraping movie information from IMDb using Selenium and the Chrome browser. Some setup steps come first:

Go to this site to check which Chrome version you are currently using.
Download the ChromeDriver release that matches your Chrome version, unzip it, and note its path. Paste that path into the code in place of 'YOURCHROMEDRIVERPATH'. Two things to watch: the path must end with chromedriver.exe, and each backslash \ must be written as \\ or replaced with a forward slash /.
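
The path advice above can be checked in plain Python. The sketch below is illustrative only: `C:\tools\chromedriver` is a hypothetical install folder, not a path from this article. It suggests that a string with doubled backslashes, a raw string, and a forward-slash path all name the same file, while a single backslash followed by `t` is silently read as a tab.

```python
from pathlib import PureWindowsPath

# Hypothetical install folder, used only for illustration.
escaped = "C:\\tools\\chromedriver\\chromedriver.exe"  # backslashes doubled
raw = r"C:\tools\chromedriver\chromedriver.exe"        # raw string, no doubling needed
forward = "C:/tools/chromedriver/chromedriver.exe"     # forward slashes

# Without escaping, "\t" becomes a tab character and the path is wrong.
broken = "C:\tools"

print(escaped == raw)                                         # same string
print(PureWindowsPath(escaped) == PureWindowsPath(forward))   # same Windows path
print("\t" in broken)                                         # a tab slipped in
print(PureWindowsPath(raw).name)                              # file name at the end
```

Any of the first three spellings should work in place of 'YOURCHROMEDRIVERPATH'; the raw string is often the least error-prone when copying a path from Windows Explorer.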

The code is adapted from this article.

# Import packages
from pyquery import PyQuery as pq
from selenium import webdriver
from random import randint
import time
import pandas as pd

def get_movie_info(movie_url):
    """
    Get movie info from a given IMDb URL
    """
    d = pq(movie_url)
    movie_rating = float(d("strong span").text())
    movie_genre = [x.text() for x in d(".subtext a").items()]
    movie_released_date = movie_genre.pop()
    movie_poster = d(".poster img").attr('src')
    movie_cast = [x.text() for x in d(".primary_photo+ td a").items()]

    # Return the movie info
    movie_info = {
        "Rating": movie_rating,
        "Released_Date": movie_released_date,
        "Genre": movie_genre,
        "Poster_Link": movie_poster,
        "Cast": movie_cast
    }
    return movie_info

def get_movies(*args):
    """
    Get multiple movies' info from movie titles
    """
    imdb_home = "https://www.imdb.com/"
    driver = webdriver.Chrome(executable_path="YOURCHROMEDRIVERPATH\\chromedriver.exe") # Use Chrome
    movies = dict()
    for movie_title in args:
        driver.get(imdb_home) # Go to the IMDb home page
        search_elem = driver.find_element_by_xpath("//input[@id='navbar-query']") # Locate the search bar
        search_elem.send_keys(movie_title) # Type in the movie title
        submit_elem = driver.find_element_by_xpath("//div[@class='magnifyingglass navbarSprite']") # Locate the search button
        submit_elem.click() # Click the search button
        category_movie_elem = driver.find_element_by_xpath("//ul[@class='findTitleSubfilterList']/li[1]/a") # Locate the filter that limits results to the movie category
        category_movie_elem.click() # Apply the movie filter
        first_result_elem = driver.find_element_by_xpath("//tr[@class='findResult odd'][1]/td[@class='result_text']/a") # Locate the first result link
        first_result_elem.click() # Open the first result
        
        # 呼叫 get_movie_info()
        current_url = driver.current_url
        movie_info = get_movie_info(current_url)
        movies[movie_title] = movie_info
        time.sleep(randint(3, 8)) # Pause a few seconds between searches
    driver.quit() # End the WebDriver session and close the browser
    return movies

# Get the movie info for the titles in the list
mov = get_movies("The Dressmaker", "Mr. Nobody", "Fanxiao")

df = pd.DataFrame.from_dict(mov, orient='columns')
df.transpose()
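
In `get_movie_info`, the code assumes that the `.subtext a` links are the genres followed by the release date, so `pop()` takes the last item as the date and leaves only genres in the list. The sketch below replays that step on a hand-written list. The sample values and the helper name `split_genres_and_date` are made up for illustration and were not scraped from IMDb; the empty-list guard is an addition that the code above does not have.

```python
def split_genres_and_date(subtext_links):
    """Split link texts into (genres, release_date), assuming the date comes last."""
    genres = list(subtext_links)  # copy so the caller's list is left untouched
    # pop() removes and returns the last item; guard against an empty list
    release_date = genres.pop() if genres else None
    return genres, release_date

# Made-up sample values, for illustration only.
sample_links = ["Comedy", "Drama", "1 January 2016 (Country)"]
genres, release_date = split_genres_and_date(sample_links)
print(genres)
print(release_date)
```

If the page has no such links, the helper returns an empty genre list and `None` for the date instead of raising `IndexError`.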