The 11th iThome Ironman Contest

DAY 28
AI & Data

Hands on Data Cleaning and Scraping 資料清理與爬蟲實作 series — Day 28

Day28 BS4: Scrape from Youtube 2/2

Continuing from yesterday's article, today we will scrape the thumbnail URLs from the YouTube search results for Berge.

# Same setup as yesterday: import the packages and create a BeautifulSoup object
import requests
from bs4 import BeautifulSoup

url = "https://www.youtube.com/results?search_query=Berge"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
# Print each title and thumbnail URL so we can open them and verify the scraped data
for vid in soup.select(".yt-lockup-video"):
    data = vid.select("a[rel='spf-prefetch']")
    if not data:  # skip blocks without a video link to avoid an IndexError
        continue
    print(data[0].get("title"))
    img = vid.select("img")
    # lazy-loaded thumbnails keep a placeholder gif in src and the real URL in data-thumb
    if img[0].get("src") != "/yts/img/pixel-vfl3z5WfW.gif":
        print(img[0].get("src"))
    else:
        print(img[0].get("data-thumb"))
    print("-------------------")

https://ithelp.ithome.com.tw/upload/images/20190927/201197098qLVR8F09b.jpg
https://ithelp.ithome.com.tw/upload/images/20190927/201197097fmVwJY4Ew.jpg
https://ithelp.ithome.com.tw/upload/images/20190927/20119709ObtvnEiZJd.jpg

# Save the URLs into a list so we can add them to the dataframe we created yesterday

img_url = []
for vid in soup.select(".yt-lockup-video"):
    img = vid.select("img")
    # If src is the placeholder gif "/yts/img/pixel-vfl3z5WfW.gif", the real URL is
    # in data-thumb; otherwise it is directly in src
    if img[0].get("src") != "/yts/img/pixel-vfl3z5WfW.gif":
        img_url.append(img[0].get("src"))
    else:
        img_url.append(img[0].get("data-thumb"))
print(img_url[:3])

https://ithelp.ithome.com.tw/upload/images/20190927/20119709bEa42nmJyk.jpg
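The src/data-thumb fallback can be checked without hitting YouTube at all. A minimal sketch, assuming a made-up HTML fragment that mimics the two cases seen on the search page (the URLs and video IDs below are invented for illustration):

```python
from bs4 import BeautifulSoup

# Placeholder gif used by YouTube's lazy-loaded images (as observed in the article)
PLACEHOLDER = "/yts/img/pixel-vfl3z5WfW.gif"

# Tiny made-up fragment: one <img> with a real URL in src,
# one lazy-loaded <img> whose real URL sits in data-thumb.
html = """
<div class="yt-lockup-video"><img src="https://i.ytimg.com/vi/abc/hqdefault.jpg"></div>
<div class="yt-lockup-video"><img src="/yts/img/pixel-vfl3z5WfW.gif"
     data-thumb="https://i.ytimg.com/vi/def/hqdefault.jpg"></div>
"""

def thumb_url(img):
    """Return the real thumbnail URL, falling back to data-thumb for lazy-loaded images."""
    src = img.get("src")
    return src if src != PLACEHOLDER else img.get("data-thumb")

soup = BeautifulSoup(html, "html.parser")
urls = [thumb_url(vid.select("img")[0]) for vid in soup.select(".yt-lockup-video")]
print(urls)
```

Wrapping the if/else in a small helper like `thumb_url` keeps the scraping loop readable and makes the fallback rule easy to test on its own.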

Next, add the thumbnail URLs to the dataframe we created yesterday and save it as a new csv file.

import pandas as pd 
berge = pd.read_csv('berge.csv')  # read in the file we created yesterday
berge.info()   # inspect the details of the data
berge.head(3)  # print out the first three rows

https://ithelp.ithome.com.tw/upload/images/20190927/20119709P4JwPzUETR.jpg

berge['Img_URL'] = img_url  # add the thumbnail URLs as a new column
berge.head(3)

https://ithelp.ithome.com.tw/upload/images/20190927/20119709NjAD8hxaK7.jpg
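Assigning a list as a new column only works when its length matches the number of rows, so it is worth checking before the assignment. A sketch with a toy dataframe standing in for yesterday's file (the titles and URLs below are made up):

```python
import pandas as pd

# Toy stand-ins for yesterday's dataframe and the scraped URL list (made-up data)
berge = pd.DataFrame({"Title": ["Video A", "Video B", "Video C"]})
img_url = ["u1.jpg", "u2.jpg", "u3.jpg"]

# pandas raises a ValueError on a length mismatch, so check first
assert len(img_url) == len(berge), "scraped URL count does not match the row count"
berge["Img_URL"] = img_url
print(berge.columns.tolist())
```

If the scraper skipped some videos, the counts can drift apart; failing early with a clear message beats a cryptic ValueError at assignment time.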

berge.to_csv('berge_final.csv', index=False)  # save the new csv file (index=False avoids writing the row index as an extra column)

Let's have a look at the thumbnails.

from PIL import Image
from io import BytesIO
import numpy as np
import matplotlib.pyplot as plt

# Download the first thumbnail and open it as a PIL image
response = requests.get(img_url[0]) 
img = Image.open(BytesIO(response.content)) 

# Convert the image to a NumPy array so matplotlib can plot it
img = np.array(img)
plt.imshow(img)
plt.show()

https://ithelp.ithome.com.tw/upload/images/20190927/20119709RWdh5bWqRN.png
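The PIL-to-NumPy step itself can be verified without any network access. A minimal sketch using a synthetic image in place of a downloaded thumbnail:

```python
from PIL import Image
import numpy as np

# A synthetic 120x90 solid-red RGB image stands in for a downloaded thumbnail
img = Image.new("RGB", (120, 90), color=(255, 0, 0))
arr = np.array(img)

# Note the axis order: PIL reports (width, height),
# while the array is (height, width, channels)
print(img.size)   # (120, 90)
print(arr.shape)  # (90, 120, 3)
```

Keeping this axis swap in mind helps when cropping or resizing thumbnails later: array indexing is row-first, PIL sizes are width-first.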

# Convert every thumbnail to a NumPy array and append it to a list
thumbnail = []
for u in img_url:
    response = requests.get(u)
    try:
        img = Image.open(BytesIO(response.content))
    except OSError:  # skip responses that are not valid images
        continue
    img = np.array(img)
    thumbnail.append(img)
for t in thumbnail:
    plt.imshow(t)
    plt.show()

https://ithelp.ithome.com.tw/upload/images/20190927/20119709l7jM68LENj.png
https://ithelp.ithome.com.tw/upload/images/20190927/20119709bMxHbvlYWa.png
https://ithelp.ithome.com.tw/upload/images/20190927/20119709dAQw9WfrTi.png
https://ithelp.ithome.com.tw/upload/images/20190927/20119709jHRWxljYd8.png
https://ithelp.ithome.com.tw/upload/images/20190927/20119709Pz4qLoCbNC.png
https://ithelp.ithome.com.tw/upload/images/20190927/201197093MUknp5HMr.png
https://ithelp.ithome.com.tw/upload/images/20190927/20119709HNWZDmINWn.png
https://ithelp.ithome.com.tw/upload/images/20190927/20119709YYa7F0QMKo.png
https://ithelp.ithome.com.tw/upload/images/20190927/201197097yqtrGMIOp.png
https://ithelp.ithome.com.tw/upload/images/20190927/20119709jZMVochcvg.png
https://ithelp.ithome.com.tw/upload/images/20190927/20119709xdJ6Rcr6Wp.png
https://ithelp.ithome.com.tw/upload/images/20190927/20119709OXI0XRBkuN.png
https://ithelp.ithome.com.tw/upload/images/20190927/20119709vNOVJtCs7S.png
https://ithelp.ithome.com.tw/upload/images/20190927/20119709PN76Duo51x.png
https://ithelp.ithome.com.tw/upload/images/20190927/20119709SXlQB3hI14.png
https://ithelp.ithome.com.tw/upload/images/20190927/201197095wY5bTEHyF.png
https://ithelp.ithome.com.tw/upload/images/20190927/20119709ff5tSIjpwq.png
https://ithelp.ithome.com.tw/upload/images/20190927/20119709RCh5M0kMVw.png
https://ithelp.ithome.com.tw/upload/images/20190927/201197097d0d7h7Els.png
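Plotting each thumbnail in its own figure produces a long scroll of windows; arranging them in one subplot grid is more compact. A sketch of that alternative, using random synthetic arrays in place of the scraped thumbnails:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line to use plt.show()
import matplotlib.pyplot as plt

# Synthetic stand-ins for the scraped thumbnails (random noise images)
rng = np.random.default_rng(0)
thumbnail = [rng.integers(0, 256, (90, 120, 3), dtype=np.uint8) for _ in range(6)]

cols = 3
rows = -(-len(thumbnail) // cols)  # ceiling division
fig, axes = plt.subplots(rows, cols, figsize=(9, 3 * rows))
for ax, t in zip(axes.flat, thumbnail):
    ax.imshow(t)
    ax.axis("off")
for ax in axes.flat[len(thumbnail):]:  # hide any unused cells in the grid
    ax.axis("off")
fig.savefig("thumbnails.png")
```

The same `fig, axes = plt.subplots(...)` pattern works on the real `thumbnail` list from the loop above; only the figure size and column count may need tuning.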

The code is available on GitHub.

Please let me know if there are any mistakes in this article. Thanks for reading.



Previous: Day27 BS4 Scrape from Youtube 1/2 用美麗的湯爬取Youtube 1/2
Next: Day29 Scraping from IMDb with Selenium 1/2 用Selenium爬取IMDb 1/2
Series: Hands on Data Cleaning and Scraping 資料清理與爬蟲實作 (30 articles)