The 11th iThome Ironman Contest

DAY 28
AI & Data

Hands on Data Cleaning and Scraping 資料清理與爬蟲實作 series — Day 28

Day28 BS4: Scrape from Youtube 2/2

Continuing from yesterday's article, today we will scrape the thumbnail URLs from the YouTube search results for Berge.

# Same setup as yesterday: import the packages and create a BeautifulSoup object
import requests
from bs4 import BeautifulSoup

url = "https://www.youtube.com/results?search_query=Berge"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
# Print each title and thumbnail URL so we can open them and verify the scraped data
for vid in soup.select(".yt-lockup-video"):
    data = vid.select("a[rel='spf-prefetch']")
    if not data:  # skip blocks without a video link to avoid an IndexError
        continue
    print(data[0].get("title"))
    img = vid.select("img")
    # lazy-loaded thumbnails keep a placeholder gif in src and the real URL in data-thumb
    if img[0].get("src") != "/yts/img/pixel-vfl3z5WfW.gif":
        print(img[0].get("src"))
    else:
        print(img[0].get("data-thumb"))
    print("-------------------")

https://ithelp.ithome.com.tw/upload/images/20190927/201197098qLVR8F09b.jpg
https://ithelp.ithome.com.tw/upload/images/20190927/201197097fmVwJY4Ew.jpg
https://ithelp.ithome.com.tw/upload/images/20190927/20119709ObtvnEiZJd.jpg

# Save the URLs into a list so we can add them to the dataframe we created yesterday

img_url = []
for vid in soup.select(".yt-lockup-video"):
    img = vid.select("img")
    # If src is the placeholder gif "/yts/img/pixel-vfl3z5WfW.gif", the real URL is
    # in data-thumb; otherwise it is directly in src
    if img[0].get("src") != "/yts/img/pixel-vfl3z5WfW.gif":
        img_url.append(img[0].get("src"))
    else:
        img_url.append(img[0].get("data-thumb"))
print(img_url[:3])

https://ithelp.ithome.com.tw/upload/images/20190927/20119709bEa42nmJyk.jpg
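The src/data-thumb fallback can be checked without hitting YouTube at all. A minimal sketch, assuming a made-up HTML fragment that mimics the two cases seen on the search page (the URLs and video IDs below are invented for illustration):

```python
from bs4 import BeautifulSoup

# Placeholder gif used by YouTube's lazy-loaded images (as observed in the article)
PLACEHOLDER = "/yts/img/pixel-vfl3z5WfW.gif"

# Tiny made-up fragment: one <img> with a real URL in src,
# one lazy-loaded <img> whose real URL sits in data-thumb.
html = """
<div class="yt-lockup-video"><img src="https://i.ytimg.com/vi/abc/hqdefault.jpg"></div>
<div class="yt-lockup-video"><img src="/yts/img/pixel-vfl3z5WfW.gif"
     data-thumb="https://i.ytimg.com/vi/def/hqdefault.jpg"></div>
"""

def thumb_url(img):
    """Return the real thumbnail URL, falling back to data-thumb for lazy-loaded images."""
    src = img.get("src")
    return src if src != PLACEHOLDER else img.get("data-thumb")

soup = BeautifulSoup(html, "html.parser")
urls = [thumb_url(vid.select("img")[0]) for vid in soup.select(".yt-lockup-video")]
print(urls)
```

Wrapping the if/else in a small helper like `thumb_url` keeps the scraping loop readable and makes the fallback rule easy to test on its own.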

Next, add the thumbnail URLs to the dataframe we created yesterday and save it as a new csv file.

import pandas as pd 
berge = pd.read_csv('berge.csv')  # read in the file we created yesterday
berge.info()   # inspect the details of the data
berge.head(3)  # print out the first three rows

https://ithelp.ithome.com.tw/upload/images/20190927/20119709P4JwPzUETR.jpg

berge['Img_URL'] = img_url  # add the thumbnail URLs as a new column
berge.head(3)

https://ithelp.ithome.com.tw/upload/images/20190927/20119709NjAD8hxaK7.jpg
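Assigning a list as a new column only works when its length matches the number of rows, so it is worth checking before the assignment. A sketch with a toy dataframe standing in for yesterday's file (the titles and URLs below are made up):

```python
import pandas as pd

# Toy stand-ins for yesterday's dataframe and the scraped URL list (made-up data)
berge = pd.DataFrame({"Title": ["Video A", "Video B", "Video C"]})
img_url = ["u1.jpg", "u2.jpg", "u3.jpg"]

# pandas raises a ValueError on a length mismatch, so check first
assert len(img_url) == len(berge), "scraped URL count does not match the row count"
berge["Img_URL"] = img_url
print(berge.columns.tolist())
```

If the scraper skipped some videos, the counts can drift apart; failing early with a clear message beats a cryptic ValueError at assignment time.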

berge.to_csv('berge_final.csv', index=False)  # save the new csv file (index=False avoids writing the row index as an extra column)

Let's have a look at the thumbnails.

from PIL import Image
from io import BytesIO
import numpy as np
import matplotlib.pyplot as plt

# Download the first thumbnail and open it as a PIL image
response = requests.get(img_url[0]) 
img = Image.open(BytesIO(response.content)) 

# Convert the image to a NumPy array so matplotlib can plot it
img = np.array(img)
plt.imshow(img)
plt.show()

https://ithelp.ithome.com.tw/upload/images/20190927/20119709RWdh5bWqRN.png
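The PIL-to-NumPy step itself can be verified without any network access. A minimal sketch using a synthetic image in place of a downloaded thumbnail:

```python
from PIL import Image
import numpy as np

# A synthetic 120x90 solid-red RGB image stands in for a downloaded thumbnail
img = Image.new("RGB", (120, 90), color=(255, 0, 0))
arr = np.array(img)

# Note the axis order: PIL reports (width, height),
# while the array is (height, width, channels)
print(img.size)   # (120, 90)
print(arr.shape)  # (90, 120, 3)
```

Keeping this axis swap in mind helps when cropping or resizing thumbnails later: array indexing is row-first, PIL sizes are width-first.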

# Convert every thumbnail to a NumPy array and append it to a list
thumbnail = []
for u in img_url:
    response = requests.get(u)
    try:
        img = Image.open(BytesIO(response.content))
    except OSError:  # skip responses that are not valid images
        continue
    img = np.array(img)
    thumbnail.append(img)
for t in thumbnail:
    plt.imshow(t)
    plt.show()

https://ithelp.ithome.com.tw/upload/images/20190927/20119709l7jM68LENj.png
https://ithelp.ithome.com.tw/upload/images/20190927/20119709bMxHbvlYWa.png
https://ithelp.ithome.com.tw/upload/images/20190927/20119709dAQw9WfrTi.png
https://ithelp.ithome.com.tw/upload/images/20190927/20119709jHRWxljYd8.png
https://ithelp.ithome.com.tw/upload/images/20190927/20119709Pz4qLoCbNC.png
https://ithelp.ithome.com.tw/upload/images/20190927/201197093MUknp5HMr.png
https://ithelp.ithome.com.tw/upload/images/20190927/20119709HNWZDmINWn.png
https://ithelp.ithome.com.tw/upload/images/20190927/20119709YYa7F0QMKo.png
https://ithelp.ithome.com.tw/upload/images/20190927/201197097yqtrGMIOp.png
https://ithelp.ithome.com.tw/upload/images/20190927/20119709jZMVochcvg.png
https://ithelp.ithome.com.tw/upload/images/20190927/20119709xdJ6Rcr6Wp.png
https://ithelp.ithome.com.tw/upload/images/20190927/20119709OXI0XRBkuN.png
https://ithelp.ithome.com.tw/upload/images/20190927/20119709vNOVJtCs7S.png
https://ithelp.ithome.com.tw/upload/images/20190927/20119709PN76Duo51x.png
https://ithelp.ithome.com.tw/upload/images/20190927/20119709SXlQB3hI14.png
https://ithelp.ithome.com.tw/upload/images/20190927/201197095wY5bTEHyF.png
https://ithelp.ithome.com.tw/upload/images/20190927/20119709ff5tSIjpwq.png
https://ithelp.ithome.com.tw/upload/images/20190927/20119709RCh5M0kMVw.png
https://ithelp.ithome.com.tw/upload/images/20190927/201197097d0d7h7Els.png
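Plotting each thumbnail in its own figure produces a long scroll of windows; arranging them in one subplot grid is more compact. A sketch of that alternative, using random synthetic arrays in place of the scraped thumbnails:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line to use plt.show()
import matplotlib.pyplot as plt

# Synthetic stand-ins for the scraped thumbnails (random noise images)
rng = np.random.default_rng(0)
thumbnail = [rng.integers(0, 256, (90, 120, 3), dtype=np.uint8) for _ in range(6)]

cols = 3
rows = -(-len(thumbnail) // cols)  # ceiling division
fig, axes = plt.subplots(rows, cols, figsize=(9, 3 * rows))
for ax, t in zip(axes.flat, thumbnail):
    ax.imshow(t)
    ax.axis("off")
for ax in axes.flat[len(thumbnail):]:  # hide any unused cells in the grid
    ax.axis("off")
fig.savefig("thumbnails.png")
```

The same `fig, axes = plt.subplots(...)` pattern works on the real `thumbnail` list from the loop above; only the figure size and column count may need tuning.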

The code is available on GitHub.

Please let me know if there are any mistakes in this article. Thanks for reading.



Previous: Day27 BS4 Scrape from Youtube 1/2 用美麗的湯爬取Youtube 1/2
Next: Day29 Scraping from IMDb with Selenium 1/2 用Selenium爬取IMDb 1/2
Series: Hands on Data Cleaning and Scraping 資料清理與爬蟲實作 (30 articles)