Today we are going to use Beautiful Soup to scrape video titles, links, view counts, and descriptions from YouTube. The URL we are scraping is the YouTube search results page for Berge, a German band I really like (yes, this is partly a shameless plug, haha).
# Same setup steps as in day25: import the packages and create a Beautiful Soup object
import requests
from bs4 import BeautifulSoup
url = "https://www.youtube.com/results?search_query=Berge"
response = requests.get(url)
content = response.content
soup = BeautifulSoup(content, "html.parser")
print(soup)  # take a peek at how each video block is separated
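Before parsing, it can also be worth failing fast on a bad response; raise_for_status() is part of the requests API and throws for 4xx/5xx replies:

print(response.status_code)   # 200 means the request went through
response.raise_for_status()   # raises requests.HTTPError on a 4xx/5xx status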
# print out the first result to see what it looks like
look_up = []
for vid in soup.select(".yt-lockup-video"):
    look_up.append(vid)
print(look_up[0])
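If the printed markup is hard to read, Beautiful Soup's prettify() re-indents it, which makes it easier to spot the class names used in the next step:

print(look_up[0].prettify())  # pretty-print the first video block with indentation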
title = []
uploaded = []
watch_counts = []
description = []
link = []
for vid in soup.select(".yt-lockup-video"):
    data = vid.select("a[rel='spf-prefetch']")  # get the anchor tag that carries the title and link
    title.append(data[0].get("title"))  # save the title
    link.append("https://www.youtube.com{}".format(data[0].get("href")))  # save the link
    time = vid.select(".yt-lockup-meta-info")  # get the block with the upload time and view count
    uploaded.append(time[0].get_text("#").split("#")[0])  # save the upload time
    watch_counts.append(time[0].get_text("#").split("#")[1])  # save the view count
    disc = vid.select(".yt-lockup-description")  # get the description block
    try:
        description.append(disc[0].get_text())  # save the description if there is one
    except IndexError:
        description.append("NaN")
        continue  # append a NaN placeholder and move on if there is none
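The get_text("#") calls may look odd: passing a separator makes Beautiful Soup join the text of every child element with that string, so splitting on the same string recovers the upload time and the view count as separate fields. A minimal sketch with a hypothetical meta block that mimics the markup above:

from bs4 import BeautifulSoup
meta_html = "<ul class='yt-lockup-meta-info'><li>3 years ago</li><li>1,234,567 views</li></ul>"
meta = BeautifulSoup(meta_html, "html.parser").select(".yt-lockup-meta-info")[0]
print(meta.get_text("#"))             # 3 years ago#1,234,567 views
print(meta.get_text("#").split("#"))  # ['3 years ago', '1,234,567 views']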
print(title[0])
print(uploaded[0])
print(watch_counts[0])
print(description[0])
print(link[0])
As the screenshot above shows, the search results sometimes contain things like the artist's channel or playlists, which may have no text in the description, or no description section at all. That is why the loop catches the IndexError, appends a placeholder, and continues instead of crashing.
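If you prefer to avoid try/except here, an equivalent guard (a sketch of the same loop step) is to check whether select() returned anything before indexing, since it yields an empty list when the block is missing:

disc = vid.select(".yt-lockup-description")  # empty list when there is no description block
description.append(disc[0].get_text() if disc else "NaN")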
import pandas as pd
berge = {'Title': title, 'Uploaded': uploaded, 'Watch_Counts': watch_counts, 'Description': description, 'Link': link}
berge = pd.DataFrame(berge)  # build a DataFrame from the parallel lists
berge.head()  # preview the first five rows
berge.to_csv('berge.csv')  # export the results to a csv file
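To double-check the export, you can read the file straight back; index_col=0 re-uses the unnamed index column that to_csv writes by default:

check = pd.read_csv('berge.csv', index_col=0)
print(check.shape)  # should match berge.shape
check.head()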
The code is available on GitHub.
Please let me know if there are any mistakes in this article. Thanks for reading.