Day27 BS4: Scrape from Youtube 1/2 (Scraping Youtube with Beautiful Soup, part 1 of 2)

Today we are going to use Beautiful Soup to scrape video titles, links, view counts, and descriptions from Youtube. The page we are scraping is the Youtube search results for Berge, a German band that makes great music (honestly, this is also an excuse to recommend them).
https://ithelp.ithome.com.tw/upload/images/20190927/20119709nFwV36nPiI.jpg

# Same setup steps as the previous two days: import packages, create a Beautiful Soup object
import requests
from bs4 import BeautifulSoup

url = "https://www.youtube.com/results?search_query=Berge"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
print(soup)  # peek at how each video block is separated

# collect every video block, then print the first one to inspect its structure
look_up = []
for vid in soup.select(".yt-lockup-video"):
    look_up.append(vid)

print(look_up[0])

https://ithelp.ithome.com.tw/upload/images/20190927/20119709UQnp77Kjgp.jpg
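One hedged caveat before going further: requests does not raise an error on a 4xx/5xx response, and the markup Youtube serves can vary with the requesting client, so the .yt-lockup-video blocks may simply be absent. A minimal defensive version of the setup (the User-Agent string here is an illustrative assumption, not something this article depends on):

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0"}  # assumed desktop User-Agent, purely illustrative
response = requests.get("https://www.youtube.com/results?search_query=Berge", headers=headers)
response.raise_for_status()  # fail loudly instead of silently parsing an error page
soup = BeautifulSoup(response.content, "html.parser")
print(len(soup.select(".yt-lockup-video")))  # how many video blocks did we actually get?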

Create several empty lists so we can put the data we want into them later on.

title = []
uploaded = []
watch_counts = []
description = []
link = []

Read through each video block, pick out the parts we want, and save them into the lists we just created.

https://ithelp.ithome.com.tw/upload/images/20190927/20119709bxdce6cLXQ.jpg

for vid in soup.select(".yt-lockup-video"):
    
    data = vid.select("a[rel='spf-prefetch']")  # the anchor tag holds the title and the link
    title.append(data[0].get("title"))  # save the title
    link.append("https://www.youtube.com{}".format(data[0].get("href")))  # save the link (href is relative)
    
    meta = vid.select(".yt-lockup-meta-info")  # this block holds the upload time and the view count
    uploaded.append(meta[0].get_text("#").split("#")[0])  # save the upload time
    watch_counts.append(meta[0].get_text("#").split("#")[1])  # save the view count
    
    disc = vid.select(".yt-lockup-description")  # the description block, if the result has one
    try:
        description.append(disc[0].get_text())  # save the description if there is one
    except IndexError:
        description.append("NaN")  # if there is none, save a placeholder and move on
        continue

print(title[0])
print(uploaded[0])
print(watch_counts[0])
print(description[0])
print(link[0])

https://ithelp.ithome.com.tw/upload/images/20190927/201197090Uop0pBdWR.jpg
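A quick aside on the get_text("#") trick in the loop above: passing a separator to get_text() joins the text of all child tags with that separator, so the two <li> entries (upload time and view count) come back as one string that splits cleanly. A tiny sketch on made-up markup:

from bs4 import BeautifulSoup

snippet = "<ul><li>3 years ago</li><li>1,234 views</li></ul>"  # made-up example markup
meta = BeautifulSoup(snippet, "html.parser")
print(meta.get_text("#"))             # 3 years ago#1,234 views
print(meta.get_text("#").split("#"))  # ['3 years ago', '1,234 views']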

A note on the description-scraping part

As the picture above shows, the search results sometimes contain the artist's channel or playlists, and those entries may have no text in the description or no description section at all. In that case disc[0] raises an IndexError, so we catch it, store a placeholder, and continue to the next iteration to avoid crashing the loop.
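If you would rather not rely on try/except, an equivalent approach is to check whether select() returned anything before indexing into it; a minimal drop-in sketch for the loop body above:

    disc = vid.select(".yt-lockup-description")
    if disc:
        description.append(disc[0].get_text())  # save the description if there is one
    else:
        description.append("NaN")  # placeholder when the description block is missing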

Save the data we scraped

import pandas as pd
berge = {'Title':title, 'Uploaded':uploaded, 'Watch_Counts':watch_counts, 'Description':description, 'Link':link}
berge = pd.DataFrame(berge)
berge.head()

https://ithelp.ithome.com.tw/upload/images/20190927/201197098VKmKMuv03.jpg
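A side note on the DataFrame step: pd.DataFrame on a dict of lists requires every list to have the same length, which is exactly why the loop appends a "NaN" placeholder instead of just skipping missing descriptions. A tiny sketch of what happens otherwise:

import pandas as pd

pd.DataFrame({'a': [1, 2], 'b': [3]})  # raises ValueError: the lists must all be the same length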

berge.to_csv('berge.csv')
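To double-check the saved file, you can read it straight back; passing index_col=0 to pd.read_csv treats the unnamed first column as the index instead of duplicating it:

check = pd.read_csv('berge.csv', index_col=0)
print(check.shape)  # rows should match the number of scraped videos, with 5 columns
check.head()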

The code is available on GitHub.

Please let me know if there are any mistakes in this article. Thanks for reading.


