Web Scraping: Collecting the Top 50 Songs from a Streaming Platform's Weekly Popular Chart

Web Scraping

AlbertShiu 2024-03-27 20:30:04 ‧ 1314 views


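Goal: scrape the top 50 songs on a streaming platform's weekly popular chart so that the data can be used for later analysis of the platform.

Steps:

Import the required packages
Set headers to mimic a real browser
Scrape the weekly top 50 song list

Import the required packages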

import re
import requests
import json
import MySQLdb
from bs4 import BeautifulSoup
from sqlalchemy import create_engine

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Set headers to mimic a real browser

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36'}

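How do you find the headers?
On the target page, press F12 to open the developer tools and select any request in the list (here, the one named all). The User-Agent entry of that request holds the value to copy into headers. Sending it makes the script's request look like a visit from a regular browser.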
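To confirm which User-Agent a script will actually send, the request can be built without sending it. The following is a minimal sketch that swaps in the standard library's urllib.request for requests so that it runs offline. The week number 12 in the URL is only an illustrative assumption.

```python
import urllib.request

# Same User-Agent string as in the headers dict above.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"
    )
}

# Hypothetical example URL; the week number 12 is an assumption.
url = "https://streetvoice.com/music/charts/weekly/2024/12/all/"

# Building a Request object does not contact the server.
req = urllib.request.Request(url, headers=headers)

# urllib stores header names via str.capitalize(), hence "User-agent".
sent_user_agent = req.get_header("User-agent")
print(sent_user_agent)
```

If the printed value matches the string shown in the browser's developer tools, the script identifies itself the same way the browser does.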

Scrape the weekly top 50 song list
Connect to MySQL so that the scraped data can be written straight into the database.

# Credentials to database connection
hostname="localhost"
dbname="street_rank"
uname="OOO"
pwd="XXXXXX"

# Create SQLAlchemy engine to connect to MySQL Database
engine = create_engine("mysql://{user}:{pw}@{host}/{db}"
				.format(host=hostname, db=dbname, user=uname, pw=pwd)).connect()
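One caveat with formatting credentials straight into the URL: a password containing characters such as @ or / would break the URL. A minimal sketch of one way to guard against that is to percent-encode the user name and password first. The helper name build_mysql_url and the credentials below are hypothetical.

```python
from urllib.parse import quote_plus


def build_mysql_url(user, password, host, database):
    """Return a mysql:// URL with the user name and password percent-encoded."""
    return "mysql://{user}:{pw}@{host}/{db}".format(
        user=quote_plus(user),
        pw=quote_plus(password),
        host=host,
        db=database,
    )


# Hypothetical credentials, for illustration only.
url = build_mysql_url("scraper", "p@ss/word", "localhost", "street_rank")
print(url)  # mysql://scraper:p%40ss%2Fword@localhost/street_rank
```

The resulting string can be passed to create_engine in place of the hand-formatted one.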

Set up a GET request for the URL of the page to scrape.

# `period` must be defined beforehand; judging by the URL pattern, it is the chart's week number
url = 'https://streetvoice.com/music/charts/weekly/2024/'+str(period)+'/all/'
response = requests.get(url, headers=headers)
response.encoding = 'utf-8'
# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
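The URL above depends on a period value that the snippet does not define. As a sketch, assuming period is the week number within the year, the URLs for several weeks could be generated like this. The helper name build_chart_url and the range of weeks are hypothetical.

```python
def build_chart_url(year, period):
    """Build a weekly-chart URL following the pattern used in the request above."""
    return "https://streetvoice.com/music/charts/weekly/{}/{}/all/".format(year, period)


# Hypothetical range of weeks; the article does not say which periods were scraped.
urls = [build_chart_url(2024, period) for period in range(1, 4)]
for u in urls:
    print(u)
```

Each generated URL can then be passed to requests.get with the same headers.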

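Scrape the rank, work title, work URL, genre, composer, and composer URL for each of the top 50 entries.
The F12 developer tools show that this information sits in div tags with the class work-item-info.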

# Scrape the top 50 works and their links
rank_work = soup.find_all('div', {'class':'work-item-info'})
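The article does not show how individual fields are pulled out of each matched div. The following is only a sketch under stated assumptions. The sample markup is hypothetical (the first link in each div is taken to be the work and the second the composer), the rank is inferred from the item's position on the page, and genre is omitted because its location in the markup is unknown.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real structure inside each work-item-info div
# is not shown in the article and may differ.
SAMPLE_HTML = """
<div class="work-item-info">
  <h4><a href="/composer_a/songs/1/">Song A</a></h4>
  <h5><a href="/composer_a/">Composer A</a></h5>
</div>
<div class="work-item-info">
  <h4><a href="/composer_b/songs/2/">Song B</a></h4>
  <h5><a href="/composer_b/">Composer B</a></h5>
</div>
"""

BASE_URL = "https://streetvoice.com"


def parse_work_items(html):
    """Return one dict per work-item-info div, assuming the layout sketched above."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    items = soup.find_all("div", {"class": "work-item-info"})
    # Assumption: items appear in chart order, so position gives the rank.
    for rank, item in enumerate(items, start=1):
        links = item.find_all("a")
        if len(links) < 2:
            continue  # skip items that do not match the assumed layout
        work_link, composer_link = links[0], links[1]
        rows.append({
            "rank": rank,
            "work": work_link.get_text(strip=True),
            "work_url": BASE_URL + work_link.get("href", ""),
            "composer": composer_link.get_text(strip=True),
            "composer_url": BASE_URL + composer_link.get("href", ""),
        })
    return rows


rows = parse_work_items(SAMPLE_HTML)
for row in rows:
    print(row)
```

A list of dicts like rows can be turned into a table with pd.DataFrame(rows).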

Save to MySQL

# Convert dataframe to sql table
df.to_sql(date, engine, index=False, if_exists='replace')
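To see what to_sql with if_exists='replace' does without a running MySQL server, here is a small sketch that uses an in-memory SQLite connection as a stand-in for the MySQL engine. The rows and the table name are hypothetical placeholders for the scraped data and the date variable.

```python
import sqlite3

import pandas as pd

# Hypothetical rows standing in for the scraped chart data.
df = pd.DataFrame([
    {"rank": 1, "work": "Song A", "composer": "Composer A"},
    {"rank": 2, "work": "Song B", "composer": "Composer B"},
])

# Hypothetical table name; the article names the table after a `date` variable.
table_name = "chart_2024_12"

# In-memory SQLite database, used here only as a stand-in for MySQL.
conn = sqlite3.connect(":memory:")

# if_exists="replace" drops any existing table of that name before writing.
df.to_sql(table_name, conn, index=False, if_exists="replace")

# Read the table back to check what was stored.
loaded = pd.read_sql_query(
    'SELECT * FROM "{}" ORDER BY "rank"'.format(table_name), conn
)
print(loaded)
conn.close()
```

Because of if_exists="replace", running the write twice leaves a single copy of the data. The value "append" would add the rows again instead.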

Part of the song list (the top five) is shown below:
