Today we'll use Python's Beautiful Soup module to try out some simple web crawling.
Install beautifulsoup4:
pip3 install beautifulsoup4
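The Holiday KTV example later in this post also uses pandas, which installs the same way:
pip3 install pandas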
Let's crawl the movie rankings from the Yahoo Movies chart page. First we need to be able to fetch the page content, and that takes the ssl module as well, otherwise the request fails with a CERTIFICATE_VERIFY_FAILED error.
>>> import ssl
>>> from urllib import request
>>> context = ssl._create_unverified_context()
>>> req_obj = request.Request('https://movies.yahoo.com.tw/chart.html')
>>> with request.urlopen(req_obj, context=context) as res_obj:
...     print(res_obj.read())
...
b'<!DOCTYPE html>\n<html lang="en">\n<head>\n <meta charset="UTF-8">\n <meta name="viewport" content="width=device-width, initial-scale=1, user-minimum-scale=1, maximum-scale=1">\n <meta http-equiv="content-type" content="text/html; charset=utf-8">\n <meta property="fb:app_id" content="501887343352051">\n <meta property="og:site_name" content="Yahoo\xe5\xa5\x87\xe6\x91\xa9\xe9\x9b\xbb\xe5\xbd\xb1">\n <title>\xe5\x8f\xb0\xe5\x8c\x97\xe7\xa5\xa8\xe6\x88\xbf\xe6\xa6\x9c
......
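A note on ssl._create_unverified_context(): it skips certificate verification entirely. If you'd rather keep verification on, one alternative sketch (assuming the third-party certifi package is installed) is to build a default context from certifi's CA bundle; on success this should print 200:
>>> import certifi
>>> context = ssl.create_default_context(cafile=certifi.where())
>>> with request.urlopen(req_obj, context=context) as res_obj:
...     print(res_obj.status)
...
200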
Parse the fetched page content with html.parser; soup.prettify() gives a readable view of the document.
>>> from bs4 import BeautifulSoup
>>> with request.urlopen(req_obj, context=context) as res_obj:
...     resp = res_obj.read().decode('utf-8')
...     soup = BeautifulSoup(resp, 'html.parser')
...     print(soup.prettify())
...
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1, user-minimum-scale=1, maximum-scale=1" name="viewport"/>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="501887343352051" property="fb:app_id"/>
<meta content="Yahoo奇摩電影" property="og:site_name"/>
<title>
台北票房榜 - Yahoo奇摩電影
</title>
...
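Besides prettify(), the soup object also exposes elements directly as attributes. For example, based on the output above:
>>> soup.title
<title>台北票房榜 - Yahoo奇摩電影</title>
>>> soup.title.string
'台北票房榜 - Yahoo奇摩電影'
>>> soup.find('meta', property='og:site_name')['content']
'Yahoo奇摩電影'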
Next, locate the part of the page that holds the content we want to crawl. The movie-ranking block is wrapped in <div class="rank_list table rankstyle1"> (there's a small selector sketch after this snippet).
<div class="rank_list table rankstyle1">
<div class="tr top">
<div class="td">本週</div>
<div class="td updown"></div>
<div class="td">上週</div>
<div class="td">片名</div>
<div class="td">上映日期</div>
<div class="td">預告片</div>
<div class="td">網友滿意度</div>
</div>
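To work with just this block instead of the whole page, one option is BeautifulSoup's CSS-selector API (a small sketch; based on the header markup shown above, the first row's labels should come back like this):
>>> rank_block = soup.select_one('div.rank_list.table.rankstyle1')
>>> rows = rank_block.find_all('div', class_='tr')
>>> list(rows[0].stripped_strings)
['本週', '上週', '片名', '上映日期', '預告片', '網友滿意度']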
The complete crawler, crawler.py:
import ssl
from urllib import request
from bs4 import BeautifulSoup

# Use an unverified context to avoid CERTIFICATE_VERIFY_FAILED
context = ssl._create_unverified_context()
req_obj = request.Request('https://movies.yahoo.com.tw/chart.html')
with request.urlopen(req_obj, context=context) as res_obj:
    resp = res_obj.read().decode('utf-8')

soup = BeautifulSoup(resp, 'html.parser')
# Every ranking row is a <div class="tr">; the first one is the header
rows = soup.find_all('div', class_='tr')
colname = list(rows.pop(0).stripped_strings)
contents = []
for row in rows:
    # The first <div class="td"> in the row is this week's rank
    thisweek_rank = row.find_next('div', attrs={'class': 'td'})
    updown = thisweek_rank.find_next('div')
    lastweek_rank = updown.find_next('div')
    # The no.1 movie's title sits in an <h2>, the others in <div class="rank_txt">
    if thisweek_rank.string == '1':
        movie_title = lastweek_rank.find_next('h2')
    else:
        movie_title = lastweek_rank.find_next('div', attrs={'class': 'rank_txt'})
    release_date = movie_title.find_next('div', attrs={'class': 'td'})
    trailer = release_date.find_next('div', attrs={'class': 'td'})
    # Not every movie has a trailer link
    if trailer.find('a') is None:
        trailer_address = ''
    else:
        trailer_address = trailer.find('a')['href']
    starts = row.find('h6', attrs={'class': 'count'})
    # New entries have no last-week rank
    lastweek_rank = lastweek_rank.string if lastweek_rank.string else ''
    c = [thisweek_rank.string, lastweek_rank, movie_title.string,
         release_date.string, trailer_address, starts.string]
    contents.append(c)
print(contents)
Run crawler.py:
> python3 crawler.py
[['1', '1', '返校', '2019-09-20', 'https://movies.yahoo.com.tw/video/%E8%BF%94%E6%A0%A1-400%E7%A7%92%E5%B8%B6%E4%BD%A0%E5%9B%9E%E9%A1%A7%E9%9B%BB%E5%BD%B1%E5%8E%9F%E5%9E%8B%E6%95%85%E4%BA%8B-xxy-111923492.html', '4.3'], ['2', '2', '天氣之子', '2019-09-12', 'https://movies.yahoo.com.tw/video/%E7%84%A1%E9%9B%B7%E5%BD%B1%E8%A9%95-%E5%A4%A9%E6%B0%A3%E4%B9%8B%E5%AD%90-%E8%A8%BB%E5%AE%9A%E8%A9%95%E5%83%B9%E5%85%A9%E6%A5%B5%E7%9A%84%E5%8B%95%E7%95%AB%E9%9B%BB%E5%BD%B1-xxy%E8%A9%95%E9%9B%BB%E5%BD%B1-030333793.html', '4.3'], ['3', '3', '星際救援', '2019-09-20', 'https://movies.yahoo.com.tw/video/%E6%98%9F%E9%9A%9B%E6%95%91%E6%8F%B4-%E8%AA%B0%E6%89%8D%E6%98%AF%E5%AE%8C%E7%BE%8E%E5%A4%AA%E7%A9%BA%E4%BA%BA-xxy%E8%A9%95%E9%9B%BB%E5%BD%B1-043512139.html', '3.8'], ['4', '', '青春豬頭少年不會夢到懷夢美少女', '2019-09-27', '', '4.5'], ['5', '', '無間行動', '2019-09-27', 'https://movies.yahoo.com.tw/video/%E7%84%A1%E9%96%93%E8%A1%8C%E5%8B%95-%E5%85%A8%E9%9D%A2%E9%80%83%E6%AE%BA%E7%89%88%E9%A0%90%E5%91%8A-025134973.html', '4.1'], ['6', '5', '全面攻佔3: 天使救援', '2019-08-21', 'https://movies.yahoo.com.tw/video/%E5%85%A8%E9%9D%A2%E6%94%BB%E4%BD%943-%E5%A4%A9%E4%BD%BF%E6%95%91%E6%8F%B4-%E8%8B%B1%E9%9B%84%E5%88%B0%E5%BA%95%E9%80%80%E4%B8%8D%E9%80%80%E5%A0%B4-xxy%E8%A9%95%E9%9B%BB%E5%BD%B1-034051084.html', '4.2'], ['7', '4', '牠 第二章', '2019-09-05', 'https://movies.yahoo.com.tw/video/%E7%89%A0-%E7%AC%AC%E4%BA%8C%E7%AB%A0-%E8%A7%A3%E6%9E%90-%E8%A2%AB%E7%BE%8E%E8%B2%8C%E8%A9%9B%E5%92%92%E7%9A%84%E8%B2%9D%E8%8A%99%E8%8E%89%E9%A6%AC%E8%A8%B1-160000560.html', '4'], ['8', '', '瞞天機密', '2019-09-27', 'https://movies.yahoo.com.tw/video/%E7%9E%9E%E5%A4%A9%E6%A9%9F%E5%AF%86-%E5%8B%87%E6%B0%A3%E7%89%88%E9%A0%90%E5%91%8A-084815060.html', '4.1'], ['9', '', '信用詐欺師JP', '2019-09-27', 'https://movies.yahoo.com.tw/video/%E4%BF%A1%E7%94%A8%E8%A9%90%E6%AC%BA%E5%B8%ABjp-%E4%B8%AD%E6%96%87%E9%A0%90%E5%91%8A-062304730.html', '4'], ['10', '', '囧媽的極地任務', '2019-09-27', 'https://movies.yahoo.com.tw/video/%E5%9B%A7%E5%AA%BD%E7%9A%84%E6%A5%B5%E5%9C%B0%E4%BB%BB%E5%8B%99-%E4%B8%AD%E6%96%87%E9%A0%90%E5%91%8A-025032372.html', '4.2'], ['11', '', '校外打怪教學', '2019-09-27', 'https://movies.yahoo.com.tw/video/%E6%A0%A1%E5%A4%96%E6%89%93%E6%80%AA%E6%95%99%E5%AD%B8-%E6%AD%A3%E5%BC%8F%E9%A0%90%E5%91%8A-062837459.html', '3.7'], ['12', '10', '普羅米亞', '2019-08-16', 'https://movies.yahoo.com.tw/video/%E6%99%AE%E7%BE%85%E7%B1%B3%E4%BA%9E-%E4%B8%AD%E6%96%87%E9%A0%90%E5%91%8A-144302686.html', '3.8'], ['13', '', '變身', '2019-09-27', 'https://movies.yahoo.com.tw/video/%E8%AE%8A%E8%BA%AB-%E4%B8%AD%E6%96%87%E9%A0%90%E5%91%8A-084131268.html', '3.8'], ['14', '', '笑笑羊大電影:外星人來了', '2019-09-27', 'https://movies.yahoo.com.tw/video/%E7%AC%91%E7%AC%91%E7%BE%8A%E5%A4%A7%E9%9B%BB%E5%BD%B1-%E5%A4%96%E6%98%9F%E4%BA%BA%E4%BE%86%E4%BA%86-%E4%B8%AD%E6%96%87%E9%85%8D%E9%9F%B3%E6%AD%A3%E5%BC%8F%E9%A0%90%E5%91%8A-030458730.html', '4'], ['15', '8', '唐頓莊園', '2019-09-20', 'https://movies.yahoo.com.tw/video/%E5%94%90%E9%A0%93%E8%8E%8A%E5%9C%92-%E5%9B%9E%E9%A1%A7%E7%AF%87-044725185.html', '4.1'], ['16', '7', '極限逃生', '2019-08-30', 'https://movies.yahoo.com.tw/video/%E6%A5%B5%E9%99%90%E9%80%83%E7%94%9F-%E4%B8%AD%E6%96%87%E9%A0%90%E5%91%8A-134635519.html', '4.1'], ['17', '6', '第九分局', '2019-08-29', 'https://movies.yahoo.com.tw/video/%E7%AC%AC%E4%B9%9D%E5%88%86%E5%B1%80-%E5%8B%95%E4%BD%9C-%E7%89%B9%E6%95%88%E8%88%87%E5%8C%96%E5%A6%9D%E7%AF%87-130453384.html', '3.9'], ['18', '', '雪地之光', '2019-09-27', 'https://movies.yahoo.com.tw/video/%E9%9B%AA%E5%9C%B0%E4%B9%8B%E5%85%89-%E6%AD%A3%E5%BC%8F%E9%A0%90%E5%91%8A-033605254.html', '3.6'], 
['19', '12', '殺手餐廳', '2019-09-20', 'https://movies.yahoo.com.tw/video/%E6%AE%BA%E6%89%8B%E9%A4%90%E5%BB%B3-%E8%9C%B7%E5%B7%9D%E5%AF%A6%E8%8A%B1%E5%B0%8E%E6%BC%94%E7%AF%87-065439673.html', '3.9'], ['20', '9', '好小男孩', '2019-09-12', 'https://movies.yahoo.com.tw/video/%E5%A5%BD%E5%B0%8F%E7%94%B7%E5%AD%A9-%E5%B9%95%E5%BE%8C%E8%8A%B1%E7%B5%AE%E7%AF%87-122756018.html', '3.5']]
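The colname list built from the header row holds the column labels, and each entry in contents lines up with them, so if dicts are more convenient than positional lists, a small sketch:
# Pair each row with the header labels captured in colname
records = [dict(zip(colname, c)) for c in contents]
print(records[0])
# e.g. {'本週': '1', '上週': '1', '片名': '返校', ...}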
As an exercise, let's go to the Holiday KTV (好樂迪) website and crawl the top-ranked songs. On the official site, use View Source to find the ranking's HTML block, shown below. Note that the first tr and the last tr in the table are not song-ranking content, so they will have to be filtered out later.
<table cellspacing="0" cellpadding="4" rules="all" border="1" id="ctl00_ContentPlaceHolder1_dgSong"
style="background-color:White;border-color:White;border-width:1px;border-style:solid;width:100%;border-collapse:collapse;">
<tbody>
<tr align="center" valign="middle" style="color:White;background-color:Black;">
<td>本週</td>
<td>上週</td>
<td>週數</td>
<td align="center" valign="middle">點歌<br>曲號
</td>
<td align="center" valign="middle" style="width:34%;">歌名</td>
<td align="center" valign="middle">歌手</td>
</tr>
<tr align="center" valign="middle" style="background-color:#EAEAEA;">
<td style="background-color:#666666;">
<font size="4">
<span id="ctl00_ContentPlaceHolder1_dgSong_ctl03_lbThisWeek"
style="color:White;font-family:Geneva,Arial,Helvetica,sans-serif;font-weight:bold;">1
</span>
</font>
</td>
<td>
<font size="4">
<span id="ctl00_ContentPlaceHolder1_dgSong_ctl03_lbLastWeek"
style="color:#333333;font-family:Geneva,Arial,Helvetica,sans-serif;font-weight:bold;">4
</span>
</font>
</td>
<td>
<font size="4">
<span id="ctl00_ContentPlaceHolder1_dgSong_ctl03_lbWeeks"
style="color:#999999;font-family:Geneva,Arial,Helvetica,sans-serif;font-weight:bold;">3
</span>
</font>
</td>
<td align="center" valign="middle" style="font-family:Geneva,Arial,Helvetica,sans-serif;font-weight:bold;">
26009
</td>
<td align="center" valign="middle">來個蹦蹦</td>
<td align="center" valign="middle">
<a href="#" onclick="javascript:GoSearch("玖壹壹.Ella(陳嘉樺) ");">
玖壹壹.Ella(陳嘉樺)
</a>
</td>
</tr>
<tr align="center" valign="middle" style="background-color:#CCCCCC;">
......
</tr>
<tr align="center" style="font-weight:bold;text-decoration:none;width:100%;">
<td colspan="6"><span>1</span> <a
href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgSong$ctl24$ctl03','')">2</a> <a
href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgSong$ctl24$ctl04','')">3
</a>
<a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgSong$ctl24$ctl01','')">下一頁</a>
</td>
</tr>
</tbody>
</table>
Create a crawler program, holiday.py. First fetch the page content, then build a BeautifulSoup object; through the APIs that object provides we can extract the values of the elements we need. find_all brings back every element matching the criteria, while find returns only the first match (a quick illustration follows below). The program also uses the pandas module; rather than introduce it here we'll just use it directly, and a separate post will cover it.
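A quick look at the find / find_all difference on a tiny snippet:
>>> from bs4 import BeautifulSoup
>>> s = BeautifulSoup('<ul><li>a</li><li>b</li></ul>', 'html.parser')
>>> s.find_all('li')
[<li>a</li>, <li>b</li>]
>>> s.find('li')
<li>a</li>
And here is holiday.py itself: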
import ssl
from urllib import request
from bs4 import BeautifulSoup
import pandas as pd

# Use the ssl module to avoid the CERTIFICATE_VERIFY_FAILED error
context = ssl._create_unverified_context()
# Build a Request for the Holiday KTV chart URL
req_obj = request.Request('https://www.holiday.com.tw/song/Billboard.aspx')
song_list = []
# Send the request
with request.urlopen(req_obj, context=context) as res_obj:
    # Read the response and decode it as utf-8
    resp = res_obj.read().decode('utf-8')
    # Parse with html.parser
    soup = BeautifulSoup(resp, 'html.parser')
    # Use find to locate the table whose id is ctl00_ContentPlaceHolder1_dgSong,
    # then return all the tr elements inside it
    rank_table = soup.find('table', id='ctl00_ContentPlaceHolder1_dgSong').find_all('tr')
    # Skip the header tr at the top and the trailing tr rows, hence [1:-2]
    for rt in rank_table[1:-2]:
        # Find all the td elements and take the fifth one (index 4): the song name
        song_name = rt.find_all('td')[4]
        # Find the first a tag, since only the singer is wrapped in an a tag
        singer = rt.find('a')
        # Convert song and singer to strings, strip surrounding whitespace,
        # and append them to song_list
        song_list.append([song_name.string.strip(), singer.string.strip()])
# Convert song_list into a pandas DataFrame for later data analysis
df = pd.DataFrame(song_list, columns=['song', 'singer'])
print(df)
Execution result:
> python3 holiday.py
song singer
0 來個蹦蹦 玖壹壹.Ella(陳嘉樺)
1 過客 莊心妍
2 I Go 周湯豪
3 走心 賀敬軒
4 多想留在你身邊 劉增瞳
5 終於了解自由 周興哲
6 沒有你陪伴真的好孤單 夢然
7 此刻你聽好了 劉嘉亮
8 說一句我不走了 林芯儀
9 Be Alright 高爾宣OSN
10 可不可以 季彥霖
11 至少我還記得 周興哲
12 預謀 許佳慧
13 知否知否 胡夏.郁可唯
14 太空人 吳青峰
15 重感情的廢物 TRASH
16 何妨 家家.茄子蛋
17 太空 吳青峰
18 兩秒終 周湯豪
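With the results in a DataFrame it's easy to save them for the later analysis post, e.g. (a minimal sketch; the filename is just an example):
# Write the chart out as CSV for later analysis (hypothetical filename)
df.to_csv('holiday_rank.csv', index=False)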