第 11 屆 iThome 鐵人賽, DAY 26
Software Development

python 自學 series, part 26
python day26(crawler)

Today we'll try some simple web crawling with Python's Beautiful Soup module.

Install beautifulsoup4:

pip3 install beautifulsoup4
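
To confirm the install worked, importing the module and printing its version is a quick sanity check (not part of the original steps):

python3 -c "import bs4; print(bs4.__version__)"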

The goal is the movie ranking content on the Yahoo Movies chart page (yahoo電影排行榜). First we have to fetch the page itself, and the ssl module is needed on top of urllib, otherwise the request fails with a CERTIFICATE_VERIFY_FAILED error.

>>> import ssl
>>> from urllib import request
>>> context = ssl._create_unverified_context()
>>> req_obj = request.Request('https://movies.yahoo.com.tw/chart.html')
>>> with request.urlopen(req_obj, context=context) as res_obj:
...     print(res_obj.read())
...
b'<!DOCTYPE html>\n<html lang="en">\n<head>\n  <meta charset="UTF-8">\n  <meta name="viewport" content="width=device-width, initial-scale=1, user-minimum-scale=1, maximum-scale=1">\n  <meta http-equiv="content-type" content="text/html; charset=utf-8">\n  <meta property="fb:app_id" content="501887343352051">\n  <meta property="og:site_name" content="Yahoo\xe5\xa5\x87\xe6\x91\xa9\xe9\x9b\xbb\xe5\xbd\xb1">\n    <title>\xe5\x8f\xb0\xe5\x8c\x97\xe7\xa5\xa8\xe6\x88\xbf\xe6\xa6\x9c
......

Parse the fetched page with html.parser; soup.prettify() prints the parsed document in a readable, indented form.

>>> from bs4 import BeautifulSoup
>>> with request.urlopen(req_obj,context=context) as res_obj:
...  resp = res_obj.read().decode('utf-8')
...  soup = BeautifulSoup(resp , 'html.parser')
...  print(soup.prettify())
...
<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1, user-minimum-scale=1, maximum-scale=1" name="viewport"/>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="501887343352051" property="fb:app_id"/>
  <meta content="Yahoo奇摩電影" property="og:site_name"/>
  <title>
   台北票房榜 - Yahoo奇摩電影
  </title>
  ...

Next, locate the part of the page that holds the content we want. The movie ranking block is wrapped in <div class="rank_list table rankstyle1"> (a scoped lookup is sketched after the markup below).

<div class="rank_list table rankstyle1">
    <div class="tr top">
      <div class="td">本週</div>
      <div class="td updown"></div>
      <div class="td">上週</div>
      <div class="td">片名</div>
      <div class="td">上映日期</div>
      <div class="td">預告片</div>
      <div class="td">網友滿意度</div>
    </div>
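
With that wrapper identified, the search can be scoped to it instead of the whole page. A minimal sketch reusing the soup object from above (the full script below simply calls find_all('div', class_='tr') on the whole document, which works as long as only this table uses that class):

rank_list = soup.find('div', class_='rank_list')
for row in rank_list.find_all('div', class_='tr'):
    print(list(row.stripped_strings))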

The complete crawler program:

import ssl
from urllib import request
from bs4 import BeautifulSoup

# use an unverified SSL context to avoid CERTIFICATE_VERIFY_FAILED
context = ssl._create_unverified_context()
req_obj = request.Request('https://movies.yahoo.com.tw/chart.html')
with request.urlopen(req_obj, context=context) as res_obj:
    resp = res_obj.read().decode('utf-8')
    soup = BeautifulSoup(resp, 'html.parser')
    # every row of the ranking table is a <div class="tr">
    rows = soup.find_all('div', class_='tr')

    # the first row holds the column names (本週, 上週, 片名, ...)
    colname = list(rows.pop(0).stripped_strings)
    contents = []
    for row in rows:
        # the cells sit one after another, so walk them with find_next
        thisweek_rank = row.find_next('div', attrs={'class': 'td'})
        updown = thisweek_rank.find_next('div')
        lastweek_rank = updown.find_next('div')

        # the no.1 movie's title sits in an <h2>; the rest use div.rank_txt
        if thisweek_rank.string == '1':
            movie_title = lastweek_rank.find_next('h2')
        else:
            movie_title = lastweek_rank.find_next('div', attrs={'class': 'rank_txt'})

        release_date = movie_title.find_next('div', attrs={'class': 'td'})
        trailer = release_date.find_next('div', attrs={'class': 'td'})

        # not every movie has a trailer link
        if trailer.find('a') is None:
            trailer_address = ''
        else:
            trailer_address = trailer.find('a')['href']

        # the satisfaction score (網友滿意度) lives in <h6 class="count">
        stars = row.find('h6', attrs={'class': 'count'})

        # movies new to the chart have no last-week rank
        lastweek_rank = lastweek_rank.string if lastweek_rank.string else ''

        c = [thisweek_rank.string, lastweek_rank, movie_title.string,
             release_date.string, trailer_address, stars.string]
        contents.append(c)

print(contents)

Run crawler.py:

> python3 crawler.py
[['1', '1', '返校', '2019-09-20', 'https://movies.yahoo.com.tw/video/%E8%BF%94%E6%A0%A1-400%E7%A7%92%E5%B8%B6%E4%BD%A0%E5%9B%9E%E9%A1%A7%E9%9B%BB%E5%BD%B1%E5%8E%9F%E5%9E%8B%E6%95%85%E4%BA%8B-xxy-111923492.html', '4.3'], ['2', '2', '天氣之子', '2019-09-12', 'https://movies.yahoo.com.tw/video/%E7%84%A1%E9%9B%B7%E5%BD%B1%E8%A9%95-%E5%A4%A9%E6%B0%A3%E4%B9%8B%E5%AD%90-%E8%A8%BB%E5%AE%9A%E8%A9%95%E5%83%B9%E5%85%A9%E6%A5%B5%E7%9A%84%E5%8B%95%E7%95%AB%E9%9B%BB%E5%BD%B1-xxy%E8%A9%95%E9%9B%BB%E5%BD%B1-030333793.html', '4.3'], ['3', '3', '星際救援', '2019-09-20', 'https://movies.yahoo.com.tw/video/%E6%98%9F%E9%9A%9B%E6%95%91%E6%8F%B4-%E8%AA%B0%E6%89%8D%E6%98%AF%E5%AE%8C%E7%BE%8E%E5%A4%AA%E7%A9%BA%E4%BA%BA-xxy%E8%A9%95%E9%9B%BB%E5%BD%B1-043512139.html', '3.8'], ['4', '', '青春豬頭少年不會夢到懷夢美少女', '2019-09-27', '', '4.5'], ['5', '', '無間行動', '2019-09-27', 'https://movies.yahoo.com.tw/video/%E7%84%A1%E9%96%93%E8%A1%8C%E5%8B%95-%E5%85%A8%E9%9D%A2%E9%80%83%E6%AE%BA%E7%89%88%E9%A0%90%E5%91%8A-025134973.html', '4.1'], ['6', '5', '全面攻佔3: 天使救援', '2019-08-21', 'https://movies.yahoo.com.tw/video/%E5%85%A8%E9%9D%A2%E6%94%BB%E4%BD%943-%E5%A4%A9%E4%BD%BF%E6%95%91%E6%8F%B4-%E8%8B%B1%E9%9B%84%E5%88%B0%E5%BA%95%E9%80%80%E4%B8%8D%E9%80%80%E5%A0%B4-xxy%E8%A9%95%E9%9B%BB%E5%BD%B1-034051084.html', '4.2'], ['7', '4', '牠 第二章', '2019-09-05', 'https://movies.yahoo.com.tw/video/%E7%89%A0-%E7%AC%AC%E4%BA%8C%E7%AB%A0-%E8%A7%A3%E6%9E%90-%E8%A2%AB%E7%BE%8E%E8%B2%8C%E8%A9%9B%E5%92%92%E7%9A%84%E8%B2%9D%E8%8A%99%E8%8E%89%E9%A6%AC%E8%A8%B1-160000560.html', '4'], ['8', '', '瞞天機密', '2019-09-27', 'https://movies.yahoo.com.tw/video/%E7%9E%9E%E5%A4%A9%E6%A9%9F%E5%AF%86-%E5%8B%87%E6%B0%A3%E7%89%88%E9%A0%90%E5%91%8A-084815060.html', '4.1'], ['9', '', '信用詐欺師JP', '2019-09-27', 'https://movies.yahoo.com.tw/video/%E4%BF%A1%E7%94%A8%E8%A9%90%E6%AC%BA%E5%B8%ABjp-%E4%B8%AD%E6%96%87%E9%A0%90%E5%91%8A-062304730.html', '4'], ['10', '', '囧媽的極地任務', '2019-09-27', 'https://movies.yahoo.com.tw/video/%E5%9B%A7%E5%AA%BD%E7%9A%84%E6%A5%B5%E5%9C%B0%E4%BB%BB%E5%8B%99-%E4%B8%AD%E6%96%87%E9%A0%90%E5%91%8A-025032372.html', '4.2'], ['11', '', '校外打怪教學', '2019-09-27', 'https://movies.yahoo.com.tw/video/%E6%A0%A1%E5%A4%96%E6%89%93%E6%80%AA%E6%95%99%E5%AD%B8-%E6%AD%A3%E5%BC%8F%E9%A0%90%E5%91%8A-062837459.html', '3.7'], ['12', '10', '普羅米亞', '2019-08-16', 'https://movies.yahoo.com.tw/video/%E6%99%AE%E7%BE%85%E7%B1%B3%E4%BA%9E-%E4%B8%AD%E6%96%87%E9%A0%90%E5%91%8A-144302686.html', '3.8'], ['13', '', '變身', '2019-09-27', 'https://movies.yahoo.com.tw/video/%E8%AE%8A%E8%BA%AB-%E4%B8%AD%E6%96%87%E9%A0%90%E5%91%8A-084131268.html', '3.8'], ['14', '', '笑笑羊大電影:外星人來了', '2019-09-27', 'https://movies.yahoo.com.tw/video/%E7%AC%91%E7%AC%91%E7%BE%8A%E5%A4%A7%E9%9B%BB%E5%BD%B1-%E5%A4%96%E6%98%9F%E4%BA%BA%E4%BE%86%E4%BA%86-%E4%B8%AD%E6%96%87%E9%85%8D%E9%9F%B3%E6%AD%A3%E5%BC%8F%E9%A0%90%E5%91%8A-030458730.html', '4'], ['15', '8', '唐頓莊園', '2019-09-20', 'https://movies.yahoo.com.tw/video/%E5%94%90%E9%A0%93%E8%8E%8A%E5%9C%92-%E5%9B%9E%E9%A1%A7%E7%AF%87-044725185.html', '4.1'], ['16', '7', '極限逃生', '2019-08-30', 'https://movies.yahoo.com.tw/video/%E6%A5%B5%E9%99%90%E9%80%83%E7%94%9F-%E4%B8%AD%E6%96%87%E9%A0%90%E5%91%8A-134635519.html', '4.1'], ['17', '6', '第九分局', '2019-08-29', 'https://movies.yahoo.com.tw/video/%E7%AC%AC%E4%B9%9D%E5%88%86%E5%B1%80-%E5%8B%95%E4%BD%9C-%E7%89%B9%E6%95%88%E8%88%87%E5%8C%96%E5%A6%9D%E7%AF%87-130453384.html', '3.9'], ['18', '', '雪地之光', '2019-09-27', 'https://movies.yahoo.com.tw/video/%E9%9B%AA%E5%9C%B0%E4%B9%8B%E5%85%89-%E6%AD%A3%E5%BC%8F%E9%A0%90%E5%91%8A-033605254.html', '3.6'], 
['19', '12', '殺手餐廳', '2019-09-20', 'https://movies.yahoo.com.tw/video/%E6%AE%BA%E6%89%8B%E9%A4%90%E5%BB%B3-%E8%9C%B7%E5%B7%9D%E5%AF%A6%E8%8A%B1%E5%B0%8E%E6%BC%94%E7%AF%87-065439673.html', '3.9'], ['20', '9', '好小男孩', '2019-09-12', 'https://movies.yahoo.com.tw/video/%E5%A5%BD%E5%B0%8F%E7%94%B7%E5%AD%A9-%E5%B9%95%E5%BE%8C%E8%8A%B1%E7%B5%AE%E7%AF%87-122756018.html', '3.5']]
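
To keep the scraped ranking around for later use, the rows can be written straight to a CSV file. A small sketch (not part of the original script; the file name is arbitrary) reusing colname and contents from crawler.py:

import csv

with open('yahoo_movie_chart.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(colname)    # header row parsed from <div class="tr top">
    writer.writerows(contents)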

crawler holiday

As practice, let's crawl the top-ranked songs from the 好樂迪KTV (Holiday KTV) website. Viewing the page source on the official site, the ranking HTML block looks like the following. Notice that the first tr and the trailing tr rows of the table are not song entries, so they will have to be filtered out later (see the quick probe after the markup below).

<table cellspacing="0" cellpadding="4" rules="all" border="1" id="ctl00_ContentPlaceHolder1_dgSong"
       style="background-color:White;border-color:White;border-width:1px;border-style:solid;width:100%;border-collapse:collapse;">
    <tbody>
        <tr align="center" valign="middle" style="color:White;background-color:Black;">
            <td>本週</td>
            <td>上週</td>
            <td>週數</td>
            <td align="center" valign="middle">點歌<br>曲號
            </td>
            <td align="center" valign="middle" style="width:34%;">歌名</td>
            <td align="center" valign="middle">歌手</td>
        </tr>
        <tr align="center" valign="middle" style="background-color:#EAEAEA;">
            <td style="background-color:#666666;">
                <font size="4">
                    <span id="ctl00_ContentPlaceHolder1_dgSong_ctl03_lbThisWeek"
                          style="color:White;font-family:Geneva,Arial,Helvetica,sans-serif;font-weight:bold;">1
                    </span>
                </font>
            </td>
            <td>
                <font size="4">
                    <span id="ctl00_ContentPlaceHolder1_dgSong_ctl03_lbLastWeek"
                          style="color:#333333;font-family:Geneva,Arial,Helvetica,sans-serif;font-weight:bold;">4
                    </span>
                </font>
            </td>
            <td>
                <font size="4">
                    <span id="ctl00_ContentPlaceHolder1_dgSong_ctl03_lbWeeks"
                          style="color:#999999;font-family:Geneva,Arial,Helvetica,sans-serif;font-weight:bold;">3
                    </span>
                </font>
            </td>
            <td align="center" valign="middle" style="font-family:Geneva,Arial,Helvetica,sans-serif;font-weight:bold;">
                26009
            </td>
            <td align="center" valign="middle">來個蹦蹦</td>
            <td align="center" valign="middle">


                <a href="#" onclick="javascript:GoSearch(&quot;玖壹壹.Ella(陳嘉樺)                    &quot;);">
                    玖壹壹.Ella(陳嘉樺)
                </a>
            </td>
        </tr>
        <tr align="center" valign="middle" style="background-color:#CCCCCC;">
            ......
        </tr>
        
        <tr align="center" style="font-weight:bold;text-decoration:none;width:100%;">
            <td colspan="6"><span>1</span>&nbsp;<a
                    href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgSong$ctl24$ctl03','')">2</a>&nbsp;<a
                    href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgSong$ctl24$ctl04','')">3
            </a>
                <a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$dgSong$ctl24$ctl01','')">下一頁</a>
            </td>
        </tr>
    </tbody>
</table>
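
Before writing the loop, it helps to confirm which rows to drop. A quick probe in the REPL (assuming soup was built from the Holiday page the same way as before):

rows = soup.find('table', id='ctl00_ContentPlaceHolder1_dgSong').find_all('tr')
print(len(rows))                         # header + song rows + pager
print(list(rows[0].stripped_strings))    # ['本週', '上週', '週數', ...] -- the header row
print(list(rows[-1].stripped_strings))   # the pager links, not a song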

Create a crawler script holiday.py. First fetch the page content and build a BeautifulSoup object; through the API that object provides, we can pull out the values of the elements we need. find_all returns every match, while find returns only the first one. The script also uses the pandas module; it is used here without explanation and will get its own introduction in a later post.
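
As a quick illustration of that find vs. find_all difference on a throwaway document:

>>> from bs4 import BeautifulSoup
>>> demo = BeautifulSoup('<ul><li>a</li><li>b</li></ul>', 'html.parser')
>>> demo.find('li').string                       # only the first match
'a'
>>> [li.string for li in demo.find_all('li')]    # every match
['a', 'b']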

import ssl
from urllib import request
from bs4 import BeautifulSoup
import pandas as pd

# use an unverified SSL context to avoid CERTIFICATE_VERIFY_FAILED
context = ssl._create_unverified_context()
# build a Request for the Holiday KTV chart page
req_obj = request.Request('https://www.holiday.com.tw/song/Billboard.aspx')

song_list = []
# send the request
with request.urlopen(req_obj, context=context) as res_obj:
    # read the response and decode it as utf-8
    resp = res_obj.read().decode('utf-8')
    # parse with html.parser
    soup = BeautifulSoup(resp, 'html.parser')
    # find the <table> whose id is ctl00_ContentPlaceHolder1_dgSong
    # and collect every <tr> inside it
    rank_table = soup.find('table', id='ctl00_ContentPlaceHolder1_dgSong').find_all('tr')

    # skip the header row at the top and the pager rows at the bottom,
    # hence the [1:-2] slice
    for rt in rank_table[1:-2]:
        # the song name is in the 5th <td> (index 4)
        song_name = rt.find_all('td')[4]
        # only the singer is wrapped in an <a> tag, so take the first one
        singer = rt.find('a')
        # strip the surrounding whitespace and append the pair to song_list
        song_list.append([song_name.string.strip(), singer.string.strip()])

# turn song_list into a pandas DataFrame for later analysis
df = pd.DataFrame(song_list, columns=['song', 'singer'])
print(df)

Output:

> python3 holiday.py
          song         singer
0         來個蹦蹦  玖壹壹.Ella(陳嘉樺)
1           過客            莊心妍
2         I Go            周湯豪
3           走心            賀敬軒
4      多想留在你身邊            劉增瞳
5       終於了解自由            周興哲
6   沒有你陪伴真的好孤單             夢然
7       此刻你聽好了            劉嘉亮
8      說一句我不走了            林芯儀
9   Be Alright         高爾宣OSN
10        可不可以            季彥霖
11      至少我還記得            周興哲
12          預謀            許佳慧
13        知否知否         胡夏.郁可唯
14         太空人            吳青峰
15      重感情的廢物          TRASH
16          何妨         家家.茄子蛋
17          太空            吳青峰
18         兩秒終            周湯豪
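
With the data in a DataFrame, simple follow-ups become one-liners. A small sketch of what that enables (pandas is covered in a later post; the CSV file name is just an example):

print(df.head(3))                                # first three songs
print(df[df['singer'] == '周興哲'])               # every song by one singer
df.to_csv('holiday_top_songs.csv', index=False)  # save for later analysis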

Previous: python day25(flask)
Next: python day27(pytest)
Series: python 自學 (30 articles)