Day 4: A Crash Course in Web Scraping, Lesson 2: Data Parsing

2022 iThome Ironman Contest

DAY 4

Modern Web

Part 4 of the series "Applications of mitmproxy in Web Scraping"

Tags: 14th Ironman Contest, web scraping

Yotsuba

2022-09-19 07:48:28


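Data Parsing

Assuming your request completed without errors, parsing the data is what you do once the response arrives.

Data parsing is a basic need for any scraper, because a response rarely consists solely of the data you want.

HTML

HTML is the most common kind of response data. We usually rely on an HTML parser to process it for us.

Python has a powerful library for this called Beautiful Soup.

It can also parse XML. To be honest, though, I have never needed to parse XML while scraping websites ...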

Beautiful Soup, so rich and green,

Waiting in a hot tureen!

Who for such dainties would not stoop?

Soup of the evening, beautiful Soup!

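Beautiful Soup is named after a poem in Alice's Adventures in Wonderland; in the story, the poem is sung by the Mock Turtle.

(The name is a pun on Mock Turtle Soup, a Victorian-era dish made from cow rather than turtle.)

Like Wonderland itself, Beautiful Soup tries to make sense of the nonsensical: it cleans up the mess of bad HTML and produces Python objects that represent an XML structure.

Quoted from the book 《網站擷取：使用Python》.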
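The repair behaviour described in the quote can be seen in a minimal sketch. It assumes the built-in html.parser backend; the broken_html string is a made-up input, and other backends (lxml, html5lib) may repair the same markup differently.

```python
from bs4 import BeautifulSoup

# Hypothetical input: neither <p> nor <b> is ever closed.
broken_html = "<p>Unclosed paragraph<b>bold text"

soup = BeautifulSoup(broken_html, "html.parser")

# Beautiful Soup still builds a tree, so the nested node is reachable.
bold_text = soup.p.b.string

# Serializing the tree back to a string emits the missing closing tags.
repaired = str(soup)

print(bold_text)
print(repaired)
```

The point is that you can navigate and serialize the document even though the markup was never valid to begin with.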

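The official Beautiful Soup example

Beautiful Soup has too many features for me to cover them all; for the details, please read the official documentation.

The code below does not need a network connection. Imagine that the response to your request is the content of html_doc.

We pass the HTML to Beautiful Soup and get back a bowl of soup. From then on, operating on soup is equivalent to operating on the whole HTML document.

For example, we can pull out the page's title node, or iterate over every a node and print it.

The way to think about it: which node does the data we care about belong to, and which attributes does that node carry?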

from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.title)

for a in soup.find_all('a'):
    print(a)

# Output :
#
# <title>The Dormouse's story</title>
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
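Usually you want the values inside a node rather than the node itself. The sketch below shows one way to answer the "which node, which attributes" question. It uses a trimmed copy of the sample document so it runs on its own; the variable names title_text, links, and second are only illustrative.

```python
from bs4 import BeautifulSoup

# A trimmed copy of the sample document, so this sketch is self-contained.
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Text inside the <title> node, without the surrounding tags.
title_text = soup.title.string

# Filter by node name and class, then read the href attribute and the text.
links = [(a["href"], a.get_text()) for a in soup.find_all("a", class_="sister")]

# Look up a single node by its id attribute.
second = soup.find(id="link2")

print(title_text)
for href, text in links:
    print(text, href)
print(second["href"])
```

Note that the keyword is class_ with a trailing underscore, because class is a reserved word in Python.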

JSON

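JSON is another very common data format, typically used in a website's AJAX requests (later chapters cover AJAX in detail).

Here we borrow a free API from httpbin.

The site offers all kinds of APIs; I picked one that returns JSON data.

The code below quickly demonstrates how to use Python's json library.

You will see that json.loads turns the string into a Python dictionary (the top-level value here is a JSON object).

json.dumps turns the dictionary back into a string; the sort_keys and indent parameters sort the keys and pretty-print the output with the chosen indentation.

ensure_ascii = False tells json.dumps to write non-ASCII characters as they are instead of escaping them as \uXXXX sequences, so Chinese text stays readable.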

import json
import requests

response = requests.get('https://httpbin.org/json')

data = json.loads(response.text)

print(type(data))

print(json.dumps(data, sort_keys = True, indent = 4, ensure_ascii = False))

# Output :
#
# <class 'dict'>
# {
#     "slideshow": {
#         "author": "Yours Truly",
#         "date": "date of publication",
#         "slides": [
#             {
#                 "title": "Wake up to WonderWidgets!",
#                 "type": "all"
#             },
#             {
#                 "items": [
#                     "Why <em>WonderWidgets</em> are great",
#                     "Who <em>buys</em> WonderWidgets"
#                 ],
#                 "title": "Overview",
#                 "type": "all"
#             }
#         ],
#         "title": "Sample Slide Show"
#     }
# }
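The httpbin response contains only ASCII text, so the effect of ensure_ascii is not visible there. The following offline sketch isolates each json.dumps parameter; the raw string is a made-up payload, not something httpbin returns.

```python
import json

# Hypothetical payload: a non-ASCII value, keys not in alphabetical order.
raw = '{"name": "爬蟲", "day": 4}'

data = json.loads(raw)  # a JSON object becomes a Python dict

# Default settings: non-ASCII characters are escaped as \uXXXX sequences.
escaped = json.dumps(data)

# ensure_ascii=False keeps the characters readable; sort_keys and indent
# reorder the keys alphabetically and pretty-print the result.
readable = json.dumps(data, ensure_ascii=False, sort_keys=True, indent=4)

print(escaped)
print(readable)
```

Both strings decode back to the same dictionary; the parameters only change how the text is presented.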

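The httpbin snippet above can be made more concise.

If you can guarantee that a response fetched with the requests library will decode as JSON without errors, you can call its json() method directly.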

import requests

response = requests.get('https://httpbin.org/json')

print(response.json())
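When that guarantee does not hold (for example, the server returns an HTML error page), decoding fails with an exception. Below is a minimal sketch using only the standard library; parse_json_or_none is a hypothetical helper name, not part of requests or json. To my knowledge the error raised by response.json() is also a ValueError subclass, so the same except clause should apply there, but check this against your installed requests version.

```python
import json


def parse_json_or_none(text):
    """Return the decoded JSON value, or None if text is not valid JSON."""
    try:
        return json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None


# A well-formed body decodes normally.
print(parse_json_or_none('{"ok": true}'))

# An HTML error page is not JSON, so the helper returns None instead of raising.
print(parse_json_or_none("<html>503 Service Unavailable</html>"))
```

One limitation of this design: a body consisting of the JSON literal null also decodes to None, so use a different sentinel if you need to tell the two cases apart.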

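Python's json.tool module can pretty-print JSON in much the same way as the first snippet. Unlike that snippet, it keeps the keys in their original order unless you pass --sort-keys.

I often use this trick to inspect the result when curl returns a JSON response. The --no-ensure-ascii option requires Python 3.9 or later.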

$ curl -s https://httpbin.org/json | python3 -m json.tool --no-ensure-ascii

