Day 7 Crash Course in Web Crawling, Lesson 5: Data Storage

2022 iThome Ironman Contest (14th Ironman Contest), Modern Web

Part 7 of the series "Various Applications of mitmproxy in Web Crawling"

Tags: web crawler

Yotsuba

2022-09-22 13:36:48

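Data storage

You usually need to store data when the crawler acts as a downloader and saves the contents of binary files.

Or the data you scraped is not needed right away, so you save it for later.

Fortunately, storing data is not a hard problem. This chapter quickly covers how to store each kind of data.

Binary

The code below requests an image and opens the output file in binary write mode ( wb ).

The image is the terminal screenshot from the previous article.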

import requests


response = requests.get('https://d1dwq032kyr03c.cloudfront.net/upload/images/20220921/20150913r5VCCkBLnC.png')

with open('image.png', 'wb') as f:
    f.write(response.content)

The same job can also be done with wget alone.

$ wget "https://d1dwq032kyr03c.cloudfront.net/upload/images/20220921/20150913r5VCCkBLnC.png"
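A download can succeed at the HTTP level and still not be the file you expected, for example when the server answers with an HTML error page. The following is a minimal sketch, not part of the original code, that checks whether a saved file starts with the 8-byte PNG signature. The helper name looks_like_png and the hand-made sample files are assumptions made for illustration, so that the sketch runs without any network access.

```python
import os
import tempfile

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def looks_like_png(path):
    """Return True if the file at `path` begins with the PNG signature."""
    with open(path, "rb") as f:  # 'rb' = read binary, the mirror of 'wb'
        return f.read(len(PNG_SIGNATURE)) == PNG_SIGNATURE


# Demonstration with hand-made bytes instead of a real download.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.png")
    bad = os.path.join(tmp, "bad.png")

    with open(good, "wb") as f:
        f.write(PNG_SIGNATURE + b"rest of the image data")
    with open(bad, "wb") as f:
        f.write(b"<html>403 Forbidden</html>")

    good_is_png = looks_like_png(good)
    bad_is_png = looks_like_png(bad)

print(good_is_png, bad_is_png)
```

In a real crawler you would call the helper on image.png right after writing response.content to it.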

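What is response, and why use response.content ?

Suppose we change the code to the type-printing version shown below.

It shows that response is an instance of the requests.models.Response class from the Python requests library.

response.text is a str: the response body decoded as text. For a request to a web page, that string is the HTML.

Of course, if you know that what you requested is a data format such as JSON or XML, then the string is JSON or XML. Don't doubt yourself.

response.content is, as the name suggests, the content of the response: the raw body as the server sent it.

You will find that its type is bytes. That is why we use response.content when saving a binary file: an image is not text, so it cannot be represented faithfully as a string.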

import requests


response = requests.get('https://d1dwq032kyr03c.cloudfront.net/upload/images/20220921/20150913r5VCCkBLnC.png')

print(type(response))
print(type(response.text))
print(type(response.content))

# Output : 
#
# <class 'requests.models.Response'>
# <class 'str'>
# <class 'bytes'>
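To make the difference between bytes and str concrete, here is a small offline sketch. It does not use requests; it only shows that bytes holding text decode cleanly while the first bytes of a PNG do not. The helper name try_decode and the sample data are assumptions for illustration. As far as I know, response.text itself does not raise on undecodable bytes but replaces them, which is another reason the resulting string is useless for an image.

```python
def try_decode(data, encoding="utf-8"):
    """Return the decoded string, or None if the bytes are not valid text."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Bytes such as a server might send for an HTML page.
html_bytes = "<h1>資料儲存</h1>".encode("utf-8")
# The first 8 bytes of every PNG file.
png_bytes = b"\x89PNG\r\n\x1a\n"

decoded_html = try_decode(html_bytes)
decoded_png = try_decode(png_bytes)

print(type(html_bytes), type(decoded_html))
print(decoded_png)
```

The HTML bytes round-trip to the original string, while strict decoding of the PNG bytes fails, so the only lossless way to store them is as bytes.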

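HTML

If you want to save the HTML of my previous article, you can write it like this.

Yes, believe your eyes: saving HTML really is that simple. As I just said, response.text is the HTML.

I added a User-Agent header in this code, because without it iT 邦幫忙 rejects the request with 403 Forbidden.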

import requests


response = requests.get(
    url     = 'https://ithelp.ithome.com.tw/articles/10295353',
    headers = {'User-Agent' : 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:96.0) Gecko/20100101 Firefox/96.0'}
)

with open('example.html', 'w', encoding='utf-8') as f:
    f.write(response.text)

There is an even lazier approach: change the code to the following.

import requests


response = requests.get(
    url     = 'https://ithelp.ithome.com.tw/articles/10295353',
    headers = {'User-Agent' : 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:96.0) Gecko/20100101 Firefox/96.0'}
)

print(response.text)

Then, when you run it, redirect stdout into example.html and you get the same result.

$ python3 test.py > example.html

The laziest approach is to skip writing Python code altogether. Any HTTP client will do, right?

$ curl -A "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:96.0) Gecko/20100101 Firefox/96.0" \
          "https://ithelp.ithome.com.tw/articles/10295353" > example.html

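The pitfall of typing URLs in the terminal

Let's experiment with an HTTPBin API endpoint.

You will see that the response is a JSON structure and that args is empty.

origin is my IP address, so I masked it.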

$ curl https://httpbin.org/get

# Output :
#
# {
#     "args": {},
#     "headers": {
#         "Accept": "*/*",
#         "Host": "httpbin.org",
#         "User-Agent": "curl/7.74.0",
#         "X-Amzn-Trace-Id": "Root=1-632bda12-452356e569ca6c955e258d23"
#     },
#     "origin": "xxx.xxx.xxx.xxx",
#     "url": "https://httpbin.org/get"
# }

Add a query string parameter and it becomes clear that args records exactly those parameters.

$ curl https://httpbin.org/get?username=username

# Output :
#
# {
#     "args": {
#         "username": "username"
#     },
#     "headers": {
#         "Accept": "*/*",
#         "Host": "httpbin.org",
#         "User-Agent": "curl/7.74.0",
#         "X-Amzn-Trace-Id": "Root=1-632bdb70-3e93e859301bdc3a7c31731d"
#     },
#     "origin": "xxx.xxx.xxx.xxx",
#     "url": "https://httpbin.org/get?username=username"
# }

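With two or more query string parameters, the output looks odd and the terminal seems to hang.

The catch is that, to the shell, & means "run the preceding command in the background", which is why the terminal looks stuck. The 739609 in the output is the process ID of that background job.

This is also why password=password was not treated as part of the URL: the shell cut the command at & and handled password=password as a separate command, a variable assignment.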

$ curl https://httpbin.org/get?username=username&password=password

# Output :
#
# [1] 739609
#
# {
#     "args": {
#         "username": "username"
#     },
#     "headers": {
#         "Accept": "*/*",
#         "Host": "httpbin.org",
#         "User-Agent": "curl/7.74.0",
#         "X-Amzn-Trace-Id": "Root=1-632bdd77-3211a0cb3725b0b21a02d484"
#     },
#     "origin": "xxx.xxx.xxx.xxx",
#     "url": "https://httpbin.org/get?username=username"
# }

The fix is simple: put quotes around any URL that contains query string parameters.

$ curl "https://httpbin.org/get?username=username&password=password"

# Output :
#
# {
#     "args": {
#         "password": "password",
#         "username": "username"
#     },
#     "headers": {
#         "Accept": "*/*",
#         "Host": "httpbin.org",
#         "User-Agent": "curl/7.74.0",
#         "X-Amzn-Trace-Id": "Root=1-632be109-29bc0ad3365848ee4abdcc53"
#     },
#     "origin": "xxx.xxx.xxx.xxx",
#     "url": "https://httpbin.org/get?username=username&password=password"
# }
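If the curl command is generated from Python rather than typed by hand, the same quoting can be done programmatically. The sketch below uses only the standard library: urlencode builds the query string and shlex.quote makes the URL safe to paste into a shell. The variable names are assumptions for illustration. When sending the request with requests itself, the params argument of requests.get handles the query string and no shell is involved at all.

```python
import shlex
from urllib.parse import parse_qs, urlencode, urlsplit

base_url = "https://httpbin.org/get"
params = {"username": "username", "password": "password"}

# urlencode joins the pairs with '&' and percent-encodes unsafe characters.
url = f"{base_url}?{urlencode(params)}"

# shlex.quote wraps the string in single quotes when it contains
# characters the shell would interpret, such as '?' and '&'.
command = f"curl {shlex.quote(url)}"

# Parsing the query string back shows that both parameters are present.
args = parse_qs(urlsplit(url).query)

print(url)
print(command)
print(args)
```

The printed command wraps the whole URL in single quotes, which protects & in the same way as the double quotes used above.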

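A handy use for saved HTML

If you next need Beautiful Soup to process the HTML, your opening move probably looks like the code below.

You will probably also write some print calls to check, again and again, that you are grabbing the nodes you want.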

import requests
from bs4 import BeautifulSoup


response = requests.get(
    url     = 'https://ithelp.ithome.com.tw/articles/10295353',
    headers = {'User-Agent' : 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:96.0) Gecko/20100101 Firefox/96.0'}
)

soup = BeautifulSoup(response.text, 'html.parser')

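No matter how fast the network is, it cannot beat the local file system.

If you need to test repeatedly, consider writing the code as follows. It reads the saved file instead of sending a request each time, so the part you want to test runs much faster.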

from bs4 import BeautifulSoup


with open('example.html', 'r', encoding='utf-8') as f:
    html = f.read()

soup = BeautifulSoup(html, 'html.parser')
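The two versions above can be folded into one helper that hits the network only when the local copy is missing. This is a minimal sketch under stated assumptions: load_html and fake_fetch are hypothetical names, and fake_fetch stands in for the real request so the example runs offline.

```python
import os
import tempfile


def load_html(path, fetch):
    """Return cached HTML from `path`, calling `fetch()` only on a cache miss."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return f.read()

    html = fetch()
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    return html


# Stand-in for the real request so the sketch runs offline. In practice
# this could be: lambda: requests.get(url, headers=headers).text
calls = []


def fake_fetch():
    calls.append(1)  # record each simulated network request
    return "<html><body><h1>Day 6</h1></body></html>"


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "example.html")
    first = load_html(cache_path, fake_fetch)   # cache miss: calls fake_fetch
    second = load_html(cache_path, fake_fetch)  # cache hit: reads the file

print(first == second, len(calls))
```

Both calls return the same HTML, but the fetch function runs only once. Delete the cached file whenever you want a fresh copy of the page.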

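Composite data types

Sometimes the data you need to store is very complex. When that happens, don't be afraid to use a database.

There are many SQL databases, but most of them are heavier than a crawler needs. SQLite is a lightweight database that does not involve the network: it stores everything in a local file.

Even if you only need simple storage and have no relations to model, I think you can still use SQLite as if it were a NoSQL store.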
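As a rough illustration of using SQLite like a key-value store, the sketch below keeps one JSON document per URL with Python's built-in sqlite3 module. The table name pages, the helper names save_page and load_page, and the sample record are all assumptions for illustration, not something taken from the series.

```python
import json
import sqlite3

# ':memory:' keeps the database in RAM; pass a filename such as
# 'crawl.db' (hypothetical) to persist it to disk instead.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS pages (
        url  TEXT PRIMARY KEY,
        data TEXT NOT NULL
    )
    """
)


def save_page(conn, url, record):
    """Insert or overwrite the JSON-encoded record stored for `url`."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO pages (url, data) VALUES (?, ?)",
            (url, json.dumps(record, ensure_ascii=False)),
        )


def load_page(conn, url):
    """Return the stored record for `url`, or None if it is missing."""
    row = conn.execute(
        "SELECT data FROM pages WHERE url = ?", (url,)
    ).fetchone()
    return json.loads(row[0]) if row else None


url = "https://ithelp.ithome.com.tw/articles/10295353"
save_page(conn, url, {"title": "placeholder title", "tags": ["crawler"]})

record = load_page(conn, url)
missing = load_page(conn, "https://example.com/not-stored")
print(record, missing)

conn.close()
```

Storing the record as JSON text avoids designing a schema up front. If you later need to query individual fields, you can move them into their own columns.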

References

sqlite3 — DB-API 2.0 interface for SQLite databases
