Now that we know what requests and BeautifulSoup each do, let's put them together! We'll use the recipe site cookpad as our scraping target.
import requests
from bs4 import BeautifulSoup
response = requests.get("https://cookpad.com/tw")
print(response.status_code)
soup = BeautifulSoup(response.content, "html.parser")
print(soup)
403
<!DOCTYPE html>
<html dir="ltr" lang="zh-TW">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<title>很抱歉,此頁面讀取失敗。</title>
<script>
if (typeof Turbolinks !== 'undefined') {
location.reload();
}
</script>
<!-- Px -->
<script>
window._pxAppId = 'PXFqtAw5et';
window._pxJsClientSrc = '/FqtAw5et/init.js';
window._pxFirstPartyEnabled = true;
window._pxVid = '';
window._pxUuid = '32b1303b-2908-11ec-af10-7267594b6b65';
window._pxHostUrl = '/FqtAw5et/xhr';
window._PXFqtAw5et = {
locale: 'zh-TW',
translation: {
'zh-TW': [
{
"selector": "#px-form-head span",
"text": "遇到問題 ? 請提供更多資訊"
},
{
"selector": "#px-form div label[for=opt1]",
"text": "我沒有看到任何驗證碼"
},
{
"selector": "#px-form div label[for=opt2]",
"text": "我已解決驗證碼問題,但又出現另一組驗證碼"
},
{
"selector": "#px-form div label[for=opt3]",
"text": "我已解決多個驗證碼問題,但仍無法進入該連結"
},
{
"selector": "#px-form div label[for=opt4]",
"text": "其他(請詳細說明)"
},
{
"selector": "#px-form h4:nth-of-type(1)",
"text": "附加資訊:"
},
{
"selector": "#px-form-submit",
"text": "發送"
}
]
}
};
</script>
<script defer="" src="/FqtAw5et/captcha/captcha.js?a=c&u=32b1303b-2908-11ec-af10-7267594b6b65&v=&m=0"></script>
<!-- Custom Script -->
<link href="https://fonts.googleapis.com/css?family=Open+Sans:300" rel="stylesheet"/>
<style>
html, body {
margin: 0;
padding: 0;
font-family: 'Open Sans', sans-serif;
color: #000;
}
a {
color: #c5c5c5;
text-decoration: none;
}
.container {
align-items: center;
display: flex;
flex: 1;
justify-content: space-between;
flex-direction: column;
height: 100%;
}
.container > div {
width: 100%;
display: flex;
justify-content: center;
}
.container > div > div {
display: flex;
width: 80%;
}
.customer-logo-wrapper {
padding-top: 2rem;
flex-grow: 0;
background-color: #fff;
visibility: (null);
}
.customer-logo {
border-bottom: 1px solid #000;
}
.customer-logo > img {
padding-bottom: 1rem;
max-height: 50px;
max-width: 100%;
}
.page-title-wrapper {
flex-grow: 2;
}
.page-title {
flex-direction: column-reverse;
}
.content-wrapper {
flex-grow: 5;
}
.content {
flex-direction: column;
}
.page-footer-wrapper {
align-items: center;
flex-grow: 0.2;
background-color: #000;
color: #c5c5c5;
font-size: 70%;
}
@media (min-width: 768px) { html, body { height: 100%; } }
</style>
<!-- Custom CSS -->
</head>
<body>
<section class="container">
<div class="customer-logo-wrapper">
<div class="customer-logo"><img alt="Logo" src="https://assets-global.cpcdn.com/assets/logo_cookpad_large-827bc0b34d5c7ab322d3ff8de882e9f828d06bc5ae46d09c88d25aaf02686132.png"/></div>
</div>
<div class="page-title-wrapper">
<div class="page-title">
<h1>請確認您不是機器人</h1>
</div>
</div>
<div class="content-wrapper">
<div class="content">
<div id="px-captcha"></div>
<p>很抱歉,此頁面讀取失敗。系統偵測到您的電腦網路發出異常流量。</p> <p>可能會發生下列情況:</p> <ul> <li>Javascript 因某個擴充軟件失效或者被阻擋。例如:ad blockers</li> <li>您的瀏覽器不支持 cookie</li> </ul> <p>請確認開啟Javascript 和 cookies,以確保瀏覽順利。</p>
<p> Ref ID: #32b1303b-2908-11ec-af10-7267594b6b65 </p>
</div>
</div>
<div class="page-footer-wrapper">
<div class="page-footer">
<p> Powered by <a href="https://www.perimeterx.com/whywasiblocked">PerimeterX</a> , Inc </p>
</div>
</div>
</section>
</body>
</html>
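Why does this happen?
The request came back with status code 403, which means the server understood the request but refuses to give the client access to the resource. In other words, the site figured out that we are a bot and blocked us!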
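To make the 403 check concrete, here is a small sketch that uses only the standard library. The helper names `explain_status` and `looks_blocked` are made up for illustration, and treating `px-captcha` (the id of the captcha container in the output above) as a sign of a block page is an assumption about this one site, not a general rule.

```python
from http import HTTPStatus


def explain_status(code: int) -> str:
    """Return the status code together with its standard reason phrase."""
    return f"{code} {HTTPStatus(code).phrase}"


def looks_blocked(status_code: int, html: str) -> bool:
    """Guess whether a response is a block page rather than real content.

    Assumption: the block page shown above contains an element with
    id="px-captcha"; other sites will use different markers.
    """
    return status_code == HTTPStatus.FORBIDDEN or 'id="px-captcha"' in html


# A tiny fragment of the block page, standing in for the real response body.
blocked_html = '<div class="content"><div id="px-captcha"></div></div>'

print(explain_status(403))                            # 403 Forbidden
print(looks_blocked(403, blocked_html))               # True
print(looks_blocked(200, "<title>Cookpad</title>"))   # False
```

In a real script, a check like this could run on `response.status_code` and `response.text` before the page is handed to BeautifulSoup.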
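So what now? Does that mean we can't scrape at all?
Of course not! We're not giving up that easily. There is always a way around: since the site recognized us as a bot, we can send it a fake User-Agent header so that the request looks like it comes from a real person's browser.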
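It helps to see what requests sends when we don't set the header ourselves. The sketch below only prepares requests and never sends them, so nothing goes over the network. It assumes that the default User-Agent, which names the library, is what gave us away; the blocked output suggests this but does not prove it.

```python
import requests

URL = "https://cookpad.com/tw"
session = requests.Session()

# Prepare (but do not send) a request with no custom headers.
default_request = session.prepare_request(requests.Request("GET", URL))
default_ua = default_request.headers["User-Agent"]
print(default_ua)  # starts with "python-requests/"

# Prepare the same request with a browser-like User-Agent instead.
browser_ua = (
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; "
    "Trident/5.0; IEMobile/9.0)"
)
spoofed_request = session.prepare_request(
    requests.Request("GET", URL, headers={"User-Agent": browser_ua})
)
print(spoofed_request.headers["User-Agent"])
```

The header passed in `headers=` replaces the session default, which is why a single extra dictionary is enough to change how the request presents itself.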
import requests
from bs4 import BeautifulSoup
# Use a fake User-Agent header
user_agent = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0)'
headers = {'User-Agent': user_agent}
response = requests.get('https://cookpad.com/tw', headers=headers)
print(response.status_code)
soup = BeautifulSoup(response.content, "html.parser")
print(soup)
With the fake header, we can successfully fetch the page content! The output is far too long to include in full, so only the beginning is shown.
200
<!DOCTYPE html>
<html class="js js--off" dir="ltr" lang="zh">
<head>
<title>Cookpad 全球最大食譜社群-超過5百萬道家常料理|天天享受烹飪趣!</title>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="viewport-fit=cover, width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no" name="viewport">
<meta content="Cookpad" property="og:site_name"/>
<meta content="284DFFB29E8DABE16C08409C5C68F3C6" name="msvalidate.01"/>
<meta content="ZpmT8328xJFtRmIaSnJmAEnseeQQik7RTa2VfTs14ag" name="google-site-verification"/>
<meta content="Cookpad 全球最大食譜社群-超過5百萬道家常料理|天天享受烹飪趣!" property="og:title"/><meta content="找食譜嗎?尋找平台記錄自己的私房料理嗎?這邊的食譜通通任你免費瀏覽和收藏。歡迎加入這個的料理社群,一起天天享受烹飪趣!" name="description"/><meta content="找食譜嗎?尋找平台記錄自己的私房料理嗎?這邊的食譜通通任你免費瀏覽和收藏。歡迎加入這個的料理社群,一起天天享受烹飪趣!" property="og:description"/><meta content="//assets-global.cpcdn.com/assets/logo_ogp-cd3e10480377d7af945a23f409e7d311ced9cda1984e881875c74e555fadbc2f.png" property="og:image"/><meta content="1200" property="og:image:width"/><meta content="630" property="og:image:height"/><link href="https://cookpad.com/tw" rel="canonical"/><link href="https://cookpad.com/tw" hreflang="zh-tw" rel="alternate"/><meta content="https://cookpad.com/tw" property="og:url"/>
<meta content="authenticity_token" name="csrf-param"/>
<meta content="PCc0KK5zF0zqCXaAx5Rvx3yTrql-2LVAz4BXNrc6NlyYgvzQoFjj5j3gOQQMREFtrrYNHtvLXafr5AfUdGs9ew" name="csrf-token"/>
<script>
//<![CDATA[
window.LOCALE = 'zh-tw'
//]]>
</script>
<link data-turbolinks-track="reload" href="//assets-global.cpcdn.com/packs/css/v2/application-23dbf47e.css" media="all" rel="stylesheet"/>
<link data-turbolinks-track="reload" href="//assets-global.cpcdn.com/packs/css/print-ffebe649.css" media="print" rel="stylesheet"/>
<style media="all" type="text/css">
[data-visible-to] { display: none; }
[data-hidden-from-guest] { display: none; }
</style>
<script type="text/javascript">
document.documentElement.className = document.documentElement.className.replace("js--off","js--on")
</script>
<script type="text/javascript">
window.__webpack_public_path__ = "//assets-global.cpcdn.com/packs/"
</script>
<script data-turbolinks-track="reload" src="//assets-global.cpcdn.com/packs/js/0-77a18e050e7a38661f11.chunk.js"></script>
<script data-turbolinks-track="reload" src="//assets-global.cpcdn.com/packs/js/3-be1654a56f358d39438b.chunk.js"></script>
<script data-turbolinks-track="reload" src="//assets-global.cpcdn.com/packs/js/application-8812d5e182bd85394e80.js"></script>
<script type="text/javascript">
(function(){
window._pxAppId = 'PXFqtAw5et';
var p = document.getElementsByTagName('script')[0],
s = document.createElement('script');
s.async = 1;
s.src = '/FqtAw5et/init.js';
p.parentNode.insertBefore(s,p);
}());
</script>
<link href="https://use.typekit.net/zbz2cyk.css" rel="stylesheet"/>
<script>
(function(i,s,o,g,r,a,m){i["GoogleAnalyticsObject"]=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,"script","//www.google-analytics.com/analytics.js","ga");
</script>
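Printing the whole soup is rarely the end goal. As a small sketch of a possible next step, the snippet below pulls the page title and two tags out of a hand-copied fragment of the output above. The fragment stands in for `response.content` so that the example runs without network access.

```python
from bs4 import BeautifulSoup

# A short fragment copied from the output above, standing in for response.content.
html = """
<html><head>
<title>Cookpad 全球最大食譜社群-超過5百萬道家常料理|天天享受烹飪趣!</title>
<meta content="Cookpad" property="og:site_name"/>
<link href="https://cookpad.com/tw" rel="canonical"/>
</head><body></body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Text inside the <title> tag.
title = soup.title.string
# Attribute values are read with dictionary-style access on the tag.
site_name = soup.find("meta", attrs={"property": "og:site_name"})["content"]
canonical = soup.find("link", attrs={"rel": "canonical"})["href"]

print(title)
print(site_name)   # Cookpad
print(canonical)   # https://cookpad.com/tw
```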
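You don't have to write the fake header by hand, either! A package called fake-useragent can generate fake User-Agent strings automatically.
As before, open a terminal and run the following command to install it: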
pip install fake-useragent
It is also simple to use:
from fake_useragent import UserAgent
ua = UserAgent()
Next, pick whichever User-Agent fits your needs:
# Safari UA
user_agent = ua.safari
# IE UA
user_agent = ua.ie
# Chrome UA
user_agent = ua.chrome
# Randomly chosen UA
user_agent = ua.random
With fake-useragent generating the fake User-Agent header for us, we can get around being recognized as a bot by the site!
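The idea of picking a random User-Agent can also be sketched with the standard library alone. This is not how fake-useragent is implemented: the `USER_AGENTS` pool and the `random_headers` helper are hypothetical, and the second string in the pool is an illustrative browser-style value rather than one taken from this post.

```python
import random

# Hypothetical pool of User-Agent strings to rotate through.
USER_AGENTS = [
    # The string used in the hand-written example earlier in this post.
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; "
    "Trident/5.0; IEMobile/9.0)",
    # Illustrative Chrome-style string (an assumption, not from the post).
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]


def random_headers(pool=USER_AGENTS) -> dict:
    """Build a headers dictionary with a User-Agent picked at random."""
    return {"User-Agent": random.choice(pool)}


headers = random_headers()
print(headers["User-Agent"])
```

The resulting dictionary can be passed as `headers=headers` to `requests.get`, in the same way as the hand-written header earlier.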