2021 iThome Ironman Contest (iT邦幫忙), Modern Web, Day 28

陪聊_伃時不候 Line Bot chatbot series, part 28

Implementing a Web Crawler in Python, Part 3

Hands-on

Now that we understand what requests and BeautifulSoup do, let's put them together! Next, we'll crawl the recipe site cookpad.

import requests
from bs4 import BeautifulSoup

response = requests.get('https://cookpad.com/tw')
print(response.status_code)

soup = BeautifulSoup(response.content, "html.parser")
print(soup)

Output:

403

<!DOCTYPE html>

<html dir="ltr" lang="zh-TW">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<title>很抱歉,此頁面讀取失敗。</title>
<script>
          if (typeof Turbolinks !== 'undefined') {
            location.reload();
          }
        </script>
<!-- Px -->
<script>
          window._pxAppId = 'PXFqtAw5et';
          window._pxJsClientSrc = '/FqtAw5et/init.js';
          window._pxFirstPartyEnabled = true;
          window._pxVid = '';
          window._pxUuid = '32b1303b-2908-11ec-af10-7267594b6b65';
          window._pxHostUrl = '/FqtAw5et/xhr';
          window._PXFqtAw5et = {
            locale: 'zh-TW',
            translation: {
              'zh-TW': [
        {
            "selector": "#px-form-head span",
            "text": "遇到問題 ? 請提供更多資訊"
        },
        {
            "selector": "#px-form div label[for=opt1]",
            "text": "我沒有看到任何驗證碼"
        },
        {
            "selector": "#px-form div label[for=opt2]",
            "text": "我已解決驗證碼問題,但又出現另一組驗證碼"
        },
        {
            "selector": "#px-form div label[for=opt3]",
            "text": "我已解決多個驗證碼問題,但仍無法進入該連結"
        },
        {
            "selector": "#px-form div label[for=opt4]",
            "text": "其他(請詳細說明)"
        },
        {
            "selector": "#px-form h4:nth-of-type(1)",
            "text": "附加資訊:"
        },
        {
            "selector": "#px-form-submit",
            "text": "發送"
        }
    ]
            }
          };
        </script>
<script defer="" src="/FqtAw5et/captcha/captcha.js?a=c&u=32b1303b-2908-11ec-af10-7267594b6b65&v=&m=0"></script>
<!-- Custom Script -->
<link href="https://fonts.googleapis.com/css?family=Open+Sans:300" rel="stylesheet"/>
<style>
          html, body {
            margin: 0;
            padding: 0;
            font-family: 'Open Sans', sans-serif;
            color: #000;
          }
          a {
            color: #c5c5c5;
            text-decoration: none;
          }
          .container {
            align-items: center;
            display: flex;
            flex: 1;
            justify-content: space-between;
            flex-direction: column;
            height: 100%;
          }
          .container > div {
            width: 100%;
            display: flex;
            justify-content: center;
          }
          .container > div > div {
            display: flex;
            width: 80%;
          }
          .customer-logo-wrapper {
            padding-top: 2rem;
            flex-grow: 0;
            background-color: #fff;
            visibility: (null);
          }
          .customer-logo {
            border-bottom: 1px solid #000;
          }
          .customer-logo > img {
            padding-bottom: 1rem;
            max-height: 50px;
            max-width: 100%;
          }
          .page-title-wrapper {
            flex-grow: 2;
          }
          .page-title {
            flex-direction: column-reverse;
          }
          .content-wrapper {
            flex-grow: 5;
          }
          .content {
            flex-direction: column;
          }
          .page-footer-wrapper {
            align-items: center;
            flex-grow: 0.2;
            background-color: #000;
            color: #c5c5c5;
            font-size: 70%;
          }
          @media (min-width: 768px) { html, body { height: 100%; } }
        </style>
<!-- Custom CSS -->
</head>
<body>
<section class="container">
<div class="customer-logo-wrapper">
<div class="customer-logo"><img alt="Logo" src="https://assets-global.cpcdn.com/assets/logo_cookpad_large-827bc0b34d5c7ab322d3ff8de882e9f828d06bc5ae46d09c88d25aaf02686132.png"/></div>
</div>
<div class="page-title-wrapper">
<div class="page-title">
<h1>請確認您不是機器人</h1>
</div>
</div>
<div class="content-wrapper">
<div class="content">
<div id="px-captcha"></div>
<p>很抱歉,此頁面讀取失敗。系統偵測到您的電腦網路發出異常流量。</p> <p>可能會發生下列情況:</p> <ul> <li>Javascript 因某個擴充軟件失效或者被阻擋。例如:ad blockers</li> <li>您的瀏覽器不支持 cookie</li> </ul> <p>請確認開啟Javascript 和 cookies,以確保瀏覽順利。</p>
<p> Ref ID: #32b1303b-2908-11ec-af10-7267594b6b65 </p>
</div>
</div>
<div class="page-footer-wrapper">
<div class="page-footer">
<p> Powered by <a href="https://www.perimeterx.com/whywasiblocked">PerimeterX</a> , Inc </p>
</div>
</div>
</section>
</body>
</html>

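Why did this happen?
The status code returned for the page is 403, which means "the server understood the request, but the client is not permitted to access the resource." In other words, the site identified us as a bot and blocked us!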
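
To see why the first request is so easy to flag, it helps to look at what requests sends when we supply no headers of our own. The sketch below only inspects the library defaults and makes no network call; the variable names are illustrative.

```python
import requests

# Unless the caller supplies one, requests identifies itself with a
# "python-requests/<version>" User-Agent.
default_ua = requests.utils.default_user_agent()
print(default_ua)

# The complete set of headers requests attaches by default.
default_headers = requests.utils.default_headers()
for name, value in default_headers.items():
    print(f"{name}: {value}")
```

A User-Agent of this form openly says the client is a script rather than a browser, which is one signal a site can use to refuse the request.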

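So what now? Does that mean we can't crawl the site?

Of course not! We can't be beaten that easily. There is always a higher mountain: if the site can tell that we're a bot, we'll send it a fake User-Agent header so that it thinks a real person is browsing.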

import requests
from bs4 import BeautifulSoup

# Use a fake header
user_agent = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0)'
headers = {'User-Agent': user_agent}

response = requests.get('https://cookpad.com/tw', headers=headers)
print(response.status_code)

soup = BeautifulSoup(response.content, "html.parser")
print(soup)

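With the fake header we can successfully crawl the page content! The output is far too long to include in full, so only the beginning is shown here.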

200
<!DOCTYPE html>

<html class="js js--off" dir="ltr" lang="zh">
<head>
<title>Cookpad 全球最大食譜社群-超過5百萬道家常料理|天天享受烹飪趣!</title>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="viewport-fit=cover, width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no" name="viewport">
<meta content="Cookpad" property="og:site_name"/>
<meta content="284DFFB29E8DABE16C08409C5C68F3C6" name="msvalidate.01"/>
<meta content="ZpmT8328xJFtRmIaSnJmAEnseeQQik7RTa2VfTs14ag" name="google-site-verification"/>
<meta content="Cookpad 全球最大食譜社群-超過5百萬道家常料理|天天享受烹飪趣!" property="og:title"/><meta content="找食譜嗎?尋找平台記錄自己的私房料理嗎?這邊的食譜通通任你免費瀏覽和收藏。歡迎加入這個的料理社群,一起天天享受烹飪趣!" name="description"/><meta content="找食譜嗎?尋找平台記錄自己的私房料理嗎?這邊的食譜通通任你免費瀏覽和收藏。歡迎加入這個的料理社群,一起天天享受烹飪趣!" property="og:description"/><meta content="//assets-global.cpcdn.com/assets/logo_ogp-cd3e10480377d7af945a23f409e7d311ced9cda1984e881875c74e555fadbc2f.png" property="og:image"/><meta content="1200" property="og:image:width"/><meta content="630" property="og:image:height"/><link href="https://cookpad.com/tw" rel="canonical"/><link href="https://cookpad.com/tw" hreflang="zh-tw" rel="alternate"/><meta content="https://cookpad.com/tw" property="og:url"/>
<meta content="authenticity_token" name="csrf-param"/>
<meta content="PCc0KK5zF0zqCXaAx5Rvx3yTrql-2LVAz4BXNrc6NlyYgvzQoFjj5j3gOQQMREFtrrYNHtvLXafr5AfUdGs9ew" name="csrf-token"/>
<script>
//<![CDATA[
window.LOCALE = 'zh-tw'
//]]>
</script>
<link data-turbolinks-track="reload" href="//assets-global.cpcdn.com/packs/css/v2/application-23dbf47e.css" media="all" rel="stylesheet"/>
<link data-turbolinks-track="reload" href="//assets-global.cpcdn.com/packs/css/print-ffebe649.css" media="print" rel="stylesheet"/>
<style media="all" type="text/css">
      [data-visible-to] { display: none; }
      [data-hidden-from-guest] { display: none; }
  </style>
<script type="text/javascript">
        document.documentElement.className = document.documentElement.className.replace("js--off","js--on")
      </script>
<script type="text/javascript">
  window.__webpack_public_path__ = "//assets-global.cpcdn.com/packs/"
</script>
<script data-turbolinks-track="reload" src="//assets-global.cpcdn.com/packs/js/0-77a18e050e7a38661f11.chunk.js"></script>
<script data-turbolinks-track="reload" src="//assets-global.cpcdn.com/packs/js/3-be1654a56f358d39438b.chunk.js"></script>
<script data-turbolinks-track="reload" src="//assets-global.cpcdn.com/packs/js/application-8812d5e182bd85394e80.js"></script>
<script type="text/javascript">
      (function(){
          window._pxAppId = 'PXFqtAw5et';
          var p = document.getElementsByTagName('script')[0],
              s = document.createElement('script');
          s.async = 1;
          s.src = '/FqtAw5et/init.js';
          p.parentNode.insertBefore(s,p);
      }());
  </script>
<link href="https://use.typekit.net/zbz2cyk.css" rel="stylesheet"/>
<script>
  (function(i,s,o,g,r,a,m){i["GoogleAnalyticsObject"]=r;i[r]=i[r]||function(){
  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
  })(window,document,"script","//www.google-analytics.com/analytics.js","ga");
</script>
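
Once the request succeeds, printing the whole soup is rarely the end goal. Here is a minimal sketch of pulling a few specific fields out of the parsed page. It parses shortened copies of the two outputs shown in this article instead of fetching the site again. The helpers `looks_blocked` and `summarize` are hypothetical names, and treating a `px-captcha` element as the sign of a block page is an assumption based on the 403 output above, not a documented way to detect it.

```python
from bs4 import BeautifulSoup

# Shortened copies of the two responses shown in this article.
OK_HTML = """<html><head>
<title>Cookpad 全球最大食譜社群-超過5百萬道家常料理|天天享受烹飪趣!</title>
<meta content="Cookpad" property="og:site_name"/>
<link href="https://cookpad.com/tw" rel="canonical"/>
</head><body></body></html>"""

BLOCKED_HTML = """<html><head><title>很抱歉,此頁面讀取失敗。</title></head>
<body><div id="px-captcha"></div></body></html>"""


def looks_blocked(soup):
    """Hypothetical check: the block page contains an element with id 'px-captcha'."""
    return soup.find(id="px-captcha") is not None


def summarize(html):
    """Return title, og:site_name and canonical URL, or None for a block page."""
    soup = BeautifulSoup(html, "html.parser")
    if looks_blocked(soup):
        return None
    site_name = soup.find("meta", attrs={"property": "og:site_name"})
    canonical = soup.find("link", rel="canonical")
    return {
        "title": soup.title.string if soup.title else None,
        "site_name": site_name["content"] if site_name else None,
        "canonical": canonical["href"] if canonical else None,
    }


ok_summary = summarize(OK_HTML)
blocked_summary = summarize(BLOCKED_HTML)
print(ok_summary)
print(blocked_summary)
```

In a real crawl, `response.content` from the request above would take the place of `OK_HTML`.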

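There is also a package for generating fake headers! We can use a package called fake-useragent to produce a fake User-Agent string automatically.

As before, open a terminal and run the following command to install it: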

pip install fake-useragent

It is also simple to use, as follows:

from fake_useragent import UserAgent

ua = UserAgent()

Then pick the User-Agent you want according to your needs:

# Safari UA
user_agent = ua.safari

# IE UA
user_agent = ua.ie

# Chrome UA
user_agent = ua.chrome

# Randomly chosen UA
user_agent = ua.random

By letting fake-useragent generate a fake User-Agent header for us, we can get past the check that blocked our first request as a bot!
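
To tie the pieces together, here is a hedged sketch of turning a fake-useragent value into the `headers` dict used earlier. The `build_headers` helper and the `FALLBACK_UA` constant are hypothetical names introduced for illustration. The fallback string is the hand-written User-Agent from the earlier example, used here in case the package is not installed or cannot load its browser data.

```python
# Fallback UA taken from the earlier example in this article.
FALLBACK_UA = (
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; "
    "Trident/5.0; IEMobile/9.0)"
)


def build_headers():
    """Return request headers with a random User-Agent, or the fallback one."""
    try:
        from fake_useragent import UserAgent
        user_agent = UserAgent().random
    except Exception:
        # Package missing, or its browser data could not be loaded.
        user_agent = FALLBACK_UA
    return {"User-Agent": user_agent}


headers = build_headers()
print(headers)
```

The resulting dict can be passed as `requests.get('https://cookpad.com/tw', headers=headers)`, exactly like the hand-written header above.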

