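iT邦幫忙 (iT Help)

2022 iThome 鐵人賽 (2022 iThome Ironman Contest)

DAY 9
Self-Challenge Group

養爬蟲的人學爬蟲 (A Crawler Keeper Learns Web Crawling) series, part 9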

[Day 9] The First Crawler Library: Requests


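Chit-chat
After eight days of practicing the basics, today I finally get to write my first crawler!
I looked through books and online resources for a first library to learn, and settled on Requests.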

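The Requests library

  • What is Requests?
    Requests is an HTTP library for Python (a third-party package, not part of the standard library). Its goal is to make HTTP requests simpler and more human-friendly, and it provides functions for GET, POST, and the other common HTTP methods.

  • Installation

pip install requests
  • Usage
  1. Import
import requests

or

from requests import get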
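  2. Making requests
  • get
    Requests the specified resource; query parameters can be passed as a dictionary through params.

  • post
    Submits data to the specified resource; the form data can be passed as a dictionary through data.

  • put
    Sends the latest content for the specified resource; the data can be passed as a dictionary through data.

r = requests.put('url', data={'key': 'value'})
  • delete
    Requests deletion of the specified resource.
r = requests.delete('url')
  • head
    Requests only the response headers, without the body.
r = requests.head('url')
  • options
    Asks the server which options (such as HTTP methods) are available for the resource.
r = requests.options('url')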
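
To see where params and data actually end up, a request can be prepared without being sent. This is only a small sketch: the example.com URLs and the keys q, page, and key are placeholders I made up, not something from a real site.

```python
import requests

# Placeholder URLs and parameters, chosen only for illustration.
get_req = requests.Request(
    "GET", "https://example.com/search", params={"q": "python", "page": 2}
).prepare()

post_req = requests.Request(
    "POST", "https://example.com/form", data={"key": "value"}
).prepare()

# params end up in the query string of the URL.
print(get_req.url)

# data ends up form-encoded in the request body.
print(post_req.body)
print(post_req.headers["Content-Type"])
```

Nothing is sent over the network here; prepare() only builds the request that requests.get or requests.post would send.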
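  3. Inspecting the response (attributes and methods)
print(r.text)  # response body as a str

print(r.encoding)  # encoding used to decode r.text; it can also be assigned

print(r.url)  # final URL of the response

print(r.status_code)  # HTTP status code (int)

print(r.headers)  # response headers (dict-like, case-insensitive keys)

print(r.cookies)  # cookies sent back by the server (dict-like cookie jar)

print(r.history)  # earlier responses from redirects (list)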

4. JSON data
If the response body is JSON, .json() decodes it and returns the result (a dict when the body is a JSON object).

r = requests.get('url')
r.json()
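
The response attributes and .json() can be tried out without touching a real website. The sketch below starts a tiny local server from the standard library as a stand-in; the JsonHandler class, the /data path, and the {"message": "hello"} payload are all assumptions made up for the demo.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class JsonHandler(BaseHTTPRequestHandler):
    """Tiny stand-in server that answers every GET with a JSON object."""

    def do_GET(self):
        body = json.dumps({"message": "hello"}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example output quiet


server = HTTPServer(("127.0.0.1", 0), JsonHandler)  # port 0: pick a free port
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

session = requests.Session()
session.trust_env = False  # ignore proxy environment variables for this local test

r = session.get(f"http://127.0.0.1:{port}/data", timeout=5)
server.shutdown()
server.server_close()

print(r.status_code)              # status code as an int
print(r.headers["content-type"])  # header lookup is case-insensitive
print(r.text)                     # raw body as a str
print(r.json())                   # body decoded from JSON
print(r.history)                  # no redirects happened, so this list is empty
```

session.get takes the same arguments as requests.get; a Session is used here only so that proxy settings in the environment do not interfere with the loopback request.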

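5. Custom headers
Some websites block requests that arrive with the default python-requests User-Agent, so you sometimes need to supply your own headers.
Note that the words in a header field name are separated with - (for example User-Agent), not the _ we are used to in variable names!
Also, non-standard (custom) header fields have traditionally been marked with an X- prefix, although RFC 6648 has since deprecated that convention.

url = 'URL'
headers = {'Header-Name': 'value'}
r = requests.get(url, headers=headers)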
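
A quick way to check what the default User-Agent looks like, and what a request with custom headers would carry, is sketched below. The header values my-first-crawler/0.1 and zh-TW are hypothetical examples, and the request is only prepared, not sent.

```python
import requests

# What Requests sends as User-Agent when none is given ("python-requests/<version>").
print(requests.utils.default_user_agent())

# Hypothetical header values, chosen only for illustration.
custom_headers = {
    "User-Agent": "my-first-crawler/0.1",
    "Accept-Language": "zh-TW",
}

prepared = requests.Request(
    "GET", "https://example.com/", headers=custom_headers
).prepare()

# The prepared request carries the header names and values given above.
for name, value in prepared.headers.items():
    print(f"{name}: {value}")
```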

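6. Timeout
A timeout can be used to check whether a site is reachable, and it keeps the program from waiting indefinitely on a site that is down or under maintenance.

requests.get('url', timeout=SECONDS)  # SECONDS is a number, in seconds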
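
When the timeout expires, Requests raises an exception, so in practice the call is usually wrapped in try/except. Below is a minimal sketch under some assumptions: fetch_status is a helper name I made up, and a local socket that never answers stands in for a broken website.

```python
import socket

import requests


def fetch_status(session, url, seconds):
    """Return the status code, or None if no answer arrives within `seconds`."""
    try:
        return session.get(url, timeout=seconds).status_code
    except requests.exceptions.Timeout:
        return None


# A local socket that accepts connections but never answers, standing in
# for a site that is down (an assumption made for this demo).
silent = socket.socket()
silent.bind(("127.0.0.1", 0))
silent.listen(1)
port = silent.getsockname()[1]

session = requests.Session()
session.trust_env = False  # ignore proxy environment variables for this local test

result = fetch_status(session, f"http://127.0.0.1:{port}/", 0.5)
silent.close()
print(result)
```

requests.exceptions.Timeout covers both connect and read timeouts, so one except clause is enough for this simple case.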

7. Reading and sending cookies

# Read a cookie from the response
url = 'URL'
r = requests.get(url)
r.cookies['example_cookie_name']

# Send your own cookie with the request
url = 'URL'
cookies = dict(cookie_are='')
r = requests.get(url, cookies=cookies)
r.text
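
Both directions can be tried offline. This is a rough sketch: the cookie value abc123 and the example.com domain are made up, the request is only prepared rather than sent, and the cookie jar is built by hand to imitate what r.cookies would hold.

```python
import requests

# Sending: cookies passed as a dict become a Cookie request header.
prepared = requests.Request(
    "GET", "https://example.com/", cookies={"example_cookie_name": "abc123"}
).prepare()
print(prepared.headers["Cookie"])

# Reading: r.cookies is a RequestsCookieJar; a jar built by hand here
# supports the same lookups.
jar = requests.cookies.RequestsCookieJar()
jar.set("example_cookie_name", "abc123", domain="example.com", path="/")
print(jar["example_cookie_name"])
print(jar.get_dict())
```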

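HTTP status codes

Status code   Meaning
200           OK: the page loaded normally
301           Moved permanently: redirects to a new URL
302           Found: temporarily moved to another location
400           Bad request
401           Unauthorized: credentials are required
403           Forbidden: no permission
404           Page not found
500           Internal server error
502           Bad gateway: a service behind the server did not respond correctly
503           Service unavailable: the server temporarily cannot handle the request (for example, too much traffic)
504           Gateway timeout: the server did not get a response in time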
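
Python's standard library already knows the official names of these codes, which is handy when printing r.status_code. The sketch below uses http.HTTPStatus; the describe helper and its category labels are my own naming, not part of Requests.

```python
from http import HTTPStatus


def describe(code):
    """Return the standard reason phrase and a rough category for a status code."""
    if 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirect"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "other"
    return f"{code} {HTTPStatus(code).phrase} ({category})"


for code in (200, 301, 302, 400, 401, 403, 404, 500, 502, 503, 504):
    print(describe(code))
```

With a real response, describe(r.status_code) gives the same kind of line; Requests also offers r.raise_for_status(), which raises an exception for 4xx and 5xx responses.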

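Requests GET
Parameters submitted with GET are sent in the URL itself (visible).
As a first step, let's send a GET request.

import requests
url = 'https://www.google.com.tw/?hl=zh_TW'
r = requests.get(url)  # send a GET request to url
print(type(url), r)  # print the type of url and the response (which shows the status)
#output
<class 'str'> <Response [200]>

In a GET request, any parameters are placed directly in the URL after the ?. Multiple parameters are separated by &.
For example, https://www.google.com.tw/?hl=zh_TW has only one parameter, which follows the ?. The parameter name is hl and its value is zh_TW.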
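
The ? and & rules can be checked with the standard library's urllib.parse, without sending anything. In this small sketch the extra parameter q is made up just to show how two parameters are joined.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

url = "https://www.google.com.tw/?hl=zh_TW"

# Everything after the ? is the query string.
query = urlsplit(url).query
print(query)            # hl=zh_TW
print(parse_qs(query))  # {'hl': ['zh_TW']}

# Several parameters are joined with &; "q" is a made-up extra parameter.
two_params = urlencode({"hl": "zh_TW", "q": "requests"})
print(two_params)       # hl=zh_TW&q=requests
```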

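  • Logging in with a username and password
    Some pages can only be viewed after supplying a username and password. For sites that use HTTP authentication, these can be passed with auth.
r = requests.get('url', auth=('user', 'pass'))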
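
Passing a (user, password) tuple to auth uses HTTP Basic authentication. To see what that adds to the request, here is a sketch that prepares the request without sending it; example.com is a placeholder URL.

```python
import base64

import requests

# 'user' and 'pass' are the same placeholder credentials as in the text above.
prepared = requests.Request(
    "GET", "https://example.com/", auth=("user", "pass")
).prepare()

# HTTP Basic authentication sends "user:pass" base64-encoded in a header.
expected = "Basic " + base64.b64encode(b"user:pass").decode("ascii")
print(prepared.headers["Authorization"])
print(prepared.headers["Authorization"] == expected)
```

Base64 is an encoding, not encryption, so credentials sent this way are only protected when the URL uses HTTPS.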

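Requests POST
Parameters submitted with POST are sent in the request body rather than in the URL (less exposed, though only encrypted when HTTPS is used).
POST is mostly used for web forms that users fill in.
As before, the first step is to send a request.

import requests
mydata = {'key': 'value'}
r = requests.post('url', data=mydata)  # attach the data to the POST request
  • Uploading a file
import requests
myfile = {'myfile': open('myfile.docx', 'rb')}  # the file to upload
r = requests.post('url', files=myfile)  # attach the file to the POST request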
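
To peek at what a file upload looks like on the wire, the request can again be prepared without being sent. This is a sketch with assumptions: an in-memory io.BytesIO object stands in for a real file on disk, and the file name, contents, and example.com URL are made up.

```python
import io

import requests

# An in-memory file stands in for open('myfile.docx', 'rb'); the name and
# contents are made up for this example.
fake_file = io.BytesIO(b"hello from a fake file")
files = {"myfile": ("myfile.txt", fake_file)}

prepared = requests.Request(
    "POST", "https://example.com/upload", files=files
).prepare()

# files are sent as multipart/form-data, with a generated boundary string.
content_type = prepared.headers["Content-Type"].split(";")[0]
print(content_type)

# The body contains the field name, the file name, and the file contents.
print(b'name="myfile"' in prepared.body)
print(b'filename="myfile.txt"' in prepared.body)
print(b"hello from a fake file" in prepared.body)
```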

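First crawl
Let's practice on the Chinese-language Google home page. This crawler only prints the response text; it cannot yet clean the data or locate elements, and the printed string is hard to read.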

import requests
url = 'https://www.google.com.tw/?hl=zh_TW'
r = requests.get(url)
print(r.text)
#output
<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="zh-TW"><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title><script nonce="IAJX_v5J-wlSTh5O85tq8w">(function(){window.google={kEI:'ac0rY4vZKNHu-AbgtKPoCw',kEXPI:'0,1302536,56873,6059,206,4804,2316,383,246,5,5367,1123753,1197748,380743,16114,17444,1954,9286,22431,1361,283,12033,17583,4998,13228,3847,10622,22741,5081,885,708,1279,2742,149,1103,840,1983,213,4101,3514,606,2023,1777,520,14670,3227,2845,7,5599,28171,1851,2614,13142,3,346,230,6460,148,13975,4,1528,2304,27348,7422,7356,13659,4437,16786,5815,2542,4097,4049,3,3541,1,42160,2,14016,6249,7867,11623,6700,951,1429,14023,14719,4568,6258,23418,1252,5835,14968,4332,20,7464,445,2,2,1,6960,19672,8155,6582,799,14680,1289,873,14802,1,4831,7,1922,9779,19130,12192,4832,1520,6414,5091,3007,984,122,700,4,1,2,2,2,2,5952,2450,6721,238,2085,3065,5930,2348,14,82,949,1759,1182,751,446,1624,5356,1493,1030,2412,922,666,198,4,305,763,220,36,563,987,411,1541,1867,165,182,143,3,3,2,2,400,1073,563,555,1,1520,220,547,15,1645,675,1226,7,1,61,343,547,352,5,39,424,126,384,164,471,7,217,268,313,323,399,286,3,1,597,120,308,57,180,590,35,4,118,104,1548,538,5343604,656,130,5995858,2803379,3311,141,795,19736,1,298,48,1570,83,1,3,3,1,1,1,20728802,33,3219986,4042143,1964,3094,2321,11258,3405,5543',kBL:'x9VE'};google.sn='webhp';google.kHL='zh-TW';})();(function(){
var f=this||self;var h,k=[];function l(a){for(var b;a&&(!a.getAttribute||!(b=a.getAttribute("eid")));)a=a.parentNode;return b||h}function m(a){for(var b=null;a&&(!a.getAttribute||!(b=a.getAttribute("leid")));)a=a.parentNode;return b}
function n(a,b,c,d,g){var e="";c||-1!==b.search("&ei=")||(e="&ei="+l(d),-1===b.search("&lei=")&&(d=m(d))&&(e+="&lei="+d));d="";!c&&f._cshid&&-1===b.search("&cshid=")&&"slh"!==a&&(d="&cshid="+f._cshid);c=c||"/"+(g||"gen_204")+"?atyp=i&ct="+a+"&cad="+b+e+"&zx="+Date.now()+d;/^http:/i.test(c)&&"https:"===window.location.protocol&&(google.ml&&google.ml(Error("a"),!1,{src:c,glmm:1}),c="");return c};h=google.kEI;google.getEI=l;google.getLEI=m;google.ml=function(){return null};google.log=function(a,b,c,d,g){if(c=n(a,b,c,d,g)){a=new Image;var e=k.length;k[e]=a;a.onerror=a.onload=a.onabort=function(){delete k[e]};a.src=c}};google.logUrl=n;}).call(this);(function(){
google.y={};google.sy=[];google.x=function(a,b){if(a)var c=a.id;else{do c=Math.random();while(google.y[c])}google.y[c]=[a,b];return!1};google.sx=function(a){google.sy.push(a)};google.lm=[];google.plm=function(a){google.lm.push.apply(google.lm,a)};google.lq=[];google.load=function(a,b,c){google.lq.push([[a],b,c])};google.loadAll=function(a,b){google.lq.push([a,b])};google.bx=!1;google.lx=function(){};}).call(this);google.f={};(function(){
document.documentElement.addEventListener("submit",function(b){var a;if(a=b.target){var c=a.getAttribute("data-submitfalse");a="1"===c||"q"===c&&!a.elements.q.value?!0:!1}else a=!1;a&&(b.preventDefault(),b.stopPropagation())},!0);document.documentElement.addEventListener("click",function(b){var a;a:{for(a=b.target;a&&a!==document.documentElement;a=a.parentElement)if("A"===a.tagName){a="1"===a.getAttribute("data-nohref");break a}a=!1}a&&b.preventDefault()},!0);}).call(this);</script><style>#gbar,#guser{font-size:13px;padding-top:1px !important;}#gbar{height:22px}#guser{padding-bottom:7px !important;text-align:right}.gbh,.gbd{border-top:1px solid #c9d7f1;font-size:1px}.gbh{height:0;position:absolute;top:24px;width:100%}@media all{.gb1{height:22px;margin-right:.5em;vertical-align:top}#gbar{float:left}}a.gb1,a.gb4{text-decoration:underline !important}a.gb1,a.gb4{color:#00c !important}.gbi .gb4{color:#dd8e27 !important}.gbf .gb4{color:#900 !important}
</style><style>body,td,a,p,.h{font-family:arial,sans-serif}body{margin:0;overflow-y:scroll}#gog{padding:3px 8px 0}td{line-height:.8em}.gac_m td{line-height:17px}form{margin-bottom:20px}.h{color:#1558d6}em{color:#c5221f;font-style:normal;font-weight:normal}a em{text-decoration:underline}.lst{height:25px;width:496px}.gsfi,.lst{font:18px arial,sans-serif}.gsfs{font:17px arial,sans-serif}.ds{display:inline-box;display:inline-block;margin:3px 0 4px;margin-left:4px}input{font-family:inherit}body{background:#fff;color:#000}a{color:#4b11a8;text-decoration:none}a:hover,a:active{text-decoration:underline}.fl a{color:#1558d6}a:visited{color:#4b11a8}.sblc{padding-top:5px}.sblc a{display:block;margin:2px 0;margin-left:13px;font-size:11px}.lsbb{background:#f8f9fa;border:solid 1px;border-color:#dadce0 #70757a #70757a #dadce0;height:30px}.lsbb{display:block}#WqQANb a{display:inline-block;margin:0 12px}.lsb{background:url(/images/nav_logo229.png) 0 -261px repeat-x;border:none;color:#000;cursor:pointer;height:30px;margin:0;outline:0;font:15px arial,sans-serif;vertical-align:top}.lsb:active{background:#dadce0}.lst:focus{outline:none}</style><script nonce="IAJX_v5J-wlSTh5O85tq8w">(function(){window.google.erd={jsr:1,bv:1657,de:true};
var h=this||self;var k,l=null!=(k=h.mei)?k:1,n,p=null!=(n=h.sdo)?n:!0,q=0,r,t=google.erd,v=t.jsr;google.ml=function(a,b,d,m,e){e=void 0===e?2:e;b&&(r=a&&a.message);if(google.dl)return google.dl(a,e,d),null;if(0>v){window.console&&console.error(a,d);if(-2===v)throw a;b=!1}else b=!a||!a.message||"Error loading script"===a.message||q>=l&&!m?!1:!0;if(!b)return null;q++;d=d||{};b=encodeURIComponent;var c="/gen_204?atyp=i&ei="+b(google.kEI);google.kEXPI&&(c+="&jexpid="+b(google.kEXPI));c+="&srcpg="+b(google.sn)+"&jsr="+b(t.jsr)+"&bver="+b(t.bv);var f=a.lineNumber;void 0!==f&&(c+="&line="+f);var g=
a.fileName;g&&(0<g.indexOf("-extension:/")&&(e=3),c+="&script="+b(g),f&&g===window.location.href&&(f=document.documentElement.outerHTML.split("\n")[f],c+="&cad="+b(f?f.substring(0,300):"No script found.")));c+="&jsel="+e;for(var u in d)c+="&",c+=b(u),c+="=",c+=b(d[u]);c=c+"&emsg="+b(a.name+": "+a.message);c=c+"&jsst="+b(a.stack||"N/A");12288<=c.length&&(c=c.substr(0,12288));a=c;m||google.log(0,"",a);return a};window.onerror=function(a,b,d,m,e){r!==a&&(a=e instanceof Error?e:Error(a),void 0===d||"lineNumber"in a||(a.lineNumber=d),void 0===b||"fileName"in a||(a.fileName=b),google.ml(a,!1,void 0,!1,"SyntaxError"===a.name||"SyntaxError"===a.message.substring(0,11)||0<a.message.indexOf("Script error")?2:0));r=null;p&&q>=l&&(window.onerror=null)};})();</script></head><body bgcolor="#fff"><script nonce="IAJX_v5J-wlSTh5O85tq8w">(function(){var src='/images/nav_logo229.png';var iesg=false;document.body.onload = function(){window.n && window.n();if (document.images){new Image().src=src;}
if (!iesg){document.f&&document.f.q.focus();document.gbqf&&document.gbqf.q.focus();}
}
})();</script><div id="mngb"><div id=gbar><nobr><b class=gb1>搜尋</b> <a class=gb1 href="https://www.google.com.tw/imghp?hl=zh-TW&tab=wi">圖片</a> <a class=gb1 href="https://maps.google.com.tw/maps?hl=zh-TW&tab=wl">地圖</a> <a class=gb1 href="https://play.google.com/?hl=zh-TW&tab=w8">Play</a> <a class=gb1 href="https://www.youtube.com/?tab=w1">YouTube</a> <a class=gb1 href="https://news.google.com/?tab=wn">新聞</a> <a class=gb1 href="https://mail.google.com/mail/?tab=wm">Gmail</a> <a class=gb1 href="https://drive.google.com/?tab=wo">雲端硬碟</a> <a class=gb1 style="text-decoration:none" href="https://www.google.com.tw/intl/zh-TW/about/products?tab=wh"><u>更多</u> &raquo;</a></nobr></div><div id=guser width=100%><nobr><span id=gbn class=gbi></span><span id=gbf class=gbf></span><span id=gbe></span><a href="http://www.google.com.tw/history/optout?hl=zh-TW" class=gb4>網頁記錄</a> | <a  href="/preferences?hl=zh-TW" class=gb4>設定</a> | <a target=_top id=gb_70 href="https://accounts.google.com/ServiceLogin?hl=zh-TW&passive=true&continue=https://www.google.com.tw/%3Fhl%3Dzh_TW&ec=GAZAAQ" class=gb4>登入</a></nobr></div><div class=gbh style=left:0></div><div class=gbh style=right:0></div></div><center><br clear="all" id="lgpd"><div id="lga"><img alt="Google" height="92" src="/images/branding/googlelogo/1x/googlelogo_white_background_color_272x92dp.png" style="padding:28px 0 14px" width="272" id="hplogo"><br><br></div><form action="/search" name="f"><table cellpadding="0" cellspacing="0"><tr valign="top"><td width="25%">&nbsp;</td><td align="center" nowrap=""><input name="ie" value="Big5" type="hidden"><input value="zh-TW" name="hl" type="hidden"><input name="source" type="hidden" value="hp"><input name="biw" type="hidden"><input name="bih" type="hidden"><div class="ds" style="height:32px;margin:4px 0"><input class="lst" style="margin:0;padding:5px 8px 0 6px;vertical-align:top;color:#000" autocomplete="off" value="" title="Google 搜尋" maxlength="2048" name="q" size="57"></div><br 
style="line-height:0"><span class="ds"><span class="lsbb"><input class="lsb" value="Google 搜尋" name="btnG" type="submit"></span></span><span class="ds"><span class="lsbb"><input class="lsb" id="tsuid_1" value="好手氣" name="btnI" type="submit"><script nonce="IAJX_v5J-wlSTh5O85tq8w">(function(){var id='tsuid_1';document.getElementById(id).onclick = function(){if (this.form.q.value){this.checked = 1;if (this.form.iflsig)this.form.iflsig.disabled = false;}
else top.location='/doodles/';};})();</script><input value="AJiK0e8AAAAAYyvbecC6mu0GqbcdsPqqhXDesXTsFjCa" name="iflsig" type="hidden"></span></span></td><td class="fl sblc" align="left" nowrap="" width="25%"><a href="/advanced_search?hl=zh-TW&amp;authuser=0">進階搜尋</a></td></tr></table><input id="gbv" name="gbv" type="hidden" 
value="1"><script nonce="IAJX_v5J-wlSTh5O85tq8w">(function(){
var a,b="1";if(document&&document.getElementById)if("undefined"!=typeof XMLHttpRequest)b="2";else if("undefined"!=typeof ActiveXObject){var c,d,e=["MSXML2.XMLHTTP.6.0","MSXML2.XMLHTTP.3.0","MSXML2.XMLHTTP","Microsoft.XMLHTTP"];for(c=0;d=e[c++];)try{new ActiveXObject(d),b="2"}catch(h){}}a=b;if("2"==a&&-1==location.search.indexOf("&gbv=2")){var f=google.gbvu,g=document.getElementById("gbv");g&&(g.value=a);f&&window.setTimeout(function(){location.href=f},0)};}).call(this);</script></form><div 
id="gac_scont"></div><div style="font-size:83%;min-height:3.5em"><br></div><span id="footer"><div style="font-size:10pt"><div style="margin:19px auto;text-align:center" id="WqQANb"><a href="http://www.google.com.tw/intl/zh-TW/services/">商業解決方案</a><a href="/intl/zh-TW/about.html">關於 Google</a><a href="https://www.google.com.tw/setprefdomain?prefdom=US&amp;sig=K_Ek7YMfFhDYrLrvRxStxw0qT7zNg%3D" id="fehl">Google.com</a></div></div><p style="font-size:8pt;color:#70757a">&copy; 2022 - <a 
href="/intl/zh-TW/policies/privacy/">隱私權</a> - <a href="/intl/zh-TW/policies/terms/">服務條款</a></p></span></center><script nonce="IAJX_v5J-wlSTh5O85tq8w">(function(){window.google.cdo={height:757,width:1440};(function(){
var a=window.innerWidth,b=window.innerHeight;if(!a||!b){var c=window.document,d="CSS1Compat"==c.compatMode?c.documentElement:c.body;a=d.clientWidth;b=d.clientHeight}a&&b&&(a!=google.cdo.width||b!=google.cdo.height)&&google.log("","","/client_204?&atyp=i&biw="+a+"&bih="+b+"&ei="+google.kEI);}).call(this);})();</script> <script nonce="IAJX_v5J-wlSTh5O85tq8w">(function(){google.xjs={ck:'xjs.hp.nAt8mkHlvVw.L.X.O',cs:'ACT90oFbSni2diyqWHkZE0NBPwvwM9CUVw',excm:[]};})();</script>  <script nonce="IAJX_v5J-wlSTh5O85tq8w">(function(){var u='/xjs/_/js/k\x3dxjs.hp.en.Pa2FzRQfyWU.O/am\x3dAACeAAAkAEAB/d\x3d1/ed\x3d1/rs\x3dACT90oGGOHAG99qraA9hYMsxdyq3TX3bvQ/m\x3dsb_he,d';
var d=this||self,e=function(a){return a};
var g;var l=function(a,b){this.g=b===h?a:""};l.prototype.toString=function(){return this.g+""};var h={};function n(){var a=u;google.lx=function(){p(a);google.lx=function(){}};google.bx||google.lx()}
function p(a){google.timers&&google.timers.load&&google.tick&&google.tick("load","xjsls");var b=document;var c="SCRIPT";"application/xhtml+xml"===b.contentType&&(c=c.toLowerCase());c=b.createElement(c);if(void 0===g){b=null;var k=d.trustedTypes;if(k&&k.createPolicy){try{b=k.createPolicy("goog#html",{createHTML:e,createScript:e,createScriptURL:e})}catch(q){d.console&&d.console.error(q.message)}g=b}else g=b}a=(b=g)?b.createScriptURL(a):a;a=new l(a,h);c.src=a instanceof l&&a.constructor===l?a.g:"type_error:TrustedResourceUrl";var f,m;(f=(a=null==(m=(f=(c.ownerDocument&&c.ownerDocument.defaultView||window).document).querySelector)?void 0:m.call(f,"script[nonce]"))?a.nonce||a.getAttribute("nonce")||"":"")&&c.setAttribute("nonce",f);document.body.appendChild(c);google.psa=!0};google.xjsu=u;setTimeout(function(){n()},0);})();function _DumpException(e){throw e;}
function _F_installCss(c){}
(function(){google.jl={blt:'none',chnk:0,dw:false,dwu:true,emtn:0,end:0,ine:false,injs:'none',injt:0,injth:0,injv2:false,lls:'default',pdt:0,rep:0,snet:true,strt:0,ubm:false,uwp:true};})();(function(){var pmc='{\x22d\x22:{},\x22sb_he\x22:{\x22agen\x22:true,\x22cgen\x22:true,\x22client\x22:\x22heirloom-hp\x22,\x22dh\x22:true,\x22dhqt\x22:true,\x22ds\x22:\x22\x22,\x22ffql\x22:\x22zh-TW\x22,\x22fl\x22:true,\x22host\x22:\x22google.com.tw\x22,\x22isbh\x22:28,\x22jsonp\x22:true,\x22msgs\x22:{\x22cibl\x22:\x22清除搜尋\x22,\x22dym\x22:\x22你是不是要查:\x22,\x22lcky\x22:\x22好手氣\x22,\x22lml\x22:\x22瞭解詳情\x22,\x22oskt\x22:\x22輸入工具\x22,\x22psrc\x22:\x22已從您的「\\u003Ca href\x3d\\\x22/history\\\x22\\u003E網頁記錄\\u003C/a\\u003E」中移除這筆搜尋記錄\x22,\x22psrl\x22:\x22移除\x22,\x22sbit\x22:\x22以圖搜尋\x22,\x22srch\x22:\x22Google 搜尋\x22},\x22ovr\x22:{},\x22pq\x22:\x22\x22,\x22refpd\x22:true,\x22rfs\x22:[],\x22sbas\x22:\x220 3px 8px 0 rgba(0,0,0,0.2),0 0 0 1px rgba(0,0,0,0.08)\x22,\x22sbpl\x22:16,\x22sbpr\x22:16,\x22scd\x22:10,\x22stok\x22:\x22ZjtxvoHJ9WqhldticDCWCaRXhOg\x22,\x22uhde\x22:false}}';google.pmc=JSON.parse(pmc);})();</script>        </body></html>

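Conclusion
Today I introduced and used the requests package, covering both the preparation for and the execution of my first crawl.
As always, doing something for the first time feels novel and fun!
Next, I will move on to other parts of crawling, such as data cleaning.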

Tomorrow!
[Day 10] First Data Cleaning: Requests-HTML


