Python crawler exception ConnectionError: HTTPSConnectionPool

Exception message

raise ConnectionError(e, request=request)

ConnectionError: HTTPSConnectionPool(host='e-service.cwb.gov.tw', port=443): Max retries exceeded with url: /HistoryDataQuery/DayDataController.do?command=viewMain&station=C0A9A0&stname=&datepicker=2018-02-27 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x00000180FB685470>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.'))

I found the workaround below online, but even with it added, the exception still comes back after the connection has run for a while, and I have to reconnect over and over.

while True:
    try:
        page = requests.get(url)
    except requests.exceptions.ConnectionError:
        print("Connection refused by the server..")
        print("Let me sleep for 5 seconds")
        print("ZZzzzz...")
        time.sleep(5)
        print("Was a nice sleep, now let me continue...")
        continue
    break  # success, stop retrying
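
For reference, the same pattern can be bounded so it does not retry forever at a fixed five-second interval: return once the request succeeds, wait a bit longer after each failure, and give up after a few attempts. A minimal sketch (the fetch_with_retry name and the limits are illustrative, not from the original post):

import time
import requests

def fetch_with_retry(url, max_tries=5, base_delay=5):
    # Try the request up to max_tries times, doubling the wait after each failure.
    for attempt in range(max_tries):
        try:
            return requests.get(url, timeout=30)
        except requests.exceptions.ConnectionError:
            wait = base_delay * (2 ** attempt)  # 5, 10, 20, 40, ... seconds
            print("Connection failed, sleeping %d seconds..." % wait)
            time.sleep(wait)
    raise RuntimeError("giving up on %s after %d tries" % (url, max_tries))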

Original program

import datetime as dt   # datetime for date arithmetic, aliased as dt
import requests, time   # requests for HTTP, time for sleeping

# station list, each entry formatted as "stationID_name"
location_arr=['466910_鞍部','466920_臺北','466930_竹子湖','C0A980_社子','C0A9A0_大直','C0A9B0_石牌','C0A9C0_天母','C0A9E0_士林','C0A9F0_內湖','C0AC40_大屯山','C0AC70_信義','C0AC80_文山','C0AH40_平等','C0AH70_松山','C1AC50_關渡','C1A730_公館','C0A9G0_南港','C0A990_大崙尾山']

# number of days in the date range
startdate = dt.datetime(2018, 2, 8)
enddate = dt.datetime(2018, 6, 1)
totaldate = (enddate - startdate).days + 1

# main download loop, one iteration per day
for daynumber in range(totaldate):
    # convert the date to a string
    datestring = str((startdate + dt.timedelta(days=daynumber)).date())
    print(datestring)
    # fetch one "date_station.htm" file per station for this day
    for i in location_arr:
        url = "https://e-service.cwb.gov.tw/HistoryDataQuery/DayDataController.do?command=viewMain&station=%s&stname=&datepicker=%s" % (i.split("_")[0], datestring)
        print(url)  # show the request URL
        r = requests.get(url)  # fetch the page
        print("Download: " + datestring + "_" + i + ".htm")
        with open(datestring + "_" + i + ".htm", 'w', encoding='utf-8') as f:
            f.write(r.text)
        time.sleep(5)  # sleep 5 seconds between requests
        
# an attempt to stop the program erroring out from requesting too fast

while True:
    try:
        page = requests.get(url)
    except requests.exceptions.ConnectionError:
        print("Connection refused by the server..")
        print("Let me sleep for 5 seconds")
        print("ZZzzzz...")
        time.sleep(5)
        print("Was a nice sleep, now let me continue...")
        continue
    break  # success, stop retrying

2 Answers

froce
iT邦大師 1 級 ‧ 2019-09-04 13:30:19
Best Answer

Max retries exceeded with url

The error message already tells you what happened: the server has blocked you.
Let it sleep longer.
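
For longer, increasing waits without hand-rolling a loop, requests can also retry and back off by itself through urllib3's Retry helper mounted on a Session. A minimal sketch (the retry count, backoff_factor and status list are my own assumptions, not something specified in this answer):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# Retry up to 5 times, sleeping exponentially longer between attempts,
# and also retry on status codes an overloaded server typically returns.
retries = Retry(total=5, backoff_factor=2, status_forcelist=[429, 500, 502, 503, 504])
session.mount("https://", HTTPAdapter(max_retries=retries))

r = session.get(url, timeout=30)  # 'url' built as in the question's script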

ccutmis
iT邦高手 2 級 ‧ 2019-09-04 15:06:36
import datetime as dt                # datetime for date arithmetic
import requests, time, pathlib, os   # HTTP, sleeping, filesystem paths

def log(arg):
    # append one failed URL per line to the failure log
    with open('log\\event.log', 'a', encoding='utf-8') as f:
        f.writelines(arg + '\n')

base_dir = os.path.dirname(os.path.realpath(__file__))
pathlib.Path(base_dir + "\\log\\").mkdir(parents=True, exist_ok=True)

with open('log\\event.log', 'w', encoding='utf-8') as f:
    pass  # truncate event.log

location_arr=['466910_鞍部','466920_臺北','466930_竹子湖','C0A980_社子','C0A9A0_大直','C0A9B0_石牌','C0A9C0_天母','C0A9E0_士林','C0A9F0_內湖','C0AC40_大屯山','C0AC70_信義','C0AC80_文山','C0AH40_平等','C0AH70_松山','C1AC50_關渡','C1A730_公館','C0A9G0_南港','C0A990_大崙尾山']

startdate = dt.datetime(2019, 2, 1)
enddate = dt.datetime(2019, 8, 1)
totaldate = (enddate - startdate).days + 1

data_folder = 'HTML_DATA'
pathlib.Path(data_folder).mkdir(parents=True, exist_ok=True)

# main download loop, one iteration per day
for daynumber in range(totaldate):
    datestring = str((startdate + dt.timedelta(days=daynumber)).date())
    print(datestring)
    # fetch one "date_station.htm" file per station for this day
    for i in location_arr:
        url = "https://e-service.cwb.gov.tw/HistoryDataQuery/DayDataController.do?command=viewMain&station=%s&stname=&datepicker=%s" % (i.split("_")[0], datestring)
        #print(url)
        try:
            r = requests.get(url)
            print("Download: " + datestring + "_" + i.split("_")[0] + ".htm")
            with open(data_folder + "\\" + datestring + "_" + i.split("_")[0] + ".htm", 'w', encoding='utf-8') as f:
                f.write(r.text)
        except Exception:
            print("Error: " + datestring + "_" + i + ".htm")
            log(url)
        finally:
            time.sleep(5)

Explanation:
Sorry, I only just noticed your comment under your earlier question.
You have run into one of the most common web-crawling problems. The basic, standard answer is exactly what froce said: raise the time.sleep value. This is the server's anti-crawler mechanism. Testing on my machine, time.sleep(5) does not get blocked; if 5 gets you blocked, try 10.

Also, do not run several crawler programs against the same site from the same outbound IP at the same time (for example, five crawlers fetching the same site at once: even with sleep(10) in each, from the server's side it is still one IP hitting the site in a very short window, which looks like an attack).

I have reworked your program a little (the example above) so that it saves the downloaded pages into an 'HTML_DATA' folder, prints "Download: ..." on success and "Error: ..." on failure, and appends every failed URL to 'log/event.log'. You do not have to keep track of which downloads succeeded and which failed; once the loop has run through, open event.log to see what is still missing.

With the 'log/event.log' record, the ones that slipped through can be fetched again: write another small Python program that reads event.log into an array, then runs the same crawl over that array. For example:

import requests, time, pathlib, os, re

def get_url_filename(arg):
    # rebuild the "date_station" file name from a failed URL's query string
    return re.sub(r'^.*?station=(.*?)&stname=&datepicker=(.*?)$', r'\2_\1', arg)

def log(arg):
    # append one still-failing URL per line to the second-pass log
    with open('log\\event2.log', 'a', encoding='utf-8') as f:
        f.writelines(arg + '\n')

base_dir = os.path.dirname(os.path.realpath(__file__))
pathlib.Path(base_dir + "\\log\\").mkdir(parents=True, exist_ok=True)

with open('log\\event2.log', 'w', encoding='utf-8') as f:
    pass  # truncate event2.log

# read event.log into the array url_array
with open("log\\event.log", "r", encoding='utf-8') as log_file:
    url_array = log_file.read().split('\n')

data_folder = 'HTML_DATA'
pathlib.Path(data_folder).mkdir(parents=True, exist_ok=True)

# retry every URL that failed on the first pass
for i in url_array:
    if i != '':
        try:
            r = requests.get(i)
            print("Download: " + get_url_filename(i) + ".htm")
            with open(data_folder + "\\" + get_url_filename(i) + ".htm", 'w', encoding='utf-8') as f:
                f.write(r.text)
        except Exception:
            print("Error: " + get_url_filename(i) + ".htm")
            log(i)
        finally:
            time.sleep(5)
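
Any URL that fails again on this pass ends up in 'log/event2.log'. Presumably the hand-off can be repeated until the log comes out empty; a minimal sketch of that step (the copy-back is my assumption about the intended workflow):

import os, shutil

# If event2.log still lists failures, feed it back in as event.log
# and run the retry script above again.
if os.path.getsize('log\\event2.log') > 0:
    shutil.copyfile('log\\event2.log', 'log\\event.log')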

This reply also covers the question you left as a comment on the other thread.
