Previous posts
DAY 01 : Contest goals and planning
DAY 02 : python3 virtualenv setup
DAY 03 : python3 request
DAY 04 : Using beautifulsoup4 and lxml
DAY 05 : Grabbing tags with select and find
DAY 06 : Getting values from the list after soup parsing
DAY 07 : request header cookie to pass the over-18 page check
DAY 08 : ppt content crawling
DAY 09 : Data processing with split, replace, strip
DAY 10 : python csv writing and dict merging
DAY 11 : python class function
DAY 12 : Using the scrapy crawl framework
DAY 13 : scrapy architecture
DAY 14 : scrapy pipeline data insert mongodb
DAY 15 : scrapy middleware proxy
First, a word on why we would want to use a proxy IP at all!
Crawlers put load on web servers, so the people maintaining a site come up with many ways to keep their servers from being attacked or overloaded. The most common check is on the request headers: if there is no User-Agent disguised as a real browser, or the User-Agent version is too old, the server will refuse the connection.
There is also ROBOTSTXT_OBEY in Scrapy's settings.py: if it is True, Scrapy itself will drop any request that robots.txt disallows, which also looks like a refused connection.
Likewise, if COOKIES_ENABLED in the settings is False, the connection is not refused, but pages that depend on cookies will simply not hand over some of their data.
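As a rough illustration, these options live in settings.py; the User-Agent string below is only an example and not the one used in this project:

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0 Safari/537.36'  # pretend to be a current browser
ROBOTSTXT_OBEY = False    # with True, Scrapy obeys robots.txt and drops disallowed requests
COOKIES_ENABLED = True    # with False, no cookies are sent, so cookie-gated data never comes back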
The trickier problem is the IP itself: banning access by IP is one of the most common countermeasures sites take.
To disguise the IP, look at the Scrapy architecture diagram: a downloader middleware is the place to do it.
First we need a pool of IPs to rotate through, so the spider below scrapes a free proxy list:
from bs4 import BeautifulSoup
import scrapy


class scrapyProxy(scrapy.Spider):
    name = 'proxy'
    allowed_domains = ['www.us-proxy.org']
    url = 'https://httpbin.org/ip'
    start_urls = ['https://www.us-proxy.org/']
    custom_settings = {
        'ITEM_PIPELINES': {
            'new_arrival.pipelines.new_arrival_pipelines.GoogleSheetPipeline': None,
            'new_arrival.pipelines.new_arrival_pipelines.NikeCrawlPipeline': 322
        }
    }

    def parse(self, response):
        soup = BeautifulSoup(response.text, 'lxml')
        trs = soup.select("#proxylisttable tr")
        for tr in trs:
            tr_soup = BeautifulSoup(str(tr), 'lxml')
            tds = tr_soup.select("td")
            if len(tds) > 6:
                ip = tds[0].text
                port = tds[1].text
                anonymity = tds[4].text
                ifScheme = tds[6].text          # the "Https" column: 'yes' or 'no'
                if ifScheme == 'yes':
                    scheme = 'https'
                else:
                    scheme = 'http'
                proxy = "%s://%s:%s" % (scheme, ip, port)
                # if anonymity != 'anonymous' and scheme == 'https':
                if scheme == 'https':
                    # three modes
                    # meta = {
                    #     'port': port,
                    #     'proxy': proxy,
                    #     'dont_retry': True,
                    #     'download_timeout': 3,
                    #     '_proxy_scheme': scheme,
                    #     '_proxy_ip': ip,
                    # }
                    yield {
                        'scheme': scheme,
                        'proxy': proxy,
                        'port': port,
                        # 'scheme': response.meta['_proxy_scheme'],
                        # 'proxy': response.meta['proxy'],
                        # 'port': response.meta['port']
                    }
The code above works like any other spider; the scraped proxies get saved as JSON.
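For reference, one way to produce that JSON file is Scrapy's feed export; the output path here is only an assumption chosen to match the PROXY_LIST_FILE used further down:

scrapy crawl proxy -o nike_data_export/proxy.json

Each entry in the file is then shaped roughly like {"scheme": "https", "proxy": "https://1.2.3.4:8080", "port": "8080"}.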
import json
import random
from collections import defaultdict

from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware
from scrapy.exceptions import NotConfigured


class RandomProxyMiddleware(HttpProxyMiddleware):
    def __init__(self, auth_encoding="latin-1", proxy_list_file=None):
        print('RandomProxyMiddleware')
        if not proxy_list_file:
            raise NotConfigured
        self.auth_encoding = auth_encoding
        # map each scheme ('http' / 'https') to a list of (creds, proxy) tuples
        self.proxies = defaultdict(list)
        with open(proxy_list_file) as f:
            proxy_list = json.load(f)
        for proxy in proxy_list:
            scheme = proxy["scheme"]
            url = proxy["proxy"]
            if self._get_proxy(url, scheme) not in self.proxies[scheme]:
                self.proxies[scheme].append(self._get_proxy(url, scheme))

    @classmethod
    def from_crawler(cls, crawler):
        auth_encoding = crawler.settings.get("HTTPPROXY_AUTH_ENCODING", "latin-1")
        proxy_list_file = crawler.settings.get("PROXY_LIST_FILE")
        return cls(auth_encoding, proxy_list_file)

    def _set_proxy(self, request, scheme):
        # called by the parent HttpProxyMiddleware for each request whose
        # scheme has proxies configured; pick one at random instead of a fixed one
        creds, proxy = random.choice(self.proxies[scheme])
        request.meta["proxy"] = proxy
        print(':::::::', request.meta["proxy"], ':::::::')
        if creds:
            request.headers["Proxy-Authorization"] = b"Basic " + creds
The middleware loads the proxy IPs from the JSON file, picks one at random for each request, puts it in request.meta['proxy'], and adds a Proxy-Authorization header when credentials are present. Then wire it up in the settings:
'DOWNLOADER_MIDDLEWARES': {
    'new_arrival.middlewares.RandomProxyMiddleware': 745
}

PROXY_LIST_FILE = '/home/kevin/Git/new_arrival_spiders/new_arrival/nike_data_export/proxy.json'
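Note that Scrapy's built-in HttpProxyMiddleware sits at priority 750 in DOWNLOADER_MIDDLEWARES_BASE, so 745 places the custom middleware just ahead of it; optionally (this is not part of the original setup) the built-in one can be disabled explicitly:

'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None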
With the middleware configured in settings.py, every request goes out through a randomly chosen proxy IP.
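A quick way to confirm the rotation works is a tiny spider that hits https://httpbin.org/ip (the same check URL kept in the proxy spider above) and logs the origin IP the server sees; this is just an illustrative sketch, not part of the original project:

import json
import scrapy


class CheckIpSpider(scrapy.Spider):
    # hypothetical helper spider, only for verifying the proxy rotation
    name = 'check_ip'

    def start_requests(self):
        # send a few identical requests; dont_filter bypasses the duplicate filter
        for _ in range(5):
            yield scrapy.Request('https://httpbin.org/ip', dont_filter=True)

    def parse(self, response):
        # httpbin returns {"origin": "<ip seen by the server>"}
        self.logger.info('origin ip: %s', json.loads(response.text)['origin'])

If the middleware is doing its job, the logged origin IP should change between requests.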