I need to scrape the foodpanda website for a homework assignment.
It worked fine until just now, when it suddenly stopped returning data,
so I modified my code to track down the problem:
import bs4
import requests

def foodpanda(city_name):
    # Dictionary mapping city names to their URL slugs
    city = {"台北市":"taipei-city", "新北市":"new-taipei-city", "台中市":"taichung-city", "高雄市":"kaohsiung-city",
            "新竹市":"hsinchu-city", "桃園市":"taoyuan-city", "基隆市":"keelung", "台南市":"tainan-city",
            "苗栗市":"miaoli-county", "嘉義市":"chiayi-city", "彰化市":"changhua", "宜蘭縣":"yilan-city",
            "屏東縣":"pingtung-city", "雲林縣":"yunlin-county", "花蓮市":"hualien", "南投市":"nantou-county",
            "台東市":"taitung-county", "澎湖縣":"penghu-city", "金門縣":"kinmen-city"}
    if city_name in city:
        # If the input name is in the dictionary, build the matching URL
        city_url = city[city_name]
        url = "https://www.foodpanda.com.tw/city/" + city_url
        header = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.6 (KHTML, like Gecko) "
                                "Chrome/45.0.2454.101 Safari/537.36"}
        # Download the page
        response = requests.get(url, headers=header)
        # Parse the downloaded page
        search = bs4.BeautifulSoup(response.text, "lxml")
        print(search.text)

foodpanda("台北市")
The result is:
Please verify you are a human
Access to this page has been denied because we believe you are using automation tools to browse the website.
This may happen as a result of the following:
Javascript is disabled or blocked by an extension (ad blockers for example)
Your browser does not support cookies
Please make sure that Javascript and cookies are enabled on your browser and that you are not blocking them from loading.
Reference ID: #b9b6dc10-d32d-11eb-b71e-e1e76f2daa6e
Powered by PerimeterX, Inc.
How can I solve this? Here is my second attempt:
import requests

def foodpanda(city_name):
    # Dictionary mapping city names to their URL slugs
    city = {"台北市":"taipei-city", "新北市":"new-taipei-city", "台中市":"taichung-city", "高雄市":"kaohsiung-city",
            "新竹市":"hsinchu-city", "桃園市":"taoyuan-city", "基隆市":"keelung", "台南市":"tainan-city",
            "苗栗市":"miaoli-county", "嘉義市":"chiayi-city", "彰化市":"changhua", "宜蘭縣":"yilan-city",
            "屏東縣":"pingtung-city", "雲林縣":"yunlin-county", "花蓮市":"hualien", "南投市":"nantou-county",
            "台東市":"taitung-county", "澎湖縣":"penghu-city", "金門縣":"kinmen-city"}
    if city_name in city:
        # If the input name is in the dictionary, build the matching URL
        city_url = city[city_name]
        url = "https://www.foodpanda.com.tw/city/" + city_url
        header = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
            "Accept-Encoding": "gzip, deflate, br",
            "Accept-Language": "zh-TW,zh;q=0.8,en-US;q=0.5,en;q=0.3",
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
        }
        session = requests.Session()
        response = session.get(url, headers=header)
        print(response.text)

foodpanda("台北市")
If you're going to impersonate a browser, make it more convincing...
And don't send too many requests in a row; space them out, or they'll start playing hide-and-seek with you again.
You've simply been detected as a crawler and blocked.
You'll have to research the workaround yourself.
All I can tell you is that they've probably added some security mechanism; exactly what kind, you'd have to investigate.
Start with the headers, then add a way to obtain cookies. That might do it.
But I don't have time to research this for you. Getting a crawler past someone's defenses is a discipline in its own right.
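To make the "disguise better and slow down" advice above concrete, here is a minimal sketch: a `requests.Session` (which keeps cookies between requests, like a real browser) with fuller headers, plus a randomized pause between fetches. The header values and the 3–8 second delay range are assumptions, not values the site documents; whether this alone gets past PerimeterX is not guaranteed.

```python
import random
import time

import requests

# Illustrative browser-like headers; values are assumptions, not requirements
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-TW,zh;q=0.8,en-US;q=0.5,en;q=0.3",
}

def make_session():
    """Build a Session that keeps cookies across requests, like a real browser."""
    session = requests.Session()
    session.headers.update(HEADERS)
    return session

def fetch_slowly(session, urls):
    """Fetch each URL with a random pause so requests are not machine-gunned."""
    pages = []
    for u in urls:
        pages.append(session.get(u))
        time.sleep(random.uniform(3, 8))  # assumed polite delay, tune as needed
    return pages
```

The point of the `Session` is that a cookie set by the first response is sent back automatically on later requests, which a bare `requests.get` per URL never does.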
Crawling versus anti-crawling is a spear-and-shield arms race.
I especially love taking on this kind of project; a while later you get hired again for the rework.
They've identified you as a crawler.
Suggestions: 1. Check whether your browser impersonation is correct (cookies, JS)
2. Check whether you've exceeded a crawl-rate limit (too many requests)
3. Try a different machine and IP
4. Try a different approach (model)
With anti-crawling in place, your only option is to switch to learning selenium + the BeautifulSoup package.
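A minimal sketch of the selenium route suggested here: drive a real browser so the site's JavaScript check can actually run, then hand the rendered HTML to BeautifulSoup as before. The Chrome + ChromeDriver setup is an assumption (any WebDriver works), and both must be installed separately.

```python
def fetch_with_browser(url):
    """Fetch a page through a real browser so the site's JavaScript check can run.

    Assumes selenium and a matching ChromeDriver are installed; the answer
    above only names the libraries, so treat this as a sketch.
    """
    from selenium import webdriver  # imported lazily; selenium is a separate install
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return driver.page_source  # HTML after JavaScript has executed
    finally:
        driver.quit()
```

The returned `page_source` can be passed straight to `bs4.BeautifulSoup(...)`, so the parsing code from the question is reusable unchanged.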
I need to scrape the foodpanda website for my homework.
It's really urgent, and I haven't learned anything else yet, so I may not make it in time.
If the mountain won't move, take a different road:
download the pages manually with your browser.
It's only a dozen or so URLs anyway.
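The manual route above still needs a parsing step. A minimal sketch, assuming each page was saved from the browser as an `.html` file into a local folder (the `pages/` folder name is hypothetical):

```python
import pathlib

import bs4

def parse_saved_page(path):
    """Parse one manually saved page from disk instead of fetching it over HTTP."""
    html = pathlib.Path(path).read_text(encoding="utf-8")
    return bs4.BeautifulSoup(html, "html.parser")

# Loop over every page saved into a local "pages/" folder (hypothetical name)
for page in sorted(pathlib.Path("pages").glob("*.html")):
    soup = parse_saved_page(page)
    print(page.name, soup.title.string if soup.title else "(no title)")
```

Since the files already contain the fully rendered HTML from a real browser, no headers, cookies, or anti-bot tricks are involved at all.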