Question about using urllib.request.urlopen

Web scraping

ysjhs50048 2019-04-28 23:37:10 ‧ 5085 views

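Hello everyone,
I am a beginner who recently started learning web scraping. I wrote a short piece of code to scrape the job listings on 104. The code is as follows: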

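import urllib.request

# Search URL for the 104 job listings (plain http, as in my attempt)
url = "http://www.104.com.tw/jobs/search/?ro=0&order=11&asc=0&page=1&mode=s&jobsource=2018indexpoc&indArea=8018000000,8020000000,8083000000,8019000000"
# Present the request as coming from Chrome
headers = ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36")
opener = urllib.request.build_opener()
opener.addheaders = [headers]
urllib.request.install_opener(opener)  # make this opener the default for urlopen()
file = opener.open(url)
print(file.getcode())  # HTTP status code of the final response
da = file.read().decode("utf-8", "ignore")  # decode the body, dropping undecodable bytes
print(da)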

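However, the output is the page source of a different URL, https://tls.support.104.com.tw/,
which says the browser version is too old and needs to be upgraded. Is there any way to solve this?
The User-Agent I am using is the one from the Chrome browser.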
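
One way to confirm that the request really ended up on another host is to compare the requested URL with the URL of the final response, which urllib exposes through `response.geturl()`. The sketch below is only an illustration: the helper name `redirected_to_other_host` is made up here, the shortened query string is a stand-in, and it assumes the server answered with a redirect that urllib followed automatically (which would also explain why `getcode()` can still report 200).

```python
from urllib.parse import urlsplit


def redirected_to_other_host(requested_url, final_url):
    """Return True when the final response came from a different host."""
    return urlsplit(requested_url).hostname != urlsplit(final_url).hostname


# With a live response object this would be used as:
#     response = opener.open(url)
#     if redirected_to_other_host(url, response.geturl()):
#         print("Redirected to", response.geturl())
# Offline demonstration with the two URLs mentioned in the question
# (query string shortened for illustration):
requested = "http://www.104.com.tw/jobs/search/?page=1"
final = "https://tls.support.104.com.tw/"
print(redirected_to_other_host(requested, final))
```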

froce (iT邦大師, level 1) ‧ 2019-04-29 07:52:06

https://docs.python.org/3/library/ssl.html

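If you want to avoid these problems and would rather not deal with something this low-level, Python also has convenient tools such as requests and requests-html.
urllib is a fairly low-level library.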
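
For readers who want to stay with urllib, the link to the ssl documentation suggests looking at the TLS settings of the connection. The following is a minimal sketch under that assumption: it has not been confirmed that the TLS version is what triggers the redirect to tls.support.104.com.tw. The helper names `make_tls12_context`, `build_https_opener`, and `fetch` are hypothetical, the shortened User-Agent is illustrative, and `ssl.TLSVersion` requires Python 3.7 or later.

```python
import ssl
import urllib.request

USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"  # illustrative value


def make_tls12_context():
    """Return a certificate-verifying context that refuses anything below TLS 1.2."""
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


def build_https_opener(context):
    """Build an opener whose HTTPS connections use the given SSL context."""
    handler = urllib.request.HTTPSHandler(context=context)
    opener = urllib.request.build_opener(handler)
    opener.addheaders = [("User-Agent", USER_AGENT)]
    return opener


def fetch(url, opener):
    """Return (final_url, decoded_body); not called here, since it needs network access."""
    with opener.open(url, timeout=10) as response:
        return response.geturl(), response.read().decode("utf-8", "ignore")


# Quick diagnostics: which OpenSSL build this Python links against,
# and whether it supports TLS 1.2 at all.
print(ssl.OPENSSL_VERSION)
print(ssl.HAS_TLSv1_2)
```

If `ssl.HAS_TLSv1_2` prints `False`, or the OpenSSL version is very old, upgrading Python itself is probably the more reliable fix than tuning the context.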

ccutmis (iT邦高手, level 2) ‧ 2019-04-29 07:59:12

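# I suggest switching to requests.get. Below is a simple demonstration; for something more complete, you can search online or buy a book...

import requests
url = "https://www.104.com.tw/jobs/search/?ro=0&order=11&asc=0&page=1&mode=s&jobsource=2018indexpoc&indArea=8018000000,8020000000,8083000000,8019000000"
r = requests.get(url).text
# Strip tabs, spaces, and line breaks so the printed output is compact
r = r.replace("\t", "").replace(" ", "").replace("\r\n", "").replace("\n", "")
print(r)

# Next, you can use tools such as lxml, BeautifulSoup, or re to parse the HTML content and extract what you need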
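
As a rough idea of what that parsing step can look like, here is a small sketch that collects links from an HTML string. It deliberately swaps in the standard library's `html.parser` instead of lxml or BeautifulSoup so that it runs without extra installs. The `LinkCollector` class, the `extract_links` helper, and the sample HTML are all made up for illustration and do not reflect the real markup of 104's pages.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently being read, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return every (href, text) pair found in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical markup standing in for a downloaded page.
SAMPLE_HTML = (
    '<ul><li><a href="/job/1">Python Engineer</a></li>'
    '<li><a href="/job/2">Data Analyst</a></li></ul>'
)
print(extract_links(SAMPLE_HTML))
```

Note that removing every space from the page also turns `<a href="...">` into `<ahref="...">`, so it is probably safer to give a parser the unmodified `requests.get(url).text` and keep the stripped version only for printing.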
