Downloading PDF files with Python

python crawler download

liwei290341 2018-11-04 23:02:58 ‧ 6400 views

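Hello everyone. I started programming less than a month and a half ago, and I am currently learning web scraping and file downloading.
I previously managed to download past exam papers from NTUE (國北), but I noticed that some of them were never downloaded. After a long search I found that not every PDF link starts with http, so those links were reported as "cannot be read"; only the PDFs whose links start with http were downloaded.

How should I handle this situation?
My code is below.

Any help would be appreciated. Thank you!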

1 answer

淺水員

iT邦大師 level 6 ‧ 2018-11-05 00:33:36

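Current page: http://academic.ntue.edu.tw/files/11-1007-467.php
If a link starts with "/", it is relative to the site root, so the actual URL is http://academic.ntue.edu.tw followed by the link.
If a link starts directly with letters (no "http" and no leading "/"), it is relative to the current page's directory, so the actual URL is http://academic.ntue.edu.tw/files/ followed by the link.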
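
These two rules are the ones the standard library's `urllib.parse.urljoin` applies when it resolves a link against the page URL. A minimal sketch, assuming made-up link values (`/ezfiles/7/exam.pdf` and `exam.pdf` are hypothetical and not taken from the actual page):

```python
from urllib.parse import urljoin

# The page being scraped (from the question).
page_url = "http://academic.ntue.edu.tw/files/11-1007-467.php"

# Hypothetical href values, for illustration only.
root_relative = "/ezfiles/7/exam.pdf"  # starts with "/"
doc_relative = "exam.pdf"              # starts with a plain name
absolute = "http://academic.ntue.edu.tw/ezfiles/7/exam.pdf"

# urljoin picks the right rule for each kind of link.
resolved_root = urljoin(page_url, root_relative)
resolved_doc = urljoin(page_url, doc_relative)
resolved_abs = urljoin(page_url, absolute)

print(resolved_root)  # host + link
print(resolved_doc)   # page directory (/files/) + link
print(resolved_abs)   # already absolute, left unchanged
```

With this approach the base URL does not have to be hard-coded for each case; the page URL alone is enough.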

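liwei290341 iT邦新手 level 5 ‧ 2018-11-05 21:17:20

OK, I will give it a try!
Thank you for the guidance!

liwei290341 iT邦新手 level 5 ‧ 2018-11-05 21:30:50

After changing my code, the links that start with "/" still do not download. Could you check whether I made a mistake somewhere?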

import requests,os
from bs4 import BeautifulSoup
from urllib.request import urlopen
#from urlparse import urljoin

url = 'http://academic.ntue.edu.tw/files/11-1007-467.php'
html = requests.get(url)
html.encoding='utf-8'


sp=BeautifulSoup(html.text,"html.parser")

# Create the output directory
pdf_dir="pdfs/"
if not os.path.exists(pdf_dir):
    os.mkdir(pdf_dir)


links=sp.find_all("a")
for link in links:
    href=link.get("href")
    attrs=[href]
    for attr in attrs:
        if href != None and href.startswith("http://academic.ntue.edu.tw"):
            full_path = attr
            filename = full_path.split('/')[-1]
            print(full_path)
            try:
                pdf = urlopen(full_path)
                f = open(os.path.join(pdf_dir,filename),'wb')
                f.write(pdf.read())
                f.close()
            except:
                print ("{} cannot be read!".format(filename))

淺水員 iT邦大師 level 6 ‧ 2018-11-05 23:33:44

links = sp.find_all("a")
for link in links:
    href = link.get("href")
    # No need for the extra attr loop; href is already the link

    # Skip anything that does not end in .pdf
    if href is None or href.split('.')[-1] != 'pdf':
        continue

    # Build the full URL
    if href[0:4] == 'http':
        full_path = href
    elif href[0] == '/':  # links that start with /
        full_path = "http://academic.ntue.edu.tw" + href
    else:  # everything else is a relative path
        full_path = "http://academic.ntue.edu.tw/files/" + href
    print(full_path)

    # After this, fetch full_path and save it
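
The final step, fetching `full_path` and writing it to disk, is only a comment in the loop. A minimal sketch of that step, assuming the standard library's `urlopen` as in the question's code; the function name `save_pdf` is hypothetical and not part of the original answer:

```python
import os
from urllib.request import urlopen


def save_pdf(full_path, pdf_dir="pdfs"):
    """Download full_path into pdf_dir.

    Returns the saved file path, or None if the URL could not be read.
    """
    os.makedirs(pdf_dir, exist_ok=True)
    # Same naming rule as the question's code: last segment of the URL.
    filename = full_path.split("/")[-1]
    target = os.path.join(pdf_dir, filename)
    try:
        with urlopen(full_path) as response:
            data = response.read()
    except (OSError, ValueError) as err:
        # URLError and HTTPError are subclasses of OSError;
        # ValueError is raised for a string that is not a valid URL.
        print("{} cannot be read: {}".format(filename, err))
        return None
    with open(target, "wb") as f:
        f.write(data)
    return target
```

Inside the loop this would be called as `save_pdf(full_path)` right after `print(full_path)`. Catching specific exceptions instead of a bare `except:` keeps the reason for a failed download visible.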

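One more reminder: the site has files with the same name in different folders.
So if you simply use

filename = full_path.split('/')[-1]

as the saved file name, later files will overwrite earlier ones.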
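
One possible way to avoid the overwrite is to build the local file name from the whole URL path rather than only its last segment. This is a sketch under assumptions: the helper name `local_name_for` and the example paths are hypothetical, since the page's real folder layout is not shown in this thread.

```python
from urllib.parse import unquote, urlsplit


def local_name_for(url):
    """Flatten a URL's path into a single file name.

    "/a/b/exam.pdf" becomes "a_b_exam.pdf", so files that share a name
    but live in different folders get different local names.
    """
    path = unquote(urlsplit(url).path)
    # Drop empty segments produced by leading or doubled slashes.
    parts = [p for p in path.split("/") if p]
    # Fall back to a fixed name if the URL has no path at all.
    return "_".join(parts) or "download"


# Hypothetical URLs: same file name, different folders.
first = local_name_for("http://academic.ntue.edu.tw/ezfiles/7/a/exam.pdf")
second = local_name_for("http://academic.ntue.edu.tw/ezfiles/7/b/exam.pdf")
print(first)
print(second)
```

The result can replace `full_path.split('/')[-1]` when building the path passed to `open`. Mirroring the folders on disk with `os.makedirs` would work as well; flattening just keeps everything in one directory.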

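liwei290341 iT邦新手 level 5 ‧ 2018-11-07 22:14:08

There still seem to be a few problems, but I am making progress. I will keep looking into it. Thanks for the help!

liwei290341 iT邦新手 level 5 ‧ 2018-11-14 18:10:14

Everything works now. Thank you for the answer, I really appreciate it!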
