Today is Day 10. Continuing from yesterday, we'll finish crawling PTT and show how to download the images to your local machine.
First, let's review yesterday's code.
get_article_content() is the function we'll add today: every time we grab an href, we pass it to this function to crawl that article's content.

import requests
from bs4 import BeautifulSoup
url="https://www.ptt.cc/bbs/Food/index.html"
    
def get_all_href(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "lxml")
    results = soup.select("div.title")
    for item in results:
        a_item = item.select_one("a")
        if a_item:
            get_article_content(article_url='https://www.ptt.cc'+a_item.get('href'))
    print('------------------ next page ------------------')
    
for page in range(1, 4):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "lxml")
    btn = soup.select('div.btn-group > a')
    if btn:
        # btn[3] is the "‹ 上頁" (previous page) button: PTT's newest index page
        # has no page number, so we walk backwards through the older pages
        next_page_url = 'https://www.ptt.cc' + btn[3]['href']
        url = next_page_url
        print('page:', url)
        get_all_href(url=url)
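As a side note on why btn[3] is the right link: 'div.btn-group > a' also matches the two board buttons that sit above the paging bar. Here's a quick offline check against a simplified mock of PTT's markup (the real page has more attributes, and the hrefs below are made-up sample values; html.parser is used so no lxml install is needed):

```python
from bs4 import BeautifulSoup

# Simplified mock of a PTT index page's button bars; hrefs are sample values.
html = """
<div class="btn-group btn-group-dir">
  <a class="btn" href="/bbs/Food/index.html">看板</a>
  <a class="btn" href="/man/Food/index.html">精華區</a>
</div>
<div class="btn-group btn-group-paging">
  <a class="btn wide" href="/bbs/Food/index1.html">最舊</a>
  <a class="btn wide" href="/bbs/Food/index6999.html">‹ 上頁</a>
  <a class="btn wide" href="/bbs/Food/index7001.html">下頁 ›</a>
  <a class="btn wide" href="/bbs/Food/index.html">最新</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
btn = soup.select('div.btn-group > a')
# Indices 0-1 are the board buttons, 2-5 the paging buttons,
# so btn[3] is "‹ 上頁" (previous page).
print(btn[3]['href'])  # /bbs/Food/index6999.html
```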
Next, let's write the part that crawls the article content!
def get_article_content(article_url):
    r = requests.get(article_url)
    soup = BeautifulSoup(r.text, "lxml")
    ...
Now open any article, right-click, and choose Inspect:
You can see that the author, title, and time all sit inside <span class="article-meta-value"> elements, so inside get_article_content we write:
Select every element whose class is article-meta-value. select() returns a list, and it's easy to see that indices 0, 1, 2, 3 hold the author, board, title, and time respectively.

    r = requests.get(article_url)
    soup = BeautifulSoup(r.text, "lxml")
    results = soup.select('span.article-meta-value')
    if results:
        print('Author:', results[0].text)
        print('Board:', results[1].text)
        print('Title:', results[2].text)
        print('Time:', results[3].text)
You should now see these fields being scraped:
Next, let's look at how to download images! Start with the example below:
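To see why those list indices line up, here's a minimal offline sketch against a mock of PTT's article header (the author name, title, and time are made-up sample values, and the real page wraps these spans in div.article-metaline blocks):

```python
from bs4 import BeautifulSoup

# Mock of a PTT article's metadata header with made-up sample values.
html = """
<div id="main-content">
  <div class="article-metaline"><span class="article-meta-tag">作者</span><span class="article-meta-value">foodlover (吃貨)</span></div>
  <div class="article-metaline-right"><span class="article-meta-tag">看板</span><span class="article-meta-value">Food</span></div>
  <div class="article-metaline"><span class="article-meta-tag">標題</span><span class="article-meta-value">[食記] 台北 美味拉麵</span></div>
  <div class="article-metaline"><span class="article-meta-tag">時間</span><span class="article-meta-value">Mon Sep 30 12:00:00 2019</span></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
results = soup.select('span.article-meta-value')
# Document order gives us author, board, title, time at indices 0-3.
print(results[0].text)  # foodlover (吃貨)
print(results[3].text)  # Mon Sep 30 12:00:00 2019
```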
import requests
import shutil
img_url = 'https://www.google.com/images/branding/googlelogo/2x/googlelogo_color_272x92dp.png'
img_name = 'google'
r = requests.get(img_url, stream=True)
r.raw.decode_content = True  # make sure the raw stream is decompressed if the server gzips it
file_name = img_name
with open('./' + file_name + '.jpg', 'wb') as out_file:
    shutil.copyfileobj(r.raw, out_file)
https://www.google.com/images/branding/googlelogo/2x/googlelogo_color_272x92dp.png is Google's logo. (To get any image's URL, just right-click it and choose "Copy image address".) Using this image as the example, we send a request to the image URL with requests.get(), then save the raw content of the response to a file; the bytes are written verbatim, so the .jpg extension here is only a naming choice.
So we can add an image-download function to our code:
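The key idea is that shutil.copyfileobj() simply streams bytes from one file-like object into another, and with stream=True the response's r.raw behaves like a readable binary file. Here's an offline sketch with io.BytesIO standing in for r.raw:

```python
import io
import os
import shutil
import tempfile

# io.BytesIO stands in for r.raw: both are readable binary file-like objects.
fake_raw = io.BytesIO(b'\x89PNG fake image bytes')

out_path = os.path.join(tempfile.gettempdir(), 'demo_logo.jpg')
with open(out_path, 'wb') as out_file:
    # Copies in chunks, so even a large image never sits fully in memory.
    shutil.copyfileobj(fake_raw, out_file)

with open(out_path, 'rb') as f:
    copied = f.read()
os.remove(out_path)
print(copied == b'\x89PNG fake image bytes')  # True - the bytes are copied verbatim
```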
Then just pass in the image link and the image name! Remember to create the image folder first.

def download_img_from_article(img_url, img_name):
    r = requests.get(img_url, stream=True)
    file_name = str(img_name + 1)  # img_name is a running count, so saved files are numbered from 1
    print('save img to ./image/' + file_name + '.jpg')
    try:
        with open('./image/' + file_name + '.jpg', 'wb') as out_file:
            shutil.copyfileobj(r.raw, out_file)
    except Exception:
        print('can not save img', img_url)
So how do we find the image links?
Images in a PTT article show up as <a> tags, so we grab every <a> tag and check whether .jpg (you can also check for .jpeg) appears in its href, using an image_count counter as the image name. Inside get_article_content, add:

    image_count = 0
    imgs = soup.find_all('a')
    for img in imgs:
        if '.jpg' in img.get('href', ''):
            download_img_from_article(img_url=img['href'], img_name=image_count)
            image_count += 1
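The filtering step can be checked offline against a mock article body (the links below are made-up samples; only the one whose href contains .jpg should be picked up):

```python
from bs4 import BeautifulSoup

# Mock article body with one image link and one ordinary link (sample URLs).
html = """
<div id="main-content">
  <a href="https://i.imgur.com/abc123.jpg">https://i.imgur.com/abc123.jpg</a>
  <a href="https://maps.google.com/some-restaurant">店家地圖</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
# .get('href', '') avoids a KeyError on <a> tags that have no href at all.
img_links = [a['href'] for a in soup.find_all('a') if '.jpg' in a.get('href', '')]
print(img_links)  # ['https://i.imgur.com/abc123.jpg']
```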
And with that, our PTT crawler is complete! That wraps up our first hands-on exercise!