[Day25] Python Requests 模組-爬取網站資料

2024 iThome 鐵人賽

DAY 25

自我挑戰組

從零開始學Python系列第 25 篇

16th鐵人賽

pochama

2024-09-15 21:26:54

119 瀏覽

分享至

以天瓏書局的新書頁面為例。
以下為其中一本書的html原始碼。

<li class="single-book">
  <a class="cover" href="/products/9786267383926?list_name=r-zh_tw">

      <img alt="AI 時代的資料科學：小白到數據專家的全面指南 -cover" src="https://cf-assets2.tenlong.com.tw/products/images/000/213/974/medium/DM2453-AI%E6%99%82%E4%BB%A3%E7%9A%84%E8%B3%87%E6%96%99%E7%A7%91%E5%AD%B8_750x933.jpg?1722478765" />

    <span class="label-blue">79折</span>


</a>  <div class="pricing">
      <del>$1,080</del>

      $853
    <form class="button_to" method="post" action="/cart?id=9786267383926&amp;list_name=r-zh_tw"><button class="add-to-cart ml-1 px-2 rounded text-tl_blue hover:text-white hover:bg-tl_blue bg-gray-200" type="submit">
      <i class="fa fa-shopping-cart"></i>
</button><input type="hidden" name="authenticity_token" value="eRP7yCQixSqiUWZzBkHDKEkOoWnI/MmHa5/d0zGjfHpRjy0Zgo3nMfldFvjOSfGo2aHiynO3HSzNHnsW0tj52g==" /></form>  </div>

  <strong class="title">
    <a title="AI 時代的資料科學：小白到數據專家的全面指南 " href="/products/9786267383926?list_name=r-zh_tw">AI 時代的資料科學：小白到數據專家的全面指南 </a>
  </strong>
</li>

如果要爬取這個網站的書本名稱、價錢、打折數：

需要的模組

import requests as req
from openpyxl import Workbook
from bs4 import BeautifulSoup

建立一個excel檔案來存放資料

# 建立 Excel 文件
wb = Workbook()
ws = wb.active

title = ['書名', '原價', '售價', '打折數']
ws.append(title)

設定user-agent來防止爬蟲的網頁阻擋

# 設定 User-Agent
header = {
    'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.6 Safari/605.1.15'
}

寫一個for迴圈來爬取每一頁的資料，並將資料擷取後放入先前建立的excel檔中

import requests as req
from openpyxl import Workbook
from bs4 import BeautifulSoup

# 建立 Excel 文件
wb = Workbook()
ws = wb.active

title = ['書名', '原價', '售價', '打折數']
ws.append(title)

# 設定 User-Agent
header = {
    'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.6 Safari/605.1.15'
}

for index in range(3):
    url = 'https://www.tenlong.com.tw/tw/recent?page='
    url = url + str(index+1)
    print(url) #輸出範例：https://www.tenlong.com.tw/tw/recent?page=1
    re = req.get(url, headers=header)
    print(re) #輸出範例：<Response [200]>

    #'html.parser' 是用來解析內容的
    soup = BeautifulSoup(re.content, 'html.parser')
    # 閱覽書籍項目
    books = soup.find_all('li', class_='single-book')
    for book in books:
        #書名
        title_tag = book.find('strong', class_='title').find('a')
        title = title_tag.get('title')
        # print(title)

        #原本價格
        pricing_tag = book.find('div', class_='pricing').find('del')
        pricing = pricing_tag.text.strip()
        # print(pricing)

        #折後價格
        #next_sibling.strip():從一個元素獲取其下一個節點的文本
        dis_pricing = '無折扣價格'
        if pricing and pricing_tag.next_sibling.strip():
            dis_pricing = pricing_tag.next_sibling.strip()
        # print(dis_pricing)

        #打折數
        discount = book.find('span', class_='label-blue')
        discount = discount.text.strip() if discount else '無打折數'

        # print(discount)
        print(f'書名: {title}, 原價: {pricing}, 折後價格: {dis_pricing}, 打折數: {discount}')

        #輸出資料到文件中
        ws.append([title, pricing, dis_pricing, discount])

# 輸出Excel文件
wb.save('books.xlsx')

中間會遇到如下（沒有折扣的書籍），所以擷取資料的時候需要用if/else來獲取資料

<li class="single-book">
  <a class="cover" href="/products/A000003?list_name=r-zh_tw">

      <img alt="演算法導論, 4/e + Introduction to Algorithms, 4/e (中英文合購)-cover" src="https://cf-assets2.tenlong.com.tw/products/images/000/218/298/medium/%E6%BC%94%E7%AE%97%E6%B3%95%E5%90%88%E8%B3%BC-cover.jpg?1723799183" />

    


</a>  <div class="pricing">
      <del>$3,500</del>

      $3,500
    <form class="button_to" method="post" action="/cart?id=A000003&amp;list_name=r-zh_tw"><button class="add-to-cart ml-1 px-2 rounded text-tl_blue hover:text-white hover:bg-tl_blue bg-gray-200" type="submit">
      <i class="fa fa-shopping-cart"></i>
</button><input type="hidden" name="authenticity_token" value="oi2Eblu6Puoyl15b74hGlIDLurOPEU193xnwk3OoL76KsVK//RUc8WmbLtAngHQUEGT5EDRamdZ5mFZWkNOqHg==" /></form>  </div>

  <strong class="title">
    <a title="演算法導論, 4/e + Introduction to Algorithms, 4/e (中英文合購)" href="/products/A000003?list_name=r-zh_tw">演算法導論, 4/e + Introduction to Algorithms, 4/e (中英文合購)</a>
  </strong>
</li>

print(f'書名: {title}, 原價: {pricing}, 折後價格: {dis_pricing}, 打折數: {discount}')的結果如下：
wb.save('books.xlsx')輸出的excel檔案如下：