以天瓏書局的新書頁面為例。
以下為其中一本書的html原始碼。
<li class="single-book">
<a class="cover" href="/products/9786267383926?list_name=r-zh_tw">
<img alt="AI 時代的資料科學:小白到數據專家的全面指南 -cover" src="https://cf-assets2.tenlong.com.tw/products/images/000/213/974/medium/DM2453-AI%E6%99%82%E4%BB%A3%E7%9A%84%E8%B3%87%E6%96%99%E7%A7%91%E5%AD%B8_750x933.jpg?1722478765" />
<span class="label-blue">79折</span>
</a> <div class="pricing">
<del>$1,080</del>
$853
<form class="button_to" method="post" action="/cart?id=9786267383926&list_name=r-zh_tw"><button class="add-to-cart ml-1 px-2 rounded text-tl_blue hover:text-white hover:bg-tl_blue bg-gray-200" type="submit">
<i class="fa fa-shopping-cart"></i>
</button><input type="hidden" name="authenticity_token" value="eRP7yCQixSqiUWZzBkHDKEkOoWnI/MmHa5/d0zGjfHpRjy0Zgo3nMfldFvjOSfGo2aHiynO3HSzNHnsW0tj52g==" /></form> </div>
<strong class="title">
<a title="AI 時代的資料科學:小白到數據專家的全面指南 " href="/products/9786267383926?list_name=r-zh_tw">AI 時代的資料科學:小白到數據專家的全面指南 </a>
</strong>
</li>
如果要爬取這個網站的書本名稱、價錢、打折數:
import requests as req
from openpyxl import Workbook
from bs4 import BeautifulSoup
# 建立 Excel 文件
wb = Workbook()
ws = wb.active
title = ['書名', '原價', '售價', '打折數']
ws.append(title)
# 設定 User-Agent
header = {
'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.6 Safari/605.1.15'
}
import requests as req
from openpyxl import Workbook
from bs4 import BeautifulSoup
# 建立 Excel 文件
wb = Workbook()
ws = wb.active
title = ['書名', '原價', '售價', '打折數']
ws.append(title)
# 設定 User-Agent
header = {
'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.6 Safari/605.1.15'
}
for index in range(3):
url = 'https://www.tenlong.com.tw/tw/recent?page='
url = url + str(index+1)
print(url) #輸出範例:https://www.tenlong.com.tw/tw/recent?page=1
re = req.get(url, headers=header)
print(re) #輸出範例:<Response [200]>
#'html.parser' 是用來解析內容的
soup = BeautifulSoup(re.content, 'html.parser')
# 閱覽書籍項目
books = soup.find_all('li', class_='single-book')
for book in books:
#書名
title_tag = book.find('strong', class_='title').find('a')
title = title_tag.get('title')
# print(title)
#原本價格
pricing_tag = book.find('div', class_='pricing').find('del')
pricing = pricing_tag.text.strip()
# print(pricing)
#折後價格
#next_sibling.strip():從一個元素獲取其下一個節點的文本
dis_pricing = '無折扣價格'
if pricing and pricing_tag.next_sibling.strip():
dis_pricing = pricing_tag.next_sibling.strip()
# print(dis_pricing)
#打折數
discount = book.find('span', class_='label-blue')
discount = discount.text.strip() if discount else '無打折數'
# print(discount)
print(f'書名: {title}, 原價: {pricing}, 折後價格: {dis_pricing}, 打折數: {discount}')
#輸出資料到文件中
ws.append([title, pricing, dis_pricing, discount])
# 輸出Excel文件
wb.save('books.xlsx')
<li class="single-book">
<a class="cover" href="/products/A000003?list_name=r-zh_tw">
<img alt="演算法導論, 4/e + Introduction to Algorithms, 4/e (中英文合購)-cover" src="https://cf-assets2.tenlong.com.tw/products/images/000/218/298/medium/%E6%BC%94%E7%AE%97%E6%B3%95%E5%90%88%E8%B3%BC-cover.jpg?1723799183" />
</a> <div class="pricing">
<del>$3,500</del>
$3,500
<form class="button_to" method="post" action="/cart?id=A000003&list_name=r-zh_tw"><button class="add-to-cart ml-1 px-2 rounded text-tl_blue hover:text-white hover:bg-tl_blue bg-gray-200" type="submit">
<i class="fa fa-shopping-cart"></i>
</button><input type="hidden" name="authenticity_token" value="oi2Eblu6Puoyl15b74hGlIDLurOPEU193xnwk3OoL76KsVK//RUc8WmbLtAngHQUEGT5EDRamdZ5mFZWkNOqHg==" /></form> </div>
<strong class="title">
<a title="演算法導論, 4/e + Introduction to Algorithms, 4/e (中英文合購)" href="/products/A000003?list_name=r-zh_tw">演算法導論, 4/e + Introduction to Algorithms, 4/e (中英文合購)</a>
</strong>
</li>
print(f'書名: {title}, 原價: {pricing}, 折後價格: {dis_pricing}, 打折數: {discount}')的結果如下:
wb.save('books.xlsx')輸出的excel檔案如下: