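Continuing from the previous post: in ordinary scraping it rarely matters if we pull back a lot of data, but this time the results are sent to users on LINE, and too much data at once is a nuisance for them. So we need to cap the number of items we return.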
import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.atmovies.com.tw/movie/new/')
r.encoding = 'utf-8'
soup = BeautifulSoup(r.text, 'lxml')
filmTitle = soup.select('div.filmTitle a')
content = ""
# enumerate yields (index, element) pairs, so unpack both
for i, data in enumerate(filmTitle):
    if i >= 10:  # stop after the first ten titles (indices 0-9)
        break
    content += data.text + "\n" + "http://www.atmovies.com.tw/" + data['href'] + "\n\n"
print(content)
Here a for loop enforces the limit. There are other ways to get the same effect, for example taking only the first ten items of the list.
import requests
from bs4 import BeautifulSoup
r = requests.get('http://www.atmovies.com.tw/movie/new/')
r.encoding = 'utf-8'
soup = BeautifulSoup(r.text, 'lxml')
filmTitle = soup.select('div.filmTitle a')[:10]
print(filmTitle)
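To compare the loop and the slice without hitting the website, here is a minimal sketch that works on plain (title, href) pairs instead of BeautifulSoup tags. The sample data and the helper names format_with_loop and format_with_slice are made up for illustration and are not part of the original scraper. urljoin from the standard library builds the absolute link, so a leading slash in href does not produce a double slash.

```python
from urllib.parse import urljoin

BASE_URL = "http://www.atmovies.com.tw/"


def format_with_loop(entries, limit=10):
    """Cap the output with an index check inside the loop."""
    content = ""
    for i, (title, href) in enumerate(entries):
        if i >= limit:  # i starts at 0, so >= keeps exactly `limit` items
            break
        content += title + "\n" + urljoin(BASE_URL, href) + "\n\n"
    return content


def format_with_slice(entries, limit=10):
    """Cap the output by slicing the list before formatting."""
    return "".join(
        title + "\n" + urljoin(BASE_URL, href) + "\n\n"
        for title, href in entries[:limit]
    )


# Hypothetical sample data standing in for the scraped <a> tags.
sample = [(f"Movie {k}", f"/movie/id{k}/") for k in range(1, 16)]
print(format_with_slice(sample, limit=2))
```

Both helpers return the same string for the same limit. The slice version is shorter, while the loop version is easier to extend if we later want to skip some entries.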
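Alternatively, use the limit parameter of find_all(). Note that this call returns the div.filmTitle elements themselves rather than the a tags inside them.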
import requests
from bs4 import BeautifulSoup
r = requests.get('http://www.atmovies.com.tw/movie/new/')
r.encoding = 'utf-8'
soup = BeautifulSoup(r.text, 'lxml')
filmTitle = soup.find_all('div', class_="filmTitle", limit=10)
print(filmTitle)
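Slicing needs a list, and limit is specific to find_all(). If the titles ever come from a generator or another lazy source, itertools.islice from the standard library gives the same cap. The sketch below is only an illustration: first_n and fake_titles are hypothetical names, and fake_titles stands in for real scraped data.

```python
from itertools import islice


def first_n(iterable, n=10):
    """Return at most n items from any iterable, as a list."""
    return list(islice(iterable, n))


def fake_titles():
    # Hypothetical stand-in for scraped titles; yields them one at a time.
    for k in range(1, 1000):
        yield f"Movie {k}"


print(first_n(fake_titles(), 3))
```

islice stops pulling from the source once it has n items, so the remaining titles are never generated.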
That leaves the last step: adding the scraper to the LINE bot.
import os

import requests
from bs4 import BeautifulSoup
from flask import Flask, request, abort
from linebot import LineBotApi, WebhookHandler
from linebot.exceptions import InvalidSignatureError
from linebot.models import MessageEvent, TextMessage, TextSendMessage

app = Flask(__name__)
line_bot_api = LineBotApi(os.environ.get('CHANNEL_ACCESS_TOKEN'))
handler = WebhookHandler(os.environ.get('CHANNEL_SECRET'))


@app.route("/callback", methods=['POST'])
def callback():
    # get X-Line-Signature header value
    signature = request.headers['X-Line-Signature']
    # get request body as text
    body = request.get_data(as_text=True)
    app.logger.info("Request body: " + body)
    # handle webhook body
    try:
        handler.handle(body, signature)
    except InvalidSignatureError:
        print("Invalid signature. Please check your channel access token/channel secret.")
        abort(400)
    return 'OK'


@handler.add(MessageEvent, message=TextMessage)
def handle_message(event):
    # '本周新片' ("new movies this week") is the message text that triggers the scraper
    if event.message.text == '本周新片':
        r = requests.get('http://www.atmovies.com.tw/movie/new/')
        r.encoding = 'utf-8'
        soup = BeautifulSoup(r.text, 'lxml')
        content = []
        for i, data in enumerate(soup.select('div.filmTitle a')):
            if i >= 20:  # keep at most twenty titles (indices 0-19)
                break
            content.append(data.text + '\n' + 'http://www.atmovies.com.tw' + data['href'])
        line_bot_api.reply_message(
            event.reply_token,
            TextSendMessage(text='\n\n'.join(content))
        )


if __name__ == "__main__":
    app.run()
# Without this block, you would need to set an environment variable such as FLASK_APP=app.py instead
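The bot caps the number of titles, but a single LINE text message also has a length limit. Below is a minimal sketch of a helper that joins whole entries and stops before the text gets too long. The name join_within_limit is hypothetical, and the 5000-character figure is an assumption that should be checked against the current Messaging API documentation. Python's len() counts Unicode code points, which may not match exactly how LINE counts characters, so leaving some margin is reasonable.

```python
# Assumed maximum length of one LINE text message; verify in the Messaging API docs.
MAX_TEXT_LENGTH = 5000


def join_within_limit(entries, max_length=MAX_TEXT_LENGTH, separator="\n\n"):
    """Join whole entries with `separator`, stopping before max_length is exceeded."""
    kept = []
    used = 0
    for entry in entries:
        # The separator is only needed between entries, not before the first one.
        extra = len(entry) + (len(separator) if kept else 0)
        if used + extra > max_length:
            break
        kept.append(entry)
        used += extra
    return separator.join(kept)


# Hypothetical entries in the same "title, newline, link" shape the bot builds.
entries = [f"Movie {k}\nhttp://www.atmovies.com.tw/movie/id{k}/" for k in range(1, 4)]
print(join_within_limit(entries, max_length=100))
```

In handle_message this could replace the plain '\n\n'.join(content), so an unusually long list of titles is trimmed to whole entries rather than sent in full.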