DAY 07 : request_header_cookie, getting past a page's over-18 restriction

11th iThome Ironman Contest

DAY 7

AI & Data

Raising a Crawler King - scrapy series, part 7

kevin8701111

Team NUTC_IMAC_GREEN

2019-09-23 14:44:12

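Previous posts
DAY 01 : Goals and plan for the contest
DAY 02 : Setting up a python3 virtualenv
DAY 03 : python3 request
DAY 04 : Using beautifulsoup4 and lxml
DAY 05 : Grabbing tags with select and find
DAY 06 : Getting values from the list after soup parsing
DAY 07 : request_header_cookie, getting past a page's over-18 restriction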

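Today we'll deal with a page problem on the following board, url : https://www.ptt.cc/bbs/Gossiping/index.html

When you open it, you'll find a button asking you to confirm that you are over 18.

Click agree, then open the developer tools with ctrl + shift + i, switch to the Network tab and select index.html.

Look at the cookie: field in the request headers and you'll see over18=1.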
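The cookie: field in DevTools is a single string, while requests expects a dict. As a rough sketch, the helper below (cookie_header_to_dict is a name made up for this example, and the second header string is hypothetical) uses the standard library to turn a copied header value into that dict:

```python
from http.cookies import SimpleCookie


def cookie_header_to_dict(header):
    """Turn a raw Cookie header value into a plain {name: value} dict."""
    cookie = SimpleCookie()
    cookie.load(header)  # parses "name=value; name2=value2"
    return {name: morsel.value for name, morsel in cookie.items()}


# The value seen in the request header of index.html.
print(cookie_header_to_dict("over18=1"))

# Hypothetical header with an extra cookie, just to show several pairs.
print(cookie_header_to_dict("over18=1; theme=dark"))
```

For a single cookie like over18 it is just as easy to write the dict by hand; the helper only pays off when a site sets many cookies.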

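So we can call requests.get with the url and the cookies to request the page.

Without the cookie, all we get back is the over-18 confirmation page we were redirected to.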

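    import requests

    h_url = 'https://www.ptt.cc/bbs/Gossiping/index.html'
    article = requests.get(
        url=h_url,
        cookies={'over18': '1'},  # PTT over-18 confirmation, same value as in the request header
    )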
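It can be handy to check whether a request really got past the confirmation page. A minimal sketch, assuming the confirmation page lives at the path /ask/over18 (this is an assumption about how PTT redirects, and hit_age_gate is a name made up for this example):

```python
from urllib.parse import urlparse

# Assumption: requests without the cookie are redirected to this path.
AGE_GATE_PATH = "/ask/over18"


def hit_age_gate(final_url):
    """Return True if the URL we ended up on is the over-18 confirmation page."""
    return urlparse(final_url).path == AGE_GATE_PATH


# Example URLs only; nothing is fetched here.
print(hit_age_gate("https://www.ptt.cc/ask/over18?from=%2Fbbs%2FGossiping%2Findex.html"))
print(hit_age_gate("https://www.ptt.cc/bbs/Gossiping/index.html"))
```

With requests, the URL after following redirects is available as article.url, so hit_age_gate(article.url) could be used right after the request.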

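Next, parse the response the same way as before and scrape the data, so we can get the posts on the board along with their basic information.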

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(article.text, 'lxml')
    # [0] takes the first match on the page
    r_ent = soup.select('div.r-ent')[0].text
    a_url = soup.select('div.title > a')[0]['href']  # relative path starting with '/'
    a_title = soup.select('div.title')[0].text
    print(a_title)
    a_author = soup.select('div.author')[0].text
    print(a_author)
    a_date = soup.select('div.date')[0].text
    print(a_date)
    print('https://www.ptt.cc' + a_url)  # no trailing slash, a_url already starts with '/'
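The snippet above only reads the first entry. To collect every post on the page, one option is to loop over each div.r-ent and select inside it. The sketch below is only an illustration: SAMPLE_HTML is made-up markup shaped like a PTT index page so it runs offline, parse_entries is a hypothetical helper, and it uses the built-in html.parser instead of lxml to keep dependencies small.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.ptt.cc"

# Made-up sample shaped like a PTT index page (assumption, not real data).
SAMPLE_HTML = """
<div class="r-ent">
  <div class="title"><a href="/bbs/Gossiping/M.1.A.001.html">first post</a></div>
  <div class="meta"><div class="author">alice</div><div class="date"> 9/23</div></div>
</div>
<div class="r-ent">
  <div class="title">(deleted post)</div>
  <div class="meta"><div class="author">-</div><div class="date"> 9/23</div></div>
</div>
"""


def parse_entries(html):
    """Return one dict per div.r-ent row; url is None when the row has no link."""
    soup = BeautifulSoup(html, "html.parser")
    entries = []
    for row in soup.select("div.r-ent"):
        # Select inside the row so title, author and date stay paired.
        link = row.select_one("div.title > a")
        entries.append({
            "title": row.select_one("div.title").text.strip(),
            "author": row.select_one("div.author").text.strip(),
            "date": row.select_one("div.date").text.strip(),
            "url": urljoin(BASE_URL, link["href"]) if link else None,
        })
    return entries


for entry in parse_entries(SAMPLE_HTML):
    print(entry)
```

Selecting per row also avoids a mismatch when a deleted post has a title but no link, and urljoin takes care of the slash between the host and the path. With a live page, parse_entries(article.text) would be the call.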