11th iT邦幫忙 Ironman Contest

AI & Data

Hands on Data Cleaning and Scraping series, part 25

Day25 Beautiful Soup Try Out: Stepstone Posting

This is my first try at Beautiful Soup, so I am starting small. The goal is to download the job listings from Stepstone, a German job site: only the first results page with 100 postings, without looping over further pages.
https://ithelp.ithome.com.tw/upload/images/20190925/20119709NFOZh4JktC.jpg

import requests
from bs4 import BeautifulSoup
# Specify the URL
url = "https://www.stepstone.de/5/job-search-simple.html?stf=freeText&ns=1&companyid=0&sourceofthesearchfield=resultlistpage%3Ageneral&qs=%5B%5D&ke=Junior%20Data%20Scientist&ws=Berlin&ra=10&suid=b830ebdc-e1ed-43cf-931b-006b0ad341c5&li=100&of=0&action=per_page_changed"
resp = requests.get(url)

resp.encoding = 'utf-8' # Set the encoding to UTF-8 (used when decoding resp.text)

# Show the HTTP status code; 200 means the request succeeded
resp.status_code 

https://ithelp.ithome.com.tw/upload/images/20190925/20119709FzLdv6lEda.jpg
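
The search URL carries the whole query in its parameters. The sketch below decodes a shortened copy of it with the standard library so the individual settings are easier to read. Reading `li` as the number of listings per page and `of` as the result offset is an assumption based on the values in the URL, not something confirmed by Stepstone.

```python
from urllib.parse import urlsplit, parse_qs

# Shortened copy of the search URL, keeping only a few of its parameters.
url = (
    "https://www.stepstone.de/5/job-search-simple.html"
    "?ke=Junior%20Data%20Scientist&ws=Berlin&ra=10&li=100&of=0"
)

# parse_qs decodes percent-escapes and returns each value as a list.
params = parse_qs(urlsplit(url).query)

for key, values in params.items():
    print(key, "=", values[0])
```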

# Create a BeautifulSoup object
soup = BeautifulSoup(resp.content, 'html.parser')

Inspecting the page in the browser shows that Stepstone wraps each job posting in an article element. Print the first one to see what it looks like.
https://ithelp.ithome.com.tw/upload/images/20190925/20119709CoZg7Jiz4e.jpg

listing = soup.find_all('article')
print(listing[0])

https://ithelp.ithome.com.tw/upload/images/20190925/20119709CNzjoiOJiQ.jpg
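
To see what `find_all` returns without requesting the live page, here is a minimal sketch on a made-up HTML snippet. The markup and the company names are hypothetical; only the one-article-per-posting layout follows the observation above.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of a results page: one <article> per posting.
html = """
<html><body>
  <article><h2>Junior Data Scientist</h2><div>Company A</div></article>
  <article><h2>Data Analyst</h2><div>Company B</div></article>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns a list-like ResultSet of Tag objects.
listing = soup.find_all("article")
print(len(listing))

# Each Tag can be searched again, which keeps a lookup inside one posting.
first_title = listing[0].find("h2").text.strip()
print(first_title)
```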

Use .find_all() to save each job title into a list.

job_list = soup.find_all('h2', attrs={'class': 'styled__TitleWrapper-sc-7z1cau-1 dPEGKL'})
jobs = []
for j in job_list:
    job = j.text.strip()
    jobs.append(job)
print(jobs[0:3])

https://ithelp.ithome.com.tw/upload/images/20190925/20119709KORYYMr1Zb.jpg
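
Class names such as `styled__TitleWrapper-sc-7z1cau-1 dPEGKL` look machine-generated, so they may change when the site is rebuilt (an assumption; nothing in the page confirms it). One possible way to depend less on the exact string is to match only the readable part of the class with a regular expression, which `find_all` accepts for `class_`. The HTML below is a made-up stand-in for the real page.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical headings; the second one has a different generated suffix.
html = """
<h2 class="styled__TitleWrapper-sc-7z1cau-1 dPEGKL"> Junior Data Scientist </h2>
<h2 class="styled__TitleWrapper-sc-7z1cau-1 xYzAbC"> Data Analyst </h2>
<h2 class="unrelated-heading">Similar searches</h2>
"""
soup = BeautifulSoup(html, "html.parser")

# Matching the full class string only finds headings with exactly that value.
exact = soup.find_all("h2", attrs={"class": "styled__TitleWrapper-sc-7z1cau-1 dPEGKL"})

# A regular expression matches any heading with a class containing "TitleWrapper".
loose = soup.find_all("h2", class_=re.compile("TitleWrapper"))

jobs = [tag.text.strip() for tag in loose]
print(len(exact), jobs)
```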

Use .find_all() to save each company name into a list.

company_list = soup.find_all('div', attrs={'class': 'styled__CompanyName-iq4jvn-0 gakwWs'})
company = []
for c in company_list:
    comp = c.text.strip()
    company.append(comp)
print(company[0:3])

https://ithelp.ithome.com.tw/upload/images/20190925/20119709BzXOTxwa7P.jpg

Use .find_all() to save each location into a list.

location_list = soup.find_all('li', attrs={'class': 'job-element__body__location styled__IconElement-sc-1k0l2ot-1 jUROsL'})
location = []
for loc in location_list:
    locat = loc.text.strip()
    location.append(locat)
print(location[0:3])

https://ithelp.ithome.com.tw/upload/images/20190925/20119709QE8ZPSVW0e.jpg
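
The title, company, and location steps all follow the same pattern: find the tags, strip the text, collect it in a list. As a sketch, the pattern could be wrapped in a small helper. The function name `extract_texts` and the short class names in the sample HTML are hypothetical, chosen only for illustration.

```python
from bs4 import BeautifulSoup


def extract_texts(soup, tag_name, class_name):
    """Return the stripped text of every matching tag, in page order."""
    matches = soup.find_all(tag_name, attrs={"class": class_name})
    return [tag.text.strip() for tag in matches]


# Made-up markup standing in for the real results page.
html = """
<div class="company"> Company A </div>
<li class="location"> Berlin </li>
<div class="company"> Company B </div>
<li class="location"> Potsdam </li>
"""
soup = BeautifulSoup(html, "html.parser")

company = extract_texts(soup, "div", "company")
location = extract_texts(soup, "li", "location")
print(company, location)
```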

Use .find_all() to save each short description into a list.

a = soup.find_all('a', attrs={'class': 'styled__TextSnippetLink-sc-1xzea7b-1 styled__OneLineTextSnippetLink-sc-1xzea7b-2 bIjIzo'})
description = []
for i in a:
    des = i.find('span').text.strip()
    description.append(des)
print(description[0:3])
len(description) # Check that the number of postings is correct

https://ithelp.ithome.com.tw/upload/images/20190925/20119709N2ywSc3Q6S.jpg
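
Checking the length matters because the four lists are collected independently: if one posting lacks a field, that list ends up shorter and the rows no longer line up. A possible alternative, sketched here on made-up HTML with hypothetical class names, is to walk through each article and fill in None for anything missing.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the second posting has no description snippet.
html = """
<article>
  <h2 class="title">Junior Data Scientist</h2>
  <a class="snippet"><span>Build models with Python.</span></a>
</article>
<article>
  <h2 class="title">Data Analyst</h2>
</article>
"""
soup = BeautifulSoup(html, "html.parser")


def text_or_none(tag):
    """Return stripped text, or None when the tag was not found."""
    return tag.text.strip() if tag is not None else None


rows = []
for article in soup.find_all("article"):
    link = article.find("a", attrs={"class": "snippet"})
    # Guard the nested lookup so a missing link does not raise AttributeError.
    span = link.find("span") if link is not None else None
    rows.append({
        "Jobs": text_or_none(article.find("h2", attrs={"class": "title"})),
        "Description": text_or_none(span),
    })

print(rows)
```

A list of dictionaries like `rows` can be passed to `pd.DataFrame` directly, with None showing up as a missing value.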

Combine the lists created above into a dictionary, convert it to a DataFrame, and save it as a CSV file.

import pandas as pd
data = {'Jobs':jobs, 'Company':company, 'Location':location, 'Description':description}
df = pd.DataFrame(data)
df.head()

https://ithelp.ithome.com.tw/upload/images/20190925/20119709trPLXGGFRF.jpg

df.to_csv('df.csv')
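
By default, `to_csv` also writes the row index as an unnamed first column. If that column is not wanted, passing `index=False` is one option. The sketch below uses made-up rows and an in-memory buffer instead of a file, so it runs without touching the disk.

```python
import io
import pandas as pd

# Made-up rows standing in for the scraped lists.
data = {
    "Jobs": ["Junior Data Scientist", "Data Analyst"],
    "Company": ["Company A", "Company B"],
}
df = pd.DataFrame(data)

# Write to an in-memory buffer; index=False leaves out the row labels.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back gives the same columns, with no extra index column.
restored = pd.read_csv(io.StringIO(csv_text))
```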

The code is available on GitHub.

Please let me know if there is any mistake in this article. Thanks for reading.
