Today is my first try at scraping with BeautifulSoup, so the goal is to start small: download job postings from the German job site Stepstone, scraping only the first page of 100 listings. This code doesn't contain any page looping.
import requests
from bs4 import BeautifulSoup
# specify the URL
url = "https://www.stepstone.de/5/job-search-simple.html?stf=freeText&ns=1&companyid=0&sourceofthesearchfield=resultlistpage%3Ageneral&qs=%5B%5D&ke=Junior%20Data%20Scientist&ws=Berlin&ra=10&suid=b830ebdc-e1ed-43cf-931b-006b0ad341c5&li=100&of=0&action=per_page_changed"
resp = requests.get(url)
resp.encoding = 'utf-8' # set the text encoding to UTF-8 (this affects resp.text; BeautifulSoup below parses the raw resp.content)
# show the response status; code 200 means the page loaded just fine
print(resp.status_code)
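If you'd rather fail loudly than check the status code by eye, requests can also raise an exception on error responses; a small optional addition:
resp.raise_for_status() # raises requests.HTTPError for 4xx/5xx responses, so we never parse an error page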
# create a BeautifulSoup object
soup = BeautifulSoup(resp.content, 'html.parser')
After checking the page source, it seems that Stepstone wraps each job posting in an article tag. Print out the first one to have a look.
listing = soup.find_all('article')
print(listing[0])
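To confirm the page really returned the expected number of postings, we can also check the length of the list:
print(len(listing)) # should be 100, matching the li=100 parameter in the URL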
Use .find_all() to save the job titles into a list.
job_list = soup.find_all('h2', attrs={'class': 'styled__TitleWrapper-sc-7z1cau-1 dPEGKL'})
jobs = []
for j in job_list:
    job = j.text.strip()
    jobs.append(job)
print(jobs[0:3])
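As a side note, each of these extraction loops can be collapsed into a one-line list comprehension; the same pattern works for the company, location, and description loops below:
jobs = [j.text.strip() for j in job_list] # equivalent to the loop above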
Use .find_all() to save the company names into a list.
company_list = soup.find_all('div', attrs={'class': 'styled__CompanyName-iq4jvn-0 gakwWs'})
company = []
for c in company_list:
    comp = c.text.strip()
    company.append(comp)
print(company[0:3])
Use .find_all() to save the locations into a list.
location_list = soup.find_all('li', attrs={'class': 'job-element__body__location styled__IconElement-sc-1k0l2ot-1 jUROsL'})
location = []
for loc in location_list:
    locat = loc.text.strip()
    location.append(locat)
print(location[0:3])
Use .find_all() to save the short descriptions into a list. Each snippet's text sits inside a span within the link, so we grab it with .find('span').
description_list = soup.find_all('a', attrs={'class': 'styled__TextSnippetLink-sc-1xzea7b-1 styled__OneLineTextSnippetLink-sc-1xzea7b-2 bIjIzo'})
description = []
for d in description_list:
    des = d.find('span').text.strip()
    description.append(des)
print(description[0:3])
print(len(description)) # check that the number of postings is correct
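One caveat about this approach: class names like 'styled__TitleWrapper-sc-7z1cau-1 dPEGKL' are auto-generated by styled-components and may change whenever Stepstone redeploys, and four independent .find_all() calls can silently drift out of alignment if a posting is missing one field. A more defensive sketch would pull every field out of its article tag instead; the substring match below is an assumption based on the class names seen above, not Stepstone's documented markup:
rows = []
for art in soup.find_all('article'):
    title = art.find('h2') # assumes one title heading per article
    comp = art.find('div', class_=lambda c: bool(c) and 'CompanyName' in c) # match only the stable part of the class name
    rows.append({'Jobs': title.text.strip() if title else None,
                 'Company': comp.text.strip() if comp else None})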
Combine the lists we created above into a dictionary, turn it into a DataFrame, and then save it as a CSV file.
import pandas as pd
data = {'Jobs':jobs, 'Company':company, 'Location':location, 'Description':description}
df = pd.DataFrame(data)
print(df.head())
df.to_csv('df.csv') # pass index=False if you don't want the row index written as an extra column
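Although this post stops at the first page, the query string already hints at how paging could work: li=100 looks like the page size and of=0 like the result offset. Assuming of really is the offset, a loop along these lines could walk through later pages (an untested sketch, not part of the original code):
all_articles = []
for offset in range(0, 300, 100): # first three pages, assuming 100 results per page
    page_url = url.replace('&of=0&', '&of=' + str(offset) + '&') # naive substitution of the offset parameter
    page_soup = BeautifulSoup(requests.get(page_url).content, 'html.parser')
    all_articles.extend(page_soup.find_all('article'))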
The code is available on GitHub.
Please let me know if there's any mistake in this article. Thanks for reading.
References:
[1] Tutorial: Python Web Scraping Using BeautifulSoup
[2] Stepstone