iT邦幫忙

11th iThome Ironman Contest

DAY 21
AI & Data

Hands on Data Cleaning and Scraping series, part 21

Day21 Airbnb in Berlin 2/5: listings overview

The data (listings.csv) was downloaded from Inside Airbnb and was last updated on 11/07/2019.
Today's article gives a first, brief analysis of the Airbnb listings in Berlin, Germany.

First, we import the packages we need and read in the data we are about to analyse.

# Import the packages we need
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly as py # open-source graphing library for interactive charts

import warnings
warnings.filterwarnings("ignore") # suppress warning messages

Read in the listing file

listing = pd.read_csv('airbnb/listings.csv') # read in the listing file
print('There are', listing.id.nunique(), 'listings in the listing data.')
listing.info() # show column types and non-null counts
listing.head(3) # show the first three rows

https://ithelp.ithome.com.tw/upload/images/20190922/20119709jCa5fB8B7H.jpg
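
As a quick sanity check on a freshly loaded table, it can help to compare the row count with the number of unique IDs and to look at the share of missing values per column. The sketch below uses a made-up four-row table; the values are illustrative only and are not taken from the Berlin data.

```python
import pandas as pd

# Hypothetical stand-in for the listings table (values are invented).
toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "review_scores_rating": [95.0, None, None, 80.0],
})

n_rows = len(toy)                # total number of rows
n_unique = toy["id"].nunique()   # number of distinct listing IDs
missing_share = toy.isna().mean()  # fraction of missing values per column

print(n_rows, n_unique)
print(missing_share)
```

If the two counts differ, some IDs appear more than once and may need de-duplicating before counting listings.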

Print out the ten areas with the most listings

# Count listings per area, sort in descending order, and keep the top 10 in grouped_df
grouped_df = listing.groupby('neighbourhood_cleansed').count()[['id']].sort_values('id', ascending=False).head(10)
grouped_df.plot(kind='bar', title='Areas with the top 10 listing numbers')

https://ithelp.ithome.com.tw/upload/images/20190922/20119709Ii4BIeujNu.png

grouped_df

https://ithelp.ithome.com.tw/upload/images/20190922/201197094uhQJIjY5W.jpg
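
The same kind of ranking can also be obtained with `value_counts()`, which counts and sorts in one step. This is a minimal sketch on a toy table; the area names "A", "B", and "C" are placeholders, not real Berlin neighbourhoods.

```python
import pandas as pd

# Toy stand-in for the listings table; the neighbourhood names are made up.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "neighbourhood_cleansed": ["A", "B", "A", "C", "A", "B"],
})

# value_counts() counts rows per area and sorts them in descending order.
top_areas = toy["neighbourhood_cleansed"].value_counts().head(2)
print(top_areas)
```

Unlike the `groupby(...).count()` form, this returns a Series rather than a one-column DataFrame, which is usually enough for a bar chart.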

Plot the review scores

# Drop missing values, then plot the distribution of review scores
# (sns.distplot is deprecated since seaborn 0.11; sns.histplot(..., kde=True) is the newer equivalent)
plt.figure(figsize=(12 , 6))
plt.title('The review scores are pretty high in general.', fontsize=15)
sns.distplot(listing.review_scores_rating.dropna(), rug=True)
sns.despine()

https://ithelp.ithome.com.tw/upload/images/20190922/20119709VnpstAkE5S.png
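
To put a number on "pretty high", one option is to compute the median score and the share of rated listings at or above some threshold. The sketch below assumes a 0 to 100 rating scale and a threshold of 90, both chosen only for illustration, and uses invented scores.

```python
import pandas as pd

# Invented review scores; None marks listings without any rating.
scores = pd.Series([100, 98, 95, None, 80, 92, None, 60], dtype=float)

valid = scores.dropna()                  # keep rated listings only
median_score = valid.median()            # middle value of the rated listings
share_90_plus = (valid >= 90).mean()     # fraction of rated listings scoring 90 or more

print(median_score, round(share_90_plus, 3))
```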

Check out the price distribution of the listings

# Strip the thousands separator and the dollar sign as literal characters, then convert to float
listing['price'] = listing['price'].astype(str).str.replace(',', '', regex=False).str.replace('$', '', regex=False).astype(float)

print(listing.price.describe()) # summary statistics to get a feel for the prices
plt.figure(figsize = (12, 6))
plt.title('Listing price', fontsize=15)
sns.distplot(listing.price.dropna(), rug=True)
sns.despine()

https://ithelp.ithome.com.tw/upload/images/20190922/20119709H4VuctEmGn.jpg
https://ithelp.ithome.com.tw/upload/images/20190922/20119709qvSeeddavm.png
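
The price column arrives as text such as "$1,200.00", so it has to be cleaned before any arithmetic. The sketch below shows one way to do this on a few invented values; `clean_price` is a hypothetical helper name, and `regex=False` is passed so that `$` is treated as a literal character rather than as the regex end-of-string anchor.

```python
import pandas as pd

def clean_price(prices: pd.Series) -> pd.Series:
    """Turn strings like '$1,200.00' into floats; unparseable values become NaN."""
    cleaned = (
        prices.str.replace(",", "", regex=False)   # drop thousands separators
              .str.replace("$", "", regex=False)   # drop the currency symbol
    )
    return pd.to_numeric(cleaned, errors="coerce")

# Invented example values, including one missing price.
raw_prices = pd.Series(["$60.00", "$1,200.00", None])
prices = clean_price(raw_prices)
print(prices)
```

Using `pd.to_numeric(..., errors="coerce")` instead of `astype(float)` is a design choice here: it keeps the conversion from failing on an unexpected string.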

Plot without outliers

# Keep only listings priced below 300 to exclude outliers
plt.figure(figsize=(12 , 6))
plt.title('The normal housing rate', fontsize=15)
sns.distplot(listing[listing.price<300].price.dropna(), rug=True)
sns.despine()

https://ithelp.ithome.com.tw/upload/images/20190922/20119709RcWw5f4trU.png
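
The cut-off of 300 used above is a fixed choice. A common alternative, swapped in here purely for illustration and not the method used in this article, is Tukey's rule: treat anything above Q3 + 1.5 × IQR as an outlier. The prices below are invented.

```python
import pandas as pd

# Invented prices with one obvious outlier.
prices = pd.Series([20, 35, 40, 45, 50, 60, 75, 90, 120, 5000], dtype=float)

q1, q3 = prices.quantile([0.25, 0.75])   # first and third quartiles
iqr = q3 - q1                            # interquartile range
upper = q3 + 1.5 * iqr                   # Tukey's upper fence

trimmed = prices[prices <= upper]        # keep values at or below the fence
print(upper, len(trimmed))
```

A data-driven fence like this adapts to the distribution, while a fixed threshold is easier to explain; which one suits the Berlin data better is a judgment call.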

Plot the listings in a reasonable price range

# Histogram of listings priced between 30 and 100 (both bounds excluded)
plt.figure(figsize=(12,6))
listing.loc[(listing.price<100)&(listing.price>30)].price.hist(bins=20)
plt.ylabel('Count')
plt.xlabel('Listing price')
plt.title('Listings with Acceptable Price')

https://ithelp.ithome.com.tw/upload/images/20190922/201197098lVdd3gaEF.png
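
A range filter like the one above can also be written with `Series.between`. This sketch assumes pandas 1.3 or later, where the `inclusive` argument accepts the string "neither"; the prices are invented.

```python
import pandas as pd

# Invented prices around the band edges.
prices = pd.Series([25, 30, 45, 80, 100, 150])

# inclusive="neither" excludes both bounds, matching (price > 30) & (price < 100).
in_band = prices[prices.between(30, 100, inclusive="neither")]
print(in_band.tolist())
```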

Plot the price range of different areas

drop_outlier_price_condition = listing.loc[(listing.price<=100)&(listing.price>40)]
sort_price = drop_outlier_price_condition\
        .groupby('neighbourhood_cleansed')['price']\
        .median()\
        .sort_values(ascending=False)\
        .index
# Berlin has 133 areas; because there are so many, split them across four plots
plt.figure(figsize=(18,6))   
plt.title('Price range of different areas 1/4', fontsize=16)
sns.boxplot(y='price', x='neighbourhood_cleansed', data=drop_outlier_price_condition, order=sort_price[:34])
plt.xticks(rotation=-90)

plt.figure(figsize=(18,6))   
plt.title('Price range of different areas 2/4', fontsize=16)
sns.boxplot(y='price', x='neighbourhood_cleansed', data=drop_outlier_price_condition, order=sort_price[34:67])
plt.xticks(rotation=-90)

plt.figure(figsize=(18,6))   
plt.title('Price range of different areas 3/4', fontsize=16)
sns.boxplot(y='price', x='neighbourhood_cleansed', data=drop_outlier_price_condition, order=sort_price[67:100])
plt.xticks(rotation=-90)

plt.figure(figsize=(18,6))   
plt.title('Price range of different areas 4/4', fontsize=16)
sns.boxplot(y='price', x='neighbourhood_cleansed', data=drop_outlier_price_condition, order=sort_price[100:])
plt.xticks(rotation=-90)

https://ithelp.ithome.com.tw/upload/images/20190922/20119709dXJ0HHzEOH.png
https://ithelp.ithome.com.tw/upload/images/20190922/20119709lHa0NBI8mF.png
https://ithelp.ithome.com.tw/upload/images/20190922/20119709eW9GVd5R8q.png
https://ithelp.ithome.com.tw/upload/images/20190922/20119709gibbXASf2m.png
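
The four near-identical plotting blocks above could be driven by a loop once the sorted area names are split into chunks. The sketch below only shows the splitting step with `np.array_split`, using hypothetical area names in place of the real sorted index; the plotting call itself is left as a comment.

```python
import numpy as np

# Hypothetical area names standing in for the 133 areas, already sorted by median price.
sorted_areas = np.array([f"area_{i:03d}" for i in range(133)])

# array_split accepts a length that is not divisible by 4; the first chunk gets the extra element.
chunks = np.array_split(sorted_areas, 4)
sizes = [len(chunk) for chunk in chunks]
print(sizes)

for part, chunk in enumerate(chunks, start=1):
    title = f"Price range of different areas {part}/{len(chunks)}"
    # In the article's setting, each chunk would be passed as order=chunk to sns.boxplot.
    print(title, "->", chunk[0], "to", chunk[-1])
```

For 133 names this yields chunks of 34, 33, 33, and 33, the same split as the hand-written slices.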

The relation between property type and price

def boxplot_to_price(category_name):
    # Order the categories by median price, highest first, then draw one box per category
    sort_price = drop_outlier_price_condition\
                .groupby(category_name)['price']\
                .median()\
                .sort_values(ascending=False)\
                .index
    plt.figure(figsize=(18,6))
    plt.title(category_name +' effects', fontsize=16)
    sns.boxplot(y='price', x=category_name, data=drop_outlier_price_condition, order=sort_price)
    plt.xticks(rotation=45)

boxplot_to_price('property_type')

https://ithelp.ithome.com.tw/upload/images/20190922/20119709jWnYajQ29Y.png

drop_outlier_price_condition.pivot(columns='property_type', values='price').plot(kind='box')
plt.xticks(rotation=-90)

https://ithelp.ithome.com.tw/upload/images/20190922/20119709jLq20e50dy.png
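
The `pivot` call above works by spreading the price column into one column per property type, so that a box plot can draw one box per column. A minimal sketch on three invented rows shows the resulting wide shape; cells are NaN wherever a row belongs to a different property type.

```python
import pandas as pd

# Invented listings with two property types.
toy = pd.DataFrame({
    "property_type": ["Apartment", "House", "Apartment"],
    "price": [50.0, 80.0, 60.0],
})

# One column per property type; the original row index is kept.
wide = toy.pivot(columns="property_type", values="price")
print(wide)
```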

The 20 most common amenities

# Remove the braces (regex character class) and the double quotes from the amenities string
listing['amenities'] = listing.amenities.str.replace('[{}]', '', regex=True).str.replace('"', '', regex=False)
listing.amenities.head()
# Split each listing's amenities on commas and count how often each one appears
all_item_ls = np.concatenate(listing.amenities.map(lambda am:am.split(',')))
Top20_item = pd.Series(all_item_ls).value_counts().head(20)
plt.figure(figsize=(18 , 6))
Top20_item.plot(kind='bar')
plt.xticks(rotation=45)

https://ithelp.ithome.com.tw/upload/images/20190922/20119709WiVHTabd5V.png
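
The amenities column stores each listing's amenities as one brace-wrapped, comma-separated string. The sketch below shows one way to parse and count such strings with `str.split` and `explode` (available since pandas 0.25); the three sample strings are invented and only mimic the format.

```python
import pandas as pd

# Invented amenity strings in the brace-wrapped format.
raw = pd.Series(['{TV,Wifi,"Cable TV"}', '{Wifi,Kitchen}', '{}'])

# Remove braces and double quotes in one pass (regex character class).
cleaned = raw.str.replace('[{}"]', '', regex=True)

# One row per amenity, dropping the empty string left by listings without amenities.
items = cleaned.str.split(',').explode()
items = items[items != '']

counts = items.value_counts()
print(counts)
```

Filtering out the empty string keeps listings with no amenities from showing up as a category of their own.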

The 20 amenities with the highest average listing price

# Split each listing's amenities once so that amenities are matched exactly, not as substrings
amenity_lists = listing['amenities'].map(lambda amns: amns.split(","))
amenities = np.unique(np.concatenate(amenity_lists.tolist()))
# Mean price of the listings that offer each amenity
amenity_prices = [(amn, listing[amenity_lists.map(lambda items: amn in items)]['price'].mean()) for amn in amenities if amn != ""]
amenity_srs = pd.Series(data=[a[1] for a in amenity_prices], index=[a[0] for a in amenity_prices])
plt.figure(figsize=(16,8))
amenity_srs.sort_values(ascending=False)[:20].plot(kind='bar') # keep the 20 highest mean prices
ax = plt.gca()
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right', fontsize=12)
plt.show()

https://ithelp.ithome.com.tw/upload/images/20190922/20119709wIprViAJd2.png
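
One caveat when computing a mean price per amenity: a Python `in` test on the raw comma-separated string matches substrings, so "TV" would also match "Cable TV". The sketch below contrasts that with an exact match obtained by exploding the amenities into one row each; the table is invented.

```python
import pandas as pd

# Invented listings; amenities are already stripped of braces and quotes.
toy = pd.DataFrame({
    "amenities": ["TV,Wifi", "Cable TV,Wifi", "Wifi"],
    "price": [50.0, 90.0, 40.0],
})

# Substring test on the raw string: "TV" also matches the "Cable TV" listing.
substring_mean = toy[toy["amenities"].map(lambda s: "TV" in s)]["price"].mean()

# Exact match: one row per (listing, amenity), then group by amenity.
exploded = toy.assign(amenity=toy["amenities"].str.split(",")).explode("amenity")
exact_mean = exploded.groupby("amenity")["price"].mean()

print(substring_mean)
print(exact_mean)
```

Here the substring test averages the "TV" and "Cable TV" listings together, while the exploded version keeps them apart.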

The relation between the number of beds and price

plt.figure(figsize=(18,6))
sns.boxplot(y='price', x='beds', data=drop_outlier_price_condition)
plt.show()

https://ithelp.ithome.com.tw/upload/images/20190922/201197096qV0lDQwnw.png
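
The centre line of each box in a box plot is the median, so the same comparison can be read off numerically with a `groupby`. A minimal sketch on invented rows:

```python
import pandas as pd

# Invented listings: number of beds and nightly price.
toy = pd.DataFrame({
    "beds": [1, 1, 2, 2, 3],
    "price": [45.0, 55.0, 70.0, 80.0, 95.0],
})

# Median price for each bed count, in ascending order of beds.
median_by_beds = toy.groupby("beds")["price"].median()
print(median_by_beds)
```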

Today we briefly walked through the Airbnb listing data for Berlin. In the following articles, we will dig a little deeper into analysing the listings.

Code Reference

The code is available on GitHub.

Please let me know if there is any mistake in this article. Thanks for reading.

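References:

[1] Inside Airbnb

[2] 利用Airbnb來更了解居住城市,以臺北為例 Python實作(上) (Using Airbnb to get to know the city you live in, with Taipei as an example: Python implementation, part 1)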

[3] Airbnb listings in Berlin

