2018 iT 邦幫忙 Ironman Contest

DAY 23

Data Technology

Data Analysis with Python series, part 23

[Day23] Parsing web pages with Beautiful Soup!

2018 Ironman Contest bs4 beautifulsoup

plusone

Team i.m.a.c_wishing_well

2018-01-11 17:27:00


Hello! Today is day 23 of the Ironman Contest!
Today I'll introduce Beautiful Soup, a Python package!

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

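What is Beautiful Soup? Simply put, it is a tool for pulling data out of markup documents such as HTML or XML. It works with whichever parser you are used to, so you can get scraping work done in a short time!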

Let's get started!

HTML

First, we define a string that holds an HTML document:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

bs4

Import BeautifulSoup from bs4 and parse the document:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

Here we use html.parser, the parser that ships with Python's standard library.
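
As a small sketch of what the parser argument does, the snippet below feeds a fragment with an unclosed tag to html.parser and shows that the resulting tree can still be navigated. The `fragment` and `demo` names are made up for illustration and are not part of the original example. Other parsers can be passed by name if they are installed, which is an assumption about your environment.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration; note the <p> tag is never closed.
fragment = '<p class="title"><b>Hello</b>'

# 'html.parser' ships with Python, so no extra install is needed.
demo = BeautifulSoup(fragment, "html.parser")

print(type(demo).__name__)   # BeautifulSoup
print(demo.p.b.string)       # Hello
print(demo.p["class"])       # ['title']
```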

prettify()

Use the prettify() method to format the soup object with indentation. soup is the result of parsing html_doc, and finally we print it:

print(soup.prettify())

The prettified result:
[Image: prettified output of html_doc]
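
Here is a minimal sketch of what prettify() does to a one-line document. The `compact` markup is my own small example, not the html_doc above. The point is only that prettify() returns a new string with one tag per line and indentation, so I check the line count rather than the exact layout.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document, used only for this illustration.
compact = "<html><body><p>Hello <b>soup</b></p></body></html>"
demo = BeautifulSoup(compact, "html.parser")

# prettify() returns a str; the soup object itself is not modified.
pretty = demo.prettify()
print(pretty)

print(len(compact.splitlines()))      # 1
print(len(pretty.splitlines()) > 1)   # True
```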

navigate that data structure

Next, let's start navigating the parsed structure. HTML is made of tags, so how do we look at what is inside a tag?

soup.title

This gives:

<title>The Dormouse's story</title>

Get the head:

soup.head

This gives:

<head><title>The Dormouse's story</title></head>

You can see there is another level nested inside, so we can also reach the title this way:

soup.head.title

If you only want the string content:

soup.title.string

Result:

"The Dormouse's story"

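To round out the navigation above, here is a short sketch using a trimmed-down document of my own (the `doc` string is an assumption, shortened from the example). It shows .name and .parent, and one thing worth knowing: .string gives None when a tag has more than one child.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the example document.
doc = """<html><head><title>The Dormouse's story</title></head>
<body><p class="story">Once upon a time <a href="http://example.com/elsie">Elsie</a> lived.</p></body></html>"""
demo = BeautifulSoup(doc, "html.parser")

print(demo.title.name)          # title
print(demo.title.parent.name)   # head
print(demo.title.string)        # The Dormouse's story

# <p> contains text, an <a> tag, and more text, so .string is None.
print(demo.p.string)            # None
print(demo.p.a.string)          # Elsie
```
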
find_all()

Find all the <p> tags:

soup.find_all('p')

Result:

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]
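
find_all() can also narrow the search. The sketch below, on a shortened copy of the document (the `doc` string and variable names are mine), filters by CSS class with the class_ keyword, caps the number of results with limit, and shows that a tag with no matches gives an empty list rather than an error.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical copy of the example document.
doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
<p class="story">...</p>
"""
demo = BeautifulSoup(doc, "html.parser")

# class_ (with a trailing underscore) filters on the class attribute.
story_paragraphs = demo.find_all("p", class_="story")
# limit stops the search after the given number of matches.
first_two_links = demo.find_all("a", limit=2)
# No <table> in the document, so the result is empty.
missing = demo.find_all("table")

print(len(story_paragraphs))                 # 2
print([a["id"] for a in first_two_links])    # ['link1', 'link2']
print(len(missing))                          # 0
```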

get()

Find all the <a> hyperlink tags. Each <a> tag has an href attribute, and get() retrieves its link target:

for link in soup.find_all('a'):
    print(link.get('href'))

Result:

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie

So to list the class names of all the <p> tags, just do:

for p in soup.find_all('p'):
    print(p.get('class'))

Result:

['title']
['story']
['story']
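
A quick sketch of how get() compares with square-bracket access, using a single made-up <a> tag (the extra "big" class is my own addition for illustration). get() returns None, or a default you supply, when the attribute is missing, while tag["name"] raises KeyError. It also shows why the class values above are printed as lists: class can hold several values.

```python
from bs4 import BeautifulSoup

# Hypothetical single-tag document; "big" is added only to show multiple classes.
demo = BeautifulSoup(
    '<a href="http://example.com/elsie" class="sister big" id="link1">Elsie</a>',
    "html.parser",
)
link = demo.a

print(link.get("href"))           # http://example.com/elsie
print(link["href"])               # http://example.com/elsie
print(link.get("class"))          # ['sister', 'big']

# Missing attribute: get() returns None or the given default.
print(link.get("title"))          # None
print(link.get("title", "n/a"))   # n/a

# Square brackets raise KeyError for a missing attribute.
try:
    link["title"]
except KeyError:
    print("no title attribute")
```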

find(id)

Look up an element by its id:

soup.find(id="link3")

Result:

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
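
One thing to watch with find(): when nothing matches, it returns None instead of a tag. The sketch below uses a two-link snippet of my own, and "link9" is an id I invented that does not exist, to show a simple guard before reading attributes.

```python
from bs4 import BeautifulSoup

# Hypothetical two-link snippet for illustration.
doc = """
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
"""
demo = BeautifulSoup(doc, "html.parser")

found = demo.find(id="link3")
missing = demo.find(id="link9")   # no such id in this snippet

print(found.string)   # Tillie
print(missing)        # None

# Guard before touching attributes, since None has no .get() method.
href = missing.get("href") if missing is not None else None
print(href)           # None
```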

get_text()

Extract the text content:

print(soup.get_text())

Result:

The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
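
get_text() also accepts a separator and a strip flag, which helps when the raw text has a lot of stray whitespace. A minimal sketch on a tiny snippet of my own (not the html_doc above):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration.
demo = BeautifulSoup("<p>Hello, <b>world</b>!</p>", "html.parser")

# Default: the text pieces are joined as they appear.
plain = demo.get_text()
# separator goes between pieces; strip=True trims whitespace from each piece.
piped = demo.get_text("|", strip=True)

print(plain)   # Hello, world!
print(piped)   # Hello,|world|!
```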

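OK, those are the basic operations of Beautiful Soup!
Everything I covered here follows the examples in the official documentation.
If you're interested, see the Beautiful Soup Documentation for more details on how to use it!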