Day 2 - Building a Web Crawler in Python

2019 iT 邦幫忙 Ironman Contest

DAY 2

AI & Data

Part 2 of the series "A Few Things to Learn in the Big Data Era"

queenawu

2018-10-17 23:46:56

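Web pages often contain thousands upon thousands of data points. How can we collect the data we need?
Because the data lives in web pages, we need a way to parse HTML tags. In Python, besides the built-in urllib library, there are many packages and frameworks that make this easier, such as scrapy and BeautifulSoup.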

The urllib library

urllib provides four modules: urllib.request, urllib.parse, urllib.error, and urllib.robotparser.

1. urllib.request
Used to open and read a given URL.

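import urllib.request

# Open the URL and print the first 100 bytes of the response body.
# The with statement closes the connection automatically.
with urllib.request.urlopen('http://yahoo.com.tw') as response:
    print(response.read(100).decode('utf-8', errors='replace'))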

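2. urllib.parse
Used to split URLs into their components and to encode query strings or form data.

import urllib.request
import urllib.parse

# URL-encode the form fields and convert them to bytes.
# Passing data to urlopen makes it send a POST request.
data = urllib.parse.urlencode({'name': 1, 'time': 2, 'price': 3})
data = data.encode('ascii')
with urllib.request.urlopen("https://www.bloomberg.com/quote/SPX:IND", data) as f:
    print(f.read().decode('utf-8'))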
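
The list above names four modules but only demonstrates two. The following is a minimal sketch of how the other two, urllib.error and urllib.robotparser, might be used alongside urllib.request. The helper names fetch_text and build_robot_rules, the crawler name 'my-crawler', and the sample robots.txt content are all hypothetical and exist only for illustration.

```python
import urllib.error
import urllib.request
import urllib.robotparser


def fetch_text(url, max_bytes=100):
    """Return up to max_bytes of the response decoded as UTF-8, or None on failure."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.read(max_bytes).decode('utf-8', errors='replace')
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        # HTTPError is a subclass of URLError, so it must be caught first.
        print('HTTP error', err.code, 'for', url)
    except urllib.error.URLError as err:
        # No usable response: DNS failure, refused connection, missing file, ...
        print('Could not open', url, ':', err.reason)
    return None


def build_robot_rules(robots_txt):
    """Parse robots.txt content that has already been obtained as text."""
    rules = urllib.robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


# Hypothetical robots.txt content, used only for illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

rules = build_robot_rules(SAMPLE_ROBOTS)
allowed_index = rules.can_fetch('my-crawler', 'https://example.com/index.html')
allowed_private = rules.can_fetch('my-crawler', 'https://example.com/private/a')
print(allowed_index, allowed_private)
```

In a real crawler, one would probably call can_fetch before fetch_text and skip any URL the site disallows.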

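BeautifulSoup

BeautifulSoup is currently among the most popular and convenient tools for web scraping. Because BeautifulSoup 3 is no longer maintained, projects that need it should install BeautifulSoup 4.
Install: pip install beautifulsoup4
Import: from bs4 import BeautifulSoup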

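import urllib.request
from bs4 import BeautifulSoup

# Download the page with urllib, then parse it with Python's built-in HTML parser.
with urllib.request.urlopen('https://tw.yahoo.com/') as response:
    doc = response.read()
soup = BeautifulSoup(doc, 'html.parser')
print(soup.prettify())  # the whole document, re-indented
print(soup.title)       # the <title> element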
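
Beyond printing the whole document, a parsed page is usually searched for specific tags. The sketch below shows one way to do that with find_all and find. It parses an inline HTML string instead of a live page so that it runs without network access; SAMPLE_HTML and its contents are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration; a real page would be downloaded first.
SAMPLE_HTML = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <a href="/news">News</a>
    <a href="/sports">Sports</a>
    <p class="note">Updated daily</p>
  </body>
</html>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Text inside the <title> tag.
page_title = soup.title.string

# Every <a> tag, as (link text, href attribute) pairs.
links = [(a.get_text(strip=True), a['href']) for a in soup.find_all('a')]

# The first <p> tag whose class is "note".
note = soup.find('p', class_='note').get_text()

print(page_title)
print(links)
print(note)
```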

Official BeautifulSoup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

scrapy

Install: pip install scrapy
Import: import scrapy
After installing scrapy, you can create a new project with scrapy startproject [project name].

import scrapy


class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://tw.yahoo.com/']

    def parse(self, response):
        # Yield one item per post title matched by the CSS selector.
        for title in response.css('.post-header>h2'):
            yield {'title': title.css('a ::text').extract_first()}