Day_21 nokogiri ?

12th iThome Ironman Contest

DAY 21

Self-Challenge Group

30 Questions from a Ruby on Rails Beginner! Part 21 of the series

12th Ironman Contest, ruby on rails, nokogiri, rubygem

Yuan

Team: 貓肥加潤五倍祝福

2020-09-28 15:15:42

Hi, everyone! Hello, hello, hello, I'm Yuan (阿圓). As usual, here is today's one piece:

(nokogiri! It's the Japanese word for "saw"!)

Today we continue the tour of handy gems, and today's pick is nokogiri!
It's a gem that helps you "cut" information out of a web page!

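Usually, when we build a site that pulls in outside data, we call someone else's API, get back a chunk of JSON, pick out the values we want, and re-arrange them on our own page. With nokogiri the approach is different: we take the page's entire HTML structure and use a parser to reach the individual tags inside it!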

Installation

gem install nokogiri
or, in your project's Gemfile:
gem 'nokogiri', '~> 1.6', '>= 1.6.8'
and then run bundle install

Next, in the .rb file where you want to use nokogiri, add:

require 'nokogiri'
require 'open-uri'

Now you can start using nokogiri!

Usage

Getting the document (everything inside the `<html>` tag of the page source)

There are several ways:

# From an HTML or XML file
html_doc = Nokogiri::HTML(File.open("test.html"))
xml_doc  = Nokogiri::XML(File.open("test.xml"))
# From a URL (open-uri provides URI.open; a bare open(url) no longer works on Ruby 3.0+)
url = 'http://www.google.com/search?q=sparklemotion'
doc = Nokogiri::HTML(URI.open(url))

Selecting specific parts

As an example, first create a new Ruby file with the following contents:

require 'nokogiri'
require 'open-uri'

htmlData = "
<html>
	<title> This is a simple html </title>
	<body id='story_body'>
		<h2> this is h2 in story_body </h2>
	</body>
	<h1> test h1-1 </h1>
	<h1> test h1-2 </h1>
	<h3>
		<img src = 'goodPic-1.jpg' >
		<a href = 'www.google.com'> google web site </a>
		<img src = 'goodPic-2.jpg' > 
		<a href = 'www.yahoo.com'> yahoo web site </a>
	</h3>
	<div class= 'div_1'>
		<h2> this is h1 in div_1 </h2>
	</div>
</html>
"
doc = Nokogiri::HTML( htmlData )

Next, let's work with the doc that nokogiri gave us:

puts doc.xpath("//h1")
# Selects every h1 tag; the result is a NodeSet (an array-like collection):
# <h1> test h1-1 </h1>
# <h1> test h1-2 </h1>

puts doc.xpath('//h3/a')
# Selects every a directly under an h3; also returned as a NodeSet:
# <a href="www.google.com"> google web site </a>
# <a href="www.yahoo.com"> yahoo web site </a>

puts doc.xpath('//h3/a').text
# google web site  yahoo web site 

# You can also read a tag's attribute values
puts doc.xpath("//h3/a")[0]['href'] # www.google.com
puts doc.xpath("//h3/img")[0]['src'] # goodPic-1.jpg

# You can also combine @ with // to select attributes anywhere in the document
puts doc.xpath("//@href")
# www.google.com
# www.yahoo.com

# Find a tag with a specific id or class
puts doc.xpath("//div[@class='div_1']")
# <div class="div_1">
#     <h2> this is h1 in div_1 </h2>
# </div>

Once you can get at the pieces you "cut" out of a website, you can re-arrange them on a page of your own.
With nokogiri, that becomes very easy.

Thanks for reading this far! If you have any suggestions, please don't hesitate to share them. See you tomorrow!