Day 7: Case Study 1-1: Let's Parse! Parsing More of the RSS Content

11th Ironman Contest

peter279k

2019-09-22 23:56:06

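Introduction

Yesterday we learned how to get the "message title", but that alone is not enough. I think we need all of the following to capture the important information in each message.

"Content"
"Message title"
"Message publication time"
"Message link"
"Publishing unit"

These correspond to the following tags in the RSS content: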

description
title
pubDate
link
author
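
To make the mapping concrete, here is a minimal sketch that reads those five tags from a single item. The sample feed is made up for illustration (it is not the real feed used in this series), and the sketch uses PHP's built-in SimpleXML instead of the DomCrawler used in this series so that it runs without any Composer packages.

```php
<?php

// Hypothetical sample feed; the real feed's markup may differ.
$sampleRss = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Sample channel</title>
    <link>http://example.com</link>
    <description>Channel description</description>
    <item>
      <title>Sample announcement</title>
      <link>http://example.com/p/1.php</link>
      <description>&lt;p&gt;Hello&lt;/p&gt;</description>
      <pubDate>2019-09-22 13:48:05</pubDate>
      <author>Sample Office</author>
    </item>
  </channel>
</rss>
XML;

$rss = simplexml_load_string($sampleRss);
$item = $rss->channel->item[0];

// Map each piece of information we want to its RSS tag.
$fields = [
    'description' => (string) $item->description, // content
    'title'       => (string) $item->title,       // message title
    'pubDate'     => (string) $item->pubDate,     // publication time
    'link'        => (string) $item->link,        // message link
    'author'      => (string) $item->author,      // publishing unit
];

print_r($fields);
```

Note that the description comes back with its HTML markup intact, which is the same thing we will run into with the real feed.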

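Implementing the Extraction

First, we run the crawler development Docker image we built a few days ago in the background and name the container "php_crawler".

In case the name "php_crawler" is already taken, we can remove the existing container with that name first using the command below. If that container is still running, docker rm needs the -f flag.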

docker rm php_crawler

docker run --name=php_crawler -d -it php_crawler bash

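Next, use docker ps to check that our crawler development environment is running in the background. If it is, you will see output similar to the following.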

$ docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
1eb1e04767b0        php_crawler         "bash"              3 seconds ago       Up 1 second                             php_crawler

Next, open "lab1-1.php" in your favorite editor and change the code to the following:

<?php

require_once __DIR__ . '/vendor/autoload.php';

use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;

$latestNews = 'https://www.nttu.edu.tw/p/503-1000-1009.php';
$client = new Client();
$response = $client->request('GET', $latestNews);

$latestNewsString = (string)$response->getBody();

$titles = [];
$descriptions = [];
$pubDates = [];
$links = [];
$authors = [];

$crawler = new Crawler($latestNewsString);

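// Collect the text of every <title> element. reduce() calls the closure once
// per matched node; the closure never returns false, so no node is dropped.
$crawler
    ->filter('title')
    ->reduce(function (Crawler $node, $i) {
        global $titles;
        $titles[] = $node->text();
    });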

$crawler
    ->filter('description')
    ->reduce(function (Crawler $node, $i) {
        global $descriptions;
        $descriptions[] = $node->text();
    });

$crawler
    ->filter('pubDate')
    ->reduce(function (Crawler $node, $i) {
        global $pubDates;
        $pubDates[] = $node->text();
    });

$crawler
    ->filter('link')
    ->reduce(function (Crawler $node, $i) {
        global $links;
        $links[] = $node->text();
    });

$crawler
    ->filter('author')
    ->reduce(function (Crawler $node, $i) {
        global $authors;
        $authors[] = $node->text();
    });

var_dump($descriptions);
var_dump($pubDates);
var_dump($links);
var_dump($authors);
var_dump($titles);

Next, copy lab1-1.php into the running Docker container

docker cp lab1-1.php php_crawler:/root/

Then enter the Docker container named "php_crawler"

$ docker exec -it php_crawler bash
root@1eb1e04767b0:~#

Next, run "lab1-1.php" (for example, with php lab1-1.php) and you will see the following output:

array(11) {
  [0]=>
  string(45) "僅秘書室與人事室能發布置頂公告"
  [1]=>
  string(61) "本獎學金與其它獎學金不衝突,無重覆請領疑慮"
  [2]=>
  string(1165) "<p>108學年度第一學期忠孝樓申請搬遷回一宿公告</p>
<table border="0" cellpadding="0" cellspacing="0" style="border-collapse:collapse;width:328px;" width="327">
<colgroup>
<col style="width:85px;" />
<col style="width:243px;" />
</colgroup>
<tbody>
<tr height="68" style="height:68px;">
<td height="68" style="height:68px;width:85px;">說明</td>
<td style="border-left:none;width:243px;">目前一宿因休學、退學、轉學、放棄住宿而釋出床位，因此開放忠孝樓學生申請搬回一宿</td>
</tr>
<tr height="29" style="height:29px;">
<td height="29" style="height:29px;border-top:none;width:85px;">名額</td>
<td style="border-top:none;border-left:none;width:243px;">女生16床男生15床</td>
</tr>
<tr height="22" style="height:22px;">
<td height="22" style="height:22px;border-top:none;width:85px;">房型</td>
<td style="border-top:none;border-left:none;width:243px;">4人房</td>
</tr>
<tr height="44" style="height:44px;">
<td height="44" style="height:44px;border-top:none;width:85px;">申請時間</td>
<td style="border-top:none;border-left:none;width:243px;">9月16日中午12時00分~9...</td></tr></tbody></table>"
  [3]=>
  string(1731) "<div><strong><span style="color:#0000ff;"><span style="font-size:1.625em;"><span style="background-color:#ffff00;">各位同學大家好：</span></span></span></strong></div>
<div><strong><span style="color:#0000ff;"><span style="font-size:1.625em;"><span style="background-color:#ffff00;">你們有學伴嗎？</span></span></span></strong></div>
<div><strong><span style="color:#0000ff;"><span style="font-size:1.625em;"><span style="background-color:#ffff00;">你們有外國學伴嗎？</span></span></span></strong></div>
<div><strong><span style="color:#0000ff;"><span style="font-size:1.625em;"><span style="background-color:#ffff00;">開學就是要交冰友～</span></span></span></strong></div>
<div><strong><span style="color:#0000ff;"><span style="font-size:1.625em;"><span style="background-color:#ffff00;">請到國際事務中心辦公室（行政大樓一樓電梯後面直直走）</span></span></span></strong><strong><span style="color:#0000ff;"><span style="font-size:1.625em;"><span style="background-color:#ffff00;">填寫你想要的參加的時段~</span></span></span></strong></div>
<div><strong><span style="color:#0000ff;"><span style="font-size:1.625em;"><span style="background-color:#ffff00;"></span></span></span></strong><span style="font-size: 1.625em;"><strong><span style="color:#0000ff;"><span style="background-color:#ffff00;">快點來唷～</span></span></strong></span></div>
<div><span style="font-size: 1.625em;"><br />
<span style="color:#ff0000;"><strong>*俄語與日語地點變更，請已預約的同學暨得到新地點喔！</strong></span></span></div>
<div> </div>
<div> </div>
<div><span style="font-size:1.625em;"><strong>洽詢電話：0...</strong></span></div>"
  [4]=>
  string(303) "<p>109學年國立臺東大學僑生及港澳生單獨招生</p>
<p>報名繳件時程：2019年10月4日(五)上午9:00起至11月15日(五)下午5:00止</p>
<p> </p>
<p>報名系統 Apply online:<a href="http://isenroll.nttu.edu.tw/" title="報名系統"> http://isenroll.nttu.edu.tw/</a></p>
..."
  [5]=>
  string(637) "<div><strong><span style="color:#ff0000;"><span style="font-size:1.75em;">2020 春季班報名繳件截止日期  2019 年 9 月 29 日</span></span></strong></div>
<div> </div>
<div><span style="color:#ff0000;"><span style="font-size:1.5em;"><strong>Application Deadlines for Spring Semester : 29 September 2019</strong></span></span></div>
<div> </div>
<div> </div>
<div> </div>
<div><span style="color:#0000ff;"><strong><span style="font-size:1.5em;">報名系統</span></strong></span></div>
<div><span style="color:#0000ff;"><strong><span style="font-size:1.5em;">Apply onlin...</span></strong></span></div>"
  [6]=>
  string(114) "<p><iframe frameborder="0" height="800" scrolling="no" src="/var/file/8/1008/img/375/206939753.pdf" width="100%..."
  [7]=>
  string(127) "<p><iframe frameborder="0" height="850" scrolling="no" src="/var/file/2/1002/img/1351/500124601.pdf" width="100%"></iframe></p>"
  [8]=>
  string(127) "<p><iframe frameborder="0" height="850" scrolling="no" src="/var/file/2/1002/img/1351/399377793.pdf" width="100%"></iframe></p>"
  [9]=>
  string(128) "<p><iframe frameborder="0" height="850" scrolling="no" src="/ var/file/2/1002/img/1351/632824446.pdf" width="100%"></iframe></p>"
  [10]=>
  string(152) "<p><iframe frameborder="0" height="900" scrolling="no" src="https://enews.nttu.edu.tw/var/file/45/1045/img/740/433611221.pdf" width="100%"></iframe></p>"
}
array(11) {
  [0]=>
  string(19) "2019-09-22 13:48:05"
  [1]=>
  string(19) "2019-09-19 00:00:00"
  [2]=>
  string(19) "2019-09-16 00:00:00"
  [3]=>
  string(19) "2019-09-12 00:00:00"
  [4]=>
  string(19) "2019-09-12 00:00:00"
  [5]=>
  string(19) "2019-09-09 00:00:00"
  [6]=>
  string(19) "2019-09-06 00:00:00"
  [7]=>
  string(19) "2019-09-04 00:00:00"
  [8]=>
  string(19) "2019-09-03 00:00:00"
  [9]=>
  string(19) "2019-09-03 00:00:00"
  [10]=>
  string(19) "2019-09-03 00:00:00"
}
array(11) {
  [0]=>
  string(22) "http://www.nttu.edu.tw"
  [1]=>
  string(45) "https://wdsa.nttu.edu.tw/p/404-1009-51852.php"
  [2]=>
  string(45) "https://wdsa.nttu.edu.tw/p/404-1009-91305.php"
  [3]=>
  string(43) "https://rd.nttu.edu.tw/p/404-1007-91275.php"
  [4]=>
  string(43) "https://rd.nttu.edu.tw/p/404-1007-91187.php"
  [5]=>
  string(43) "https://rd.nttu.edu.tw/p/404-1007-91140.php"
  [6]=>
  string(44) "https://dga.nttu.edu.tw/p/404-1008-91078.php"
  [7]=>
  string(43) "https://aa.nttu.edu.tw/p/404-1002-90906.php"
  [8]=>
  string(43) "https://aa.nttu.edu.tw/p/404-1002-90908.php"
  [9]=>
  string(43) "https://aa.nttu.edu.tw/p/404-1002-90907.php"
  [10]=>
  string(46) "https://enews.nttu.edu.tw/p/404-1045-90881.php"
}
array(10) {
  [0]=>
  string(15) "學生事務處"
  [1]=>
  string(15) "學生事務處"
  [2]=>
  string(15) "研究發展處"
  [3]=>
  string(15) "研究發展處"
  [4]=>
  string(15) "研究發展處"
  [5]=>
  string(9) "總務處"
  [6]=>
  string(9) "教務處"
  [7]=>
  string(9) "教務處"
  [8]=>
  string(9) "教務處"
  [9]=>
  string(15) "東大新聞網"
}
array(11) {
  [0]=>
  string(27) "臺東大學 - 重要消息"
  [1]=>
  string(94) "【學務處課外組】107學年度第2學期學業績優班級前三名獎學金申請公告"
  [2]=>
  string(57) "【學務處生輔組】忠孝樓遷回一宿申請公告"
  [3]=>
  string(103) "【研發處】Language Corner 語言交流夥伴活動~開始接受預約，亦歡迎現場報名唷^^"
  [4]=>
  string(115) "【研發處】109學年僑生及港澳生單獨招生簡章及申請時間公告(2020年9月入學，限學士班)"
  [5]=>
  string(127) "【研發處】外國學生申請入學(2020年春季班)現正報名中 International Student Admissions(Spring Semester 2020)"
  [6]=>
  string(92) "【總務處出納組】108學年度第一學期(進修學士班新生)繳交學雜費公告"
  [7]=>
  string(65) "【教務處】核發108-1舊生續領設籍臺東獎學金公告"
  [8]=>
  string(95) "【教務處】大一優秀新生獎學金、設籍臺東獎學金(含轉學新生)申請公告"
  [9]=>
  string(84) "【教務處】大一新生「運動、美術、音樂」績優獎學金申請公告"
  [10]=>
  string(46) "【秘書室】東大簡訊-13號刊(20190903)"
}
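
One detail in the output above is easy to miss: the author array has 10 entries while the other four have 11. The first title, link, description, and pubDate look like they belong to the feed's channel rather than to a news item, so those four arrays are most likely shifted by one position relative to the authors. One possible way to keep the fields aligned is to walk each item and read its own children. The sketch below is only an illustration: the sample feed is made up, and it uses PHP's built-in DOMDocument and DOMXPath instead of DomCrawler so that it stays self-contained.

```php
<?php

// Hypothetical sample feed: the channel has its own title/link/description/
// pubDate but no author, which mimics the 11-versus-10 mismatch seen above.
$sampleRss = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Sample channel</title>
    <link>http://example.com</link>
    <description>Channel description</description>
    <pubDate>2019-09-22 13:48:05</pubDate>
    <item>
      <title>First announcement</title>
      <link>http://example.com/p/1.php</link>
      <description>First body</description>
      <pubDate>2019-09-19 00:00:00</pubDate>
      <author>Sample Office A</author>
    </item>
    <item>
      <title>Second announcement</title>
      <link>http://example.com/p/2.php</link>
      <description>Second body</description>
      <pubDate>2019-09-16 00:00:00</pubDate>
      <author>Sample Office B</author>
    </item>
  </channel>
</rss>
XML;

$doc = new DOMDocument();
$doc->loadXML($sampleRss);
$xpath = new DOMXPath($doc);

$records = [];
foreach ($xpath->query('//item') as $item) {
    $record = [];
    foreach (['title', 'description', 'pubDate', 'link', 'author'] as $tag) {
        // Look only at direct children of this item, not the whole document.
        $node = $xpath->query($tag, $item)->item(0);
        $record[$tag] = $node === null ? null : trim($node->textContent);
    }
    $records[] = $record;
}

print_r($records);
```

With this shape, every record carries its own title, link, and author together, so a missing tag in one place cannot shift the others.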

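The results we expected have indeed been written into the designated arrays. Looking at the contents of the description tag, though, every extracted message still contains a large amount of HTML markup. Why? If we go back to the RSS feed, we can see that the "description" tag holds the content together with all of its HTML tags, stuffed in as is.

If you only want the plain-text message from this HTML and do not want to parse it a second time, consider removing all of the HTML tags.

So let's change the code in "lab1-1.php" that parses the "description" tag to the following: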

<?php

require_once __DIR__ . '/vendor/autoload.php';

use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;

$latestNews = 'https://www.nttu.edu.tw/p/503-1000-1009.php';
$client = new Client();
$response = $client->request('GET', $latestNews);

$latestNewsString = (string)$response->getBody();

$descriptions = [];
$crawler = new Crawler($latestNewsString);

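// Strip the HTML tags from each description, then delete spaces, newlines,
// carriage returns, and tabs from what is left.
$crawler
    ->filter('description')
    ->reduce(function (Crawler $node, $i) {
        global $descriptions;
        $descriptions[] = str_replace([" ", "\n", "\r", "\t"], "", strip_tags($node->text()));
    });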

var_dump($descriptions);

Did you notice? We use "strip_tags" to remove all of the HTML tags, and str_replace to replace \r, \n, \t, and spaces with an empty string.

The output then becomes:

array(11) {
  [0]=>
  string(45) "僅秘書室與人事室能發布置頂公告"
  [1]=>
  string(61) "本獎學金與其它獎學金不衝突,無重覆請領疑慮"
  [2]=>
  string(266) "108學年度第一學期忠孝樓申請搬遷回一宿公告說明目前一宿因休學、退學、轉學、放棄住宿而釋出床位，因此開放忠孝樓學生申請搬回一宿名額女生16床男生15床房型4人房申請時間9月16日中午12時00分~9..."
  [3]=>
  string(333) "各位同學大家好：你們有學伴嗎？你們有外國學伴嗎？開學就是要交冰友～請到國際事務中心辦公室（行政大樓一樓電梯後面直直走）填寫你想要的參加的時段~快點來唷～*俄語與日語地點變更，請已預約的同學暨得到新地點喔！  洽詢電話：0..."
  [4]=>
  string(204) "109學年國立臺東大學僑生及港澳生單獨招生報名繳件時程：2019年10月4日(五)上午9:00起至11月15日(五)下午5:00止 報名系統Applyonline:http://isenroll.nttu.edu.tw/..."
  [5]=>
  string(161) "2020春季班報名繳件截止日期 2019年9月29日 ApplicationDeadlinesforSpringSemester:29September2019   報名系統Applyonlin..."
  [6]=>
  string(0) ""
  [7]=>
  string(0) ""
  [8]=>
  string(0) ""
  [9]=>
  string(0) ""
  [10]=>
  string(0) ""
}
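
Two side effects show up in the output above: English words run together (for example "Applyonline"), and descriptions that contained only an iframe become empty strings. If that matters, one option is to collapse runs of whitespace into a single space instead of deleting them. The sketch below compares the two approaches on made-up sample strings; collapseWhitespace is a hypothetical helper name, and treating U+00A0 (non-breaking space) as whitespace is an assumption about what the leftover blanks in the real output are.

```php
<?php

// Same approach as above: strip tags, then delete all whitespace.
function removeAllWhitespace(string $html): string
{
    return str_replace([" ", "\n", "\r", "\t"], "", strip_tags($html));
}

// Alternative sketch: strip tags, then collapse each run of whitespace
// (including U+00A0) into one space and trim the ends.
function collapseWhitespace(string $html): string
{
    $text = strip_tags($html);
    $text = preg_replace('/[\s\x{00A0}]+/u', ' ', $text) ?? $text;

    return trim($text);
}

// Hypothetical samples shaped like the descriptions in the feed.
$sample = '<p>Apply online:<a href="http://example.com/"> http://example.com/</a></p>'
    . "\n" . '<p>Spring Semester</p>';
$iframeOnly = '<p><iframe src="/sample.pdf" width="100%"></iframe></p>';

var_dump(removeAllWhitespace($sample)); // "Applyonline:http://example.com/SpringSemester"
var_dump(collapseWhitespace($sample));  // "Apply online: http://example.com/ Spring Semester"
var_dump(collapseWhitespace($iframeOnly)); // ""
```

The iframe-only case stays empty either way, because there is no text between the tags to keep.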

The other fields are extracted in the same way, so I won't go through them one by one.

The same site has other similar news URLs, listed below:
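
Since every tag follows the same pattern, the repetition could be folded into one small helper. This is only a sketch under a few assumptions: extractTagTexts is a hypothetical name, the sample feed is made up, and it uses PHP's built-in DOMDocument rather than DomCrawler so that it runs on its own.

```php
<?php

/**
 * Hypothetical helper: return the trimmed text of every element named $tag.
 */
function extractTagTexts(string $xml, string $tag): array
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);

    $texts = [];
    foreach ($doc->getElementsByTagName($tag) as $node) {
        $texts[] = trim($node->textContent);
    }

    return $texts;
}

// Made-up sample feed for illustration.
$sampleRss = '<rss version="2.0"><channel><title>Sample channel</title>'
    . '<item><title>First</title><pubDate>2019-09-22 13:48:05</pubDate></item>'
    . '<item><title>Second</title><pubDate>2019-09-19 00:00:00</pubDate></item>'
    . '</channel></rss>';

$results = [];
foreach (['title', 'pubDate'] as $tag) {
    $results[$tag] = extractTagTexts($sampleRss, $tag);
}

print_r($results);
```

Like the code in this article, the helper matches a tag anywhere in the document, so the channel's own title is included in the title list.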