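After the past few days of crawler work, the crawling part of the course-query website is finally done. Today we will parse the course list we retrieved and extract the information we expect for each course.
First, as before, start the crawler environment with the following command.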
docker run --name=php_crawler -d -it php_crawler bash
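If the name php_crawler is already in use, remove the existing container first (stop it beforehand if it is still running), then run the command above again.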
docker rm php_crawler
Next, open lab2-1-fetch.php in your preferred editor and put the following code in it:
<?php
require_once __DIR__ . '/vendor/autoload.php';

use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;

$publicCourses = 'https://infosys.nttu.edu.tw/n_CourseBase_Select/CourseListPublic.aspx';

// Browser-like headers for the initial GET request.
$headers = [
    'Host' => 'infosys.nttu.edu.tw',
    'Connection' => 'keep-alive',
    'Cache-Control' => 'max-age=0',
    'Upgrade-Insecure-Requests' => '1',
    'Sec-Fetch-Mode' => 'navigate',
    'Sec-Fetch-User' => '?1',
    'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'Sec-Fetch-Site' => 'none',
    'Referer' => 'https://infosys.nttu.edu.tw/',
    'Accept-Encoding' => 'gzip, deflate, br',
    'Accept-Language' => 'zh-TW,zh;q=0.9,en-US;q=0.8,en;q=0.7',
    'User-Agent' => 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13',
];

// Keep cookies between requests so the POST below reuses the same session.
$client = new Client(['cookies' => true]);
$response = $client->request('GET', $publicCourses, [
    'headers' => $headers,
]);
$publicCourseString = (string)$response->getBody();
// Hidden ASP.NET form fields that must be sent back with the POST request.
$viewState = '';
$eventValidation = '';
$viewStateGenerator = '5D156DDA';

$crawler = new Crawler($publicCourseString);
$crawler
    ->filter('input[type="hidden"]')
    ->each(function (Crawler $node) use (&$viewState, &$eventValidation) {
        // Capture by reference instead of relying on global variables.
        $name = $node->attr('name');
        if ($name === '__VIEWSTATE') {
            $viewState = $node->attr('value');
        }
        if ($name === '__EVENTVALIDATION') {
            $eventValidation = $node->attr('value');
        }
    });
// Form fields for the query: semester 1081, first page of results.
$formParams = [
    'form_params' => [
        'ToolkitScriptManager1' => 'UpdatePanel1|Button3',
        'ToolkitScriptManager1_HiddenField' => '',
        'DropDownList1' => '1081',
        'DropDownList6' => '1',
        'DropDownList2' => '%',
        'DropDownList3' => '%',
        'DropDownList4' => '%',
        'TextBox9' => '',
        'DropDownList5' => '%',
        'DropDownList7' => '%',
        'TextBox1' => '',
        'DropDownList8' => '%',
        'TextBox6' => '0',
        'TextBox7' => '14',
        '__EVENTTARGET' => '',
        '__EVENTARGUMENT' => '',
        '__LASTFOCUS' => '',
        '__VIEWSTATE' => $viewState,
        '__VIEWSTATEGENERATOR' => $viewStateGenerator,
        '__SCROLLPOSITIONX' => '0',
        '__SCROLLPOSITIONY' => '0',
        '__EVENTVALIDATION' => $eventValidation,
        '__VIEWSTATEENCRYPTED' => '',
        '__ASYNCPOST' => 'false',
        'Button3' => '查詢',
    ],
    // Guzzle expects header names as array keys, not "Name: value" strings in a list.
    'headers' => [
        'Sec-Fetch-Mode' => 'cors',
        'Origin' => 'https://infosys.nttu.edu.tw',
        'Accept-Encoding' => 'gzip, deflate, br',
        'Accept-Language' => 'zh-TW,zh;q=0.9,en-US;q=0.8,en;q=0.7',
        'X-Requested-With' => 'XMLHttpRequest',
        'Connection' => 'keep-alive',
        'X-MicrosoftAjax' => 'Delta=true',
        'Accept' => '*/*',
        'Cache-Control' => 'no-cache',
        'Referer' => 'https://infosys.nttu.edu.tw/n_CourseBase_Select/CourseListPublic.aspx',
        'Sec-Fetch-Site' => 'same-origin',
        'User-Agent' => 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36',
    ],
];

$response = $client->request('POST', $publicCourses, $formParams);
$coursesString = (string)$response->getBody();
$crawler = new Crawler($coursesString);

// Field names, in the order the cells appear in each table row.
$orders = [
    'mandatory',
    'class',
    'category',
    'number',
    'name',
    'outline',
    'credit',
    'people_limit',
    'people_least',
    'people_select_course',
    'people_course',
    'teacher',
    'teach_time',
    'teach_place',
    'advanced_course',
    'merged_class',
    'p.s',
    'course_limit_info',
    'special_course',
];

// Column-oriented result: one list of values per field name.
$courses = array_fill_keys($orders, []);

$crawler
    ->filter('tr[class="NTTU_GridView_Row"] td')
    ->each(function (Crawler $node, $i) use (&$courses, $orders) {
        // Each row has count($orders) cells (19 here), so the remainder is the column position.
        $index = $i % count($orders);
        // Strip spaces and line breaks from the cell text.
        $text = str_replace([' ', "\n", "\r"], '', $node->text());
        $courses[$orders[$index]][] = $text;
    });

var_dump($courses);
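The hidden-field step in the script can be tried on its own, without any network access. The sketch below is only an illustration: it uses PHP's built-in DOMDocument and DOMXPath instead of DomCrawler so that it runs without Composer, and both the HTML string and the helper name extractHiddenFields are made up for this example rather than taken from the real page.

```php
<?php
// Made-up sample markup standing in for the real ASP.NET page.
$html = <<<'HTML'
<html><body><form>
<input type="hidden" name="__VIEWSTATE" value="sample-view-state">
<input type="hidden" name="__EVENTVALIDATION" value="sample-event-validation">
<input type="text" name="TextBox1" value="ignored">
</form></body></html>
HTML;

/**
 * Hypothetical helper: returns [name => value] for every hidden input.
 */
function extractHiddenFields(string $html): array
{
    $document = new DOMDocument();
    // Suppress the warnings libxml raises on imperfect real-world HTML.
    @$document->loadHTML($html);
    $xpath = new DOMXPath($document);

    $fields = [];
    foreach ($xpath->query('//input[@type="hidden"]') as $input) {
        $fields[$input->getAttribute('name')] = $input->getAttribute('value');
    }

    return $fields;
}

$fields = extractHiddenFields($html);
$viewState = $fields['__VIEWSTATE'] ?? '';
$eventValidation = $fields['__EVENTVALIDATION'] ?? '';
```

Collecting every hidden input into one array keeps the lookup by name explicit, and an empty string is used as the fallback when a field is missing.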
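The extraction in lab2-1-fetch.php works as follows:
Everything before the $orders variable fetches the first page of the course list for the first semester of academic year 108.
The response is parsed with DOM Crawler, using the CSS selector tr[class="NTTU_GridView_Row"] td.
Each row has 19 columns, and the $i argument passed to the callback is the running index of the cell. Taking $i modulo 19 therefore gives the position of the column the current value belongs to; that position is looked up in $orders to get the field name, and the value is appended to the array stored under that key.
That is how today's course extraction works. You may also have noticed a column called "教學大綱" (syllabus, the outline field). It is actually a link to an external syllabus page that describes the course overview and outline.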
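We need to capture that link as well, so tomorrow's task is to retrieve the syllabus link for each course. There is no real need to scrape the content behind the link; recording the link itself is enough.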
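To make the modulo mapping described earlier concrete, here is a minimal sketch in plain PHP. It assumes a hypothetical three-column table instead of the real 19 columns, and the cell texts are made up. The final pivot into one array per course ($rows) is not part of the original script; it is only one possible way to read the column-oriented result row by row.

```php
<?php
// Hypothetical three-column layout; the real table in this article has 19 columns.
$orders = ['number', 'name', 'credit'];

// Made-up cell texts in document order: two rows of three cells each.
$cells = ['CS101', 'Intro to Programming', '3', 'CS102', 'Data Structures', '3'];

// Column-oriented result, one empty list per field (same shape as $courses in the script).
$courses = array_fill_keys($orders, []);

$columnCount = count($orders);
foreach ($cells as $i => $text) {
    // The remainder tells us which column the i-th cell belongs to.
    $courses[$orders[$i % $columnCount]][] = $text;
}

// Optional pivot: build one associative array per course row.
$rows = [];
foreach (array_keys($courses['number']) as $rowIndex) {
    // array_column picks the value at $rowIndex from every field list, in $orders order.
    $rows[] = array_combine($orders, array_column($courses, $rowIndex));
}

print_r($rows[1]);
```

With the real script, $orders holds 19 names, so the same loop amounts to $i % 19.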