URL: https://turnip.exchange/islands
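Question: https://turnip.exchange/islands shows another page before the main page loads, and that is not the page I want to scrape. How do I get through to the next page?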
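<?php
// Send a browser-like User-Agent and keep the response body even on HTTP error statuses.
$context = stream_context_create(
    array(
        "http" => array(
            'ignore_errors' => true,
            "header" => "User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
        )
    )
);
$text = file_get_contents('https://turnip.exchange/islands', false, $context);
//echo $text; // This already shows it is not the page I want. How do I get to the next page?
//preg_match('/Did You Know\?/', $text, $match); // Only matches the loading page
preg_match('/data-turnip-code=/', $text, $match); // The fetched source lacks the markup of the page shown after loading; a copy of the main page saved from the browser does contain it when opened offline
print_r($match);
?>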
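"The page source does not show the page that appears after loading"
That is because the content is rendered by JavaScript, so it does not appear when you view the page source directly.
If you want to get the JavaScript-rendered HTML in PHP, see the approaches in this Stack Overflow question: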
https://stackoverflow.com/questions/28505501/get-the-content-text-of-an-url-after-javascript-has-run-with-php
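Use PhantomJS: write a JavaScript script that loads the page you want to scrape, run that script from PHP, and read back the rendered HTML.
Or have a PHP script echo the JavaScript-rendered page, then use AJAX after a fixed delay to retrieve the rendered HTML.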
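The PHP side of the PhantomJS approach could look roughly like the sketch below. It assumes PhantomJS is installed and that a script named `render.js` (a hypothetical name, not from the linked question) loads the URL it is given and prints the rendered HTML to stdout. `buildRenderCommand` and `fetchRenderedHtml` are illustrative helper names.

```php
<?php
// Hypothetical helper: builds the shell command that runs a PhantomJS script.
// 'render.js' is an assumed script that prints the rendered HTML to stdout.
function buildRenderCommand(string $binary, string $script, string $url): string
{
    // escapeshellarg() quotes each part so the URL cannot inject shell syntax.
    return implode(' ', array_map('escapeshellarg', array($binary, $script, $url)));
}

// Runs the command and returns the rendered HTML, or null if nothing was printed.
function fetchRenderedHtml(string $url, string $binary = 'phantomjs', string $script = 'render.js'): ?string
{
    $html = shell_exec(buildRenderCommand($binary, $script, $url));
    return is_string($html) && $html !== '' ? $html : null;
}

// Usage (requires PhantomJS and the assumed render.js on the machine):
// $html = fetchRenderedHtml('https://turnip.exchange/islands');
// if ($html !== null && preg_match('/data-turnip-code=/', $html)) { echo "found\n"; }
```

Keeping the command construction in its own function makes it possible to check the quoting without actually launching a browser.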
I had a quick look, and they have an endpoint at 'https://api.turnip.exchange/islands/':
This should work:
<?php
$url = 'https://api.turnip.exchange/islands/';
$ch = curl_init();
$headers = array(
    'Accept: */*',
    'Content-Type: application/json',
    'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
);
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, true);
// Send the search filters as a JSON body.
curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode(array("category" => "turnips", "islander" => "neither", 'patreon' => 1)));
// Return the response as a string; otherwise curl_exec() prints it and returns true, and echo appends a stray "1".
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$output = curl_exec($ch);
curl_close($ch);
echo $output;
?>
{"success":true,"message":"Island Results!","islands":[{"name":"No Islands","fruit":"apple","turnipPrice":666,"turnipCode":"00000000","hemisphere":"north","islandTime":"2020-04-20T01:00:00.000Z","creationTime":"2020-04-20 01:00:00","description":"No Islands were found with the search results you selected!","queued":999}],"$$time":67388681.10040212}
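Once the response is available as a string, it can be decoded with `json_decode` and the island entries read from the `islands` key. The following is a minimal sketch; `extractIslands` is an illustrative name, and treating the entry with `turnipCode` "00000000" as a "no results" placeholder is an assumption based only on the sample response above, not on any API documentation.

```php
<?php
// Decodes the API response and returns the island entries as associative arrays.
// Assumption: an entry with turnipCode "00000000" is the "No Islands" placeholder
// seen in the sample response, not a real island.
function extractIslands(string $json): array
{
    $data = json_decode($json, true);
    // Invalid JSON, a failed request, or a missing "islands" list all yield no islands.
    if (!is_array($data) || empty($data['success']) || !isset($data['islands']) || !is_array($data['islands'])) {
        return array();
    }
    return array_values(array_filter($data['islands'], function ($island) {
        return is_array($island) && ($island['turnipCode'] ?? '') !== '00000000';
    }));
}

// Usage with the response string from the request:
// foreach (extractIslands($output) as $island) {
//     echo $island['name'], ': ', $island['turnipPrice'], "\n";
// }
```

With the sample response shown above, this returns an empty array, because the only entry is the placeholder.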