Web scraping question: I can't make sense of the child tags

Web scraping

chifeng 2019-09-23 05:49:06 ‧ 3480 views


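Beginner question here; I'd appreciate any help.
Below is the record I want to scrape from a library catalog. I can already find the book title: 深入淺出Python
I used
title = soup.select_one('span.briefcitTitle a')
print(title.text)
and that gets the title fine. But I also want the other fields, such as author, publisher, publication year, and loan due date. How do I scrape those?
For example, there is a <br /> in front of the author. Doesn't that just mean a line break? How do I identify the child tags?

Thanks in advance for the help!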

<!-- Right Result rank 1 -->
<tr  class="browseSuperEntry browseEntryRelGroup1"><td colspan="1"><img src="/screens/relevance5.gif" alt="最有關的">&nbsp;最有關的 標題&nbsp;條目 1-184</td></tr>
<!-- Rel 2007 "Skyline" Example Set -->
<!-- This File Last Changed: 15 February 2008 -->
<tr>
<td class="briefCitRow">
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tr valign="top">
<td width="40" align="center" class="briefcitEntry" >
<div class="briefcitEntryNum">
<a name='anchor_1'></a> 1<!--this is customized <screens/briefcit_cht.html>-->
</div>
<div class="briefcitMedia">
 <img src="/screens/media_book.gif" alt="印刷型資料"></div>
<input type="checkbox" 
name="save" value="b2214278" >
</td>
<td align="left" class="briefcitDetail">
<!--{nohitmsg}-->
<span class="briefcitTitle">
<a href="/search~S2*cht?/Xpython&searchscope=2&SORT=DZ/Xpython&searchscope=2&SORT=DZ&extended=0&SUBKEY=python/0%2C291%2C291%2CB/frameset&FF=Xpython&searchscope=2&SORT=DZ&1%2C1%2C">深入淺出Python</a></span>
<br />
巴瑞 (Barry, Paul)<br />
臺北市 : 碁峰資訊, 2019[民108]<!--
<div>
2019</div>
-->
<br/>
<span>評級:</span>

<span id="rategroup1"><a href="/patroninfo~S2*cht/0/redirect=/search~S2*cht?/Xpython&searchscope=2&SORT=DZ/Xpython&searchscope=2&SORT=DZ&extended=0&SUBKEY=python/0%2C291%2C291%2CB/browse#anchor_1"><img src="/screens/rate_no.gif" border="0" width="75" height="14" title="No one has rated this material" /></a>

</span><div class="briefcitRequest">
<a href="/search~S2*cht?/Xpython&searchscope=2&SORT=DZ/Xpython&searchscope=2&SORT=DZ&extended=0&SUBKEY=python/0%2C291%2C291%2CC/requestbrowse~b2214278&FF=Xpython&searchscope=2&SORT=DZ&1%2C1%2C"><img src="/screens/bullet.gif" alt="Bullet Point" border="0" style="margin-right:5px"/>預約</a></div>
<span class="briefcitStatus">
<font color="red">
</font>
</span>
<div class="briefcitActions">
&nbsp;</div>
<div class="briefcitItems">
<table width="100%" border="0" cellspacing="1" cellpadding="2" class="bibItems">
<tr  class="bibItemsHeader">
<th width="18%"  class="bibItemsHeader">
館藏地
</th>
<th width="13%"  class="bibItemsHeader">
條碼
</th>
<th width="20%"  class="bibItemsHeader">
索書號
</th>
<th width="18%"  class="bibItemsHeader">
冊次
</th>
<th width="6%"  class="bibItemsHeader">
年代
</th>
<th width="25%"  class="bibItemsHeader">
狀態 (<a href="/screens/status_description_cht.html" target="blank"><font color="BLUE">說明</font></a>)
</th>
</tr>
<tr  class="bibItemsEntry">

<td width="18%" ><!-- field 1 -->&nbsp;1F 中文書庫 
</td>
<td width="13%" ><!-- field b -->&nbsp;C322014 </td>
<td width="20%" ><!-- field C -->&nbsp;312.32P97 550-2 2019 <a href="/search~S2*cht?/l312.32P97+550-2+2019/l312.32p97++++++550-+++++++2/-3,-1,,B/browse">《鄰近架位館藏》</a></td>
<td width="18%" ><!-- field v --></td>
<td width="6%" ><!-- field d -->&nbsp;2019 </td>
<td width="25%" ><!-- field % -->&nbsp;到期 19-09-23 +1 已催還 </td></tr>
</table>

</div>
</td>
<td align="center" width="5%">
<a href="http://findbook.tw/book/9789864769902/basic" target="_parent"><img src="/bookjacket?recid=b2214278&size=0" border="0" alt="書本封面"></a></td>
</tr>
</table>
</td>
</tr>


ccutmis iT邦高手 level 2 ‧ 2019-09-23 07:48:31

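soup's selectors work the same way as CSS selectors. If you're not familiar with how CSS selectors work, have a look here:
http://www.runoob.com/cssref/css-selectors.html
Once you understand how CSS selectors work, you can use
soup.select('tr.bibItemsHeader th.bibItemsHeader')
to get a list of the header cells (館藏地, 條碼, 索書號, 冊次, 年代, 狀態). Each item in the list is a th element, so read its .text to get the label.
This works because the HTML structure here is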
<tr class="bibItemsHeader">
<th class="bibItemsHeader">...</th>
<th class="bibItemsHeader">...</th>
…
<th class="bibItemsHeader">...</th>
<th class="bibItemsHeader">...</th>
</tr>
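
To make the selector concrete, here is a minimal sketch, assuming BeautifulSoup (bs4) is installed and using a trimmed copy of the table from the question (only three of the six columns). Pairing the headers with the cells of the tr.bibItemsEntry row, the `row` dict, and the date regex are my own illustration, not something the catalog documents; the regex simply assumes the due date always looks like 19-09-23.

```python
import re

from bs4 import BeautifulSoup

# Trimmed copy of the holdings table from the question (three columns only).
html = """
<table class="bibItems">
<tr  class="bibItemsHeader">
<th width="18%"  class="bibItemsHeader">
館藏地
</th>
<th width="13%"  class="bibItemsHeader">
條碼
</th>
<th width="25%"  class="bibItemsHeader">
狀態 (<a href="/screens/status_description_cht.html" target="blank"><font color="BLUE">說明</font></a>)
</th>
</tr>
<tr  class="bibItemsEntry">
<td width="18%" ><!-- field 1 -->&nbsp;1F 中文書庫 
</td>
<td width="13%" ><!-- field b -->&nbsp;C322014 </td>
<td width="25%" ><!-- field % -->&nbsp;到期 19-09-23 +1 已催還 </td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# select() returns a list of <th> elements; get_text() turns each into a string.
headers = [th.get_text(strip=True)
           for th in soup.select("tr.bibItemsHeader th.bibItemsHeader")]

# The data row uses a different class; get_text() skips the HTML comments.
cells = [td.get_text(strip=True)
         for td in soup.select("tr.bibItemsEntry td")]

# Pair each header with the cell in the same column.
row = dict(zip(headers, cells))

# Assumption: the status text contains the due date as YY-MM-DD.
match = re.search(r"\d{2}-\d{2}-\d{2}", row["狀態 (說明)"])
due_date = match.group(0) if match else None

print(row)
print(due_date)
```

If a record has several copies there will be several tr.bibItemsEntry rows, so in that case it is probably safer to loop over the rows and build one dict per row.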
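The recent Ironman contest (鐵人賽) has quite a few good articles on Python and CSS that are worth looking up.
P.S. Personally I don't use soup; I prefer regular expressions... ^~^"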
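
On the <br /> part of the question: yes, <br /> is only a line break, and it has no children. In the pasted HTML the author and the publisher line are not inside any tag of their own; they are plain text nodes sitting next to span.briefcitTitle inside td.briefcitDetail, so a CSS selector cannot target them directly. One possible approach, sketched below under the assumption that bs4 is installed and that the layout matches the snippet in the question, is to walk the siblings after the title span and keep the non-empty text. The way the publisher line is split (on " : " and ", ") is a guess based on this single record and may not hold for others.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Trimmed copy of the detail cell from the question.
html = """
<td align="left" class="briefcitDetail">
<span class="briefcitTitle">
<a href="#">深入淺出Python</a></span>
<br />
巴瑞 (Barry, Paul)<br />
臺北市 : 碁峰資訊, 2019[民108]<!--
<div>
2019</div>
-->
<br/>
<span>評級:</span>
</td>
"""

soup = BeautifulSoup(html, "html.parser")
title_span = soup.select_one("span.briefcitTitle")

lines = []
for node in title_span.next_siblings:
    # Stop at the next <span> (the rating block); <br> tags are just skipped.
    if isinstance(node, Tag) and node.name == "span":
        break
    # Keep plain text nodes, but not HTML comments (Comment is a string subclass).
    if isinstance(node, NavigableString) and not isinstance(node, Comment):
        text = node.strip()
        if text:
            lines.append(text)

author, imprint = lines

# Assumption: the imprint looks like "place : publisher, year[...]".
place, rest = imprint.split(" : ", 1)
publisher, year_part = rest.split(", ", 1)
year = year_part[:4]

print(author, place, publisher, year)
```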