bs4's find() is being treated as str's find() and I can't fix it

bs4 find() python

the_rabbit 2022-12-14 16:41:57 ‧ 1123 views

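I've recently been learning BS4 and ran into this problem.
body = inner.find("div" ,itemprop = "articleBody") raises the error
find() takes no keyword arguments

After changing it to find_all, I was told that str has no such attribute.

But when I pull just this part out into a separate file to test it, it does fetch the article body I wanted.

How can I solve this?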
import bs4
root=bs4.BeautifulSoup(data,"html.parser")  # data is the HTML source fetched from the web; bs4 parses it with the HTML parser
titleLinks = root.find_all("div",class_="c-articleItem__title")

page = root.find("a",class_="c-pagination c-pagination--next")
for titleLink in titleLinks:
    titles = titleLink.a.text
    articleLink = "https://www.mobile01.com/" + titleLink.a["href"]
    ws.cell(i,1,i)
    ws.cell(i,2,titles)
    ws.cell(i,3,articleLink)
    mWeb.save("mobile.xlsx")
    request=req.Request(articleLink,headers={
        "User-Agent":"Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Mobile Safari/537.36"
    })
    with req.urlopen(request) as response:

        inner = response.read().decode("utf-8")
        body = inner.find("div" ,itemprop = "articleBody")
        article = body.text
        ws.cell(i,6,article)
        print(article)
        mWeb.save("mobile.xlsx")
        i = i+1
        
n = 1
# Grab the time and author info
titleInfos = root.find_all("div",class_="l-listTable__td l-listTable__td--time")
for titleInfo in titleInfos:
    author = titleInfo.div.a.text
    #timeInfo = titleInfo.div.next_sibling.text
    timeInfo = titleInfos.find("div" , class_ = "o-fNotes")
    ws.cell(n,4,author)
    ws.cell(n,5,timeInfo)
    mWeb.save("mobile.xlsx")
    n = n+1

url = "https://www.mobile01.com/" + page["href"]

One more question: for some reason the line titleInfos = root.find_all("div",class="l-listTabletd l-listTabletd--time" does not seem to run.

alien663 iT邦研究生 Level 1 ‧ 2022-12-14 17:00:09

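You called decode("UTF-8") on what you read back, so of course the result is a string. You have to hand it to BS4 so that it is parsed and returned as an object.

titleInfos = root.find_all("div",class="l-listTabletd l-listTabletd--time" This part looks like it simply didn't match anything. Are you sure your search condition is written correctly?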
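To illustrate the point above, here is a minimal sketch assuming the decoded response body looks like the small HTML string below (the markup is a hypothetical stand-in, not the real page). It shows that calling find() on the decoded str fails, and that parsing the str with BeautifulSoup first makes the attribute filter work.

```python
import bs4

# Hypothetical stand-in for response.read().decode("utf-8")
inner = '<html><body><div itemprop="articleBody">Hello article</div></body></html>'

# inner is a plain str, so inner.find(...) is str.find,
# which does not accept keyword arguments.
try:
    inner.find("div", itemprop="articleBody")
    str_find_failed = False
except TypeError:
    str_find_failed = True

# Parse the string first; the resulting object has the bs4 find().
article_soup = bs4.BeautifulSoup(inner, "html.parser")
body = article_soup.find("div", itemprop="articleBody")

# find() returns None when nothing matches, so guard before reading text.
article = body.get_text() if body is not None else ""
print(article)
```

In the loop from the question, this would mean building a second BeautifulSoup object from each article page before searching it.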

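the_rabbit iT邦新手 Level 5 ‧ 2022-12-14 17:56:06

Oh, right. My first problem is solved, thank you.
For the second one, I can't see where the match goes wrong, so I posted the target site's HTML. To get <a>Adenko</a>, I should grab the div with class="l-listTabletd l-listTabletd--time" and then go down to the a, right? The class on the a itself appears on every a. But from what I just saw, it really doesn't seem to have matched anything.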

re.Zero iT邦研究生 Level 5 ‧ 2022-12-14 19:20:30

Update: please ignore this part. I misread, and only noticed later that the code you posted does use "class_" rather than the "class" you two were discussing...
[Reference](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class): "class" is a Python reserved word, so BS4 uses "class_" as the keyword argument instead.
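As a small sketch of the class_ keyword mentioned above, assuming a hypothetical fragment shaped like the list cells in the question (the real page markup may differ): class_ matches individual class tokens, and a string with several classes is matched against the exact attribute value.

```python
import bs4

# Hypothetical markup; only the class names are taken from the question.
html = """
<div class="l-listTable__td l-listTable__td--time"><div><a href="#">Adenko</a></div></div>
<div class="l-listTable__td">other cell</div>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# Exact value of the class attribute (same classes, same order).
exact = soup.find_all("div", class_="l-listTable__td l-listTable__td--time")

# A single class token is enough to match an element that has it.
by_token = soup.find_all("div", class_="l-listTable__td--time")

# The shared token matches both cells.
by_base = soup.find_all("div", class_="l-listTable__td")

# The attrs dict is an alternative spelling that avoids the reserved word.
by_attrs = soup.find_all("div", attrs={"class": "l-listTable__td--time"})

author = exact[0].div.a.get_text(strip=True)
print(len(exact), len(by_token), len(by_base), len(by_attrs), author)
```

Matching on the single, more specific token is usually less fragile than the full multi-class string, because it does not depend on the order of the classes in the page source.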

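re.Zero iT邦研究生 Level 5 ‧ 2022-12-14 20:04:15

Since my answer quota is full, I'll explain here; work out the formatting yourself:

1. First use:
for titleInfo in titleInfos: print('■ ',str(titleInfo)[:70])
   to check that all elements have the same format, otherwise later steps will fail (although you could handle it sloppily with "try; except:continue"; plenty of people get spoiled that way...).
2. Where does the ".text" in "titleInfo.div.a.text" come from? In the official docs:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#contents-and-children
   I couldn't find it; I guess you meant "titleInfo.div.a.string"?
3. This line:
timeInfo = titleInfos.find("div" , class_ = "o-fNotes")
   Why does it use "titleInfos" rather than "titleInfo"?
   Also, to get the text you can use:
timeInfo = titleInfo.find("div" , class_ = "o-fNotes").string
   (if nothing unexpected happens.)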
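The three points above can be sketched roughly as follows. The HTML fragment and the names in it are hypothetical, and the variable names are my own. The sketch searches each element (titleInfo) rather than the whole result list (titleInfos), and skips cells that do not have the expected shape. As far as I know, .text is a property equivalent to get_text(), while .string only returns a value when the tag has a single child.

```python
import bs4

# Hypothetical fragment with two list cells of the same shape.
html = """
<div class="l-listTable__td l-listTable__td--time">
  <div><a href="#">Adenko</a></div>
  <div class="o-fNotes">2022-12-14 16:41</div>
</div>
<div class="l-listTable__td l-listTable__td--time">
  <div><a href="#">the_rabbit</a></div>
  <div class="o-fNotes">2022-12-14 17:56</div>
</div>
"""
soup = bs4.BeautifulSoup(html, "html.parser")
title_infos = soup.find_all("div", class_="l-listTable__td--time")

# find_all returns a list-like ResultSet; it has no find() of its own.
try:
    title_infos.find("div", class_="o-fNotes")
    result_set_has_find = True
except AttributeError:
    result_set_has_find = False

rows = []
for title_info in title_infos:
    author_tag = title_info.div.a if title_info.div is not None else None
    time_tag = title_info.find("div", class_="o-fNotes")
    if author_tag is None or time_tag is None:
        continue  # skip cells that do not have the expected structure
    # Convert tags to plain strings before writing them to a spreadsheet cell.
    rows.append((author_tag.get_text(strip=True), time_tag.get_text(strip=True)))

print(rows)
```

Checking for None explicitly keeps the loop going on irregular rows without resorting to a blanket try/except.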

the_rabbit iT邦新手 Level 5 ‧ 2022-12-14 23:46:37

re.Zero I understand now, thank you.
