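I'm trying to download YouTube videos with requests by scraping the site myself. (I know there are libraries that do this directly, but an update to YouTube's anti-scraping measures broke Pytube, which I had been using, so I want to try writing it myself; that way I won't have to start over from scratch the next time YouTube changes its anti-scraping measures.)
My progress so far: I have successfully found the URLs YouTube uses to fetch the video and audio files.
The problem: I have read many tutorials on scraping with requests, but they all download the video with GET, and since the anti-scraping update it now has to be POST. I switched to POST, but I still can't get the file. Here is my code: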
import requests
import re
import json
import aiohttp
import aiofiles
import os
import asyncio
import random
url = 'https://www.youtube.com/watch?v=7LkIUfpX-k0'
user_agent_list = ["Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; …) Gecko/20100101 Firefox/61.0",
"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
"Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.5; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15",
]
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
"Range": "bytes=0-",
"Accept": "*/*",
"Connection": "keep-alive",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "en-US,en;q=0.9"
}
cookies = {
"YSC": "e2LSE4IOe5g",
"VISITOR_INFO1_LIVE": "BHEJagtnezo",
"VISITOR_PRIVACY_METADATA": "CgJUVxIEGgAgag%3D%3D",
"PREF": "f4=4000000&tz=Asia.Taipei",
"GPS": "1"
}
response = requests.get(url=url, headers=headers, cookies=cookies)
ans = re.findall(
'var ytInitialPlayerResponse = (.*?);var', response.text)[0]
ans = json.loads(ans)
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
"Referer": "https://www.youtube.com/",
"Range": "bytes=0-",
"Accept": "*/*",
"Connection": "keep-alive",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "en-US,en;q=0.9"
}
url = ans['streamingData']['adaptiveFormats'][-2]['url']
headers['Content-length'] = ans['streamingData']['adaptiveFormats'][-2]['contentLength']
headers['User-Agent'] = random.choice(user_agent_list)
video = requests.post(url=url, headers=headers, cookies=cookies)
print(video.status_code)
if video.status_code == 200:
    print(video.content)
    with open('video.webm', mode='wb') as file:
        file.write(video.content)
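A side note on the `ytInitialPlayerResponse` extraction in the code above: the non-greedy regex stops at the first `;var`, which could also occur inside a JSON string. A minimal sketch of an alternative, assuming the page still contains the literal marker `var ytInitialPlayerResponse = ` followed directly by a JSON object, lets the JSON parser find the end of the object. The function name `extract_player_response` is hypothetical, not part of any library.

```python
import json

# Assumption: the watch page embeds the player response right after this marker.
MARKER = "var ytInitialPlayerResponse = "


def extract_player_response(html: str) -> dict:
    """Parse the JSON object that follows MARKER, ignoring whatever comes after it."""
    start = html.find(MARKER)
    if start == -1:
        raise ValueError("ytInitialPlayerResponse marker not found")
    # raw_decode parses exactly one JSON value beginning at the given index
    # and returns it together with the index where that value ends.
    obj, _end = json.JSONDecoder().raw_decode(html, start + len(MARKER))
    return obj
```

With this, `ans = extract_player_response(response.text)` would replace the `re.findall` and `json.loads` pair.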
Before I added Content-length to the headers, the request kept returning status code 403; but after I added Content-length to the cookie, it keeps raising the following error:
Traceback (most recent call last):
File "C:\Users\Brian\AppData\Local\Programs\Python\Python311\Lib\site-packages\urllib3\connectionpool.py", line 793, in urlopen
response = self._make_request(
^^^^^^^^^^^^^^^^^^^
File "C:\Users\Brian\AppData\Local\Programs\Python\Python311\Lib\site-packages\urllib3\connectionpool.py", line 537, in _make_request
response = conn.getresponse()
^^^^^^^^^^^^^^^^^^
File "C:\Users\Brian\AppData\Local\Programs\Python\Python311\Lib\site-packages\urllib3\connection.py", line 466, in getresponse
httplib_response = super().getresponse()
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Brian\AppData\Local\Programs\Python\Python311\Lib\http\client.py", line 1378, in getresponse
response.begin()
File "C:\Users\Brian\AppData\Local\Programs\Python\Python311\Lib\http\client.py", line 318, in begin
version, status, reason = self._read_status()
^^^^^^^^^^^^^^^^^^^
File "C:\Users\Brian\AppData\Local\Programs\Python\Python311\Lib\http\client.py", line 287, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\Brian\AppData\Local\Programs\Python\Python311\Lib\site-packages\requests\adapters.py", line 667, in send
resp = conn.urlopen(
^^^^^^^^^^^^^
File "C:\Users\Brian\AppData\Local\Programs\Python\Python311\Lib\site-packages\urllib3\connectionpool.py", line 847, in urlopen
retries = retries.increment(
^^^^^^^^^^^^^^^^^^
File "C:\Users\Brian\AppData\Local\Programs\Python\Python311\Lib\site-packages\urllib3\util\retry.py", line 470, in increment
raise reraise(type(error), error, _stacktrace)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Brian\AppData\Local\Programs\Python\Python311\Lib\site-packages\urllib3\util\util.py", line 38, in reraise
raise value.with_traceback(tb)
File "C:\Users\Brian\AppData\Local\Programs\Python\Python311\Lib\site-packages\urllib3\connectionpool.py", line 793, in urlopen
response = self._make_request(
^^^^^^^^^^^^^^^^^^^
File "C:\Users\Brian\AppData\Local\Programs\Python\Python311\Lib\site-packages\urllib3\connectionpool.py", line 537, in _make_request
response = conn.getresponse()
^^^^^^^^^^^^^^^^^^
File "C:\Users\Brian\AppData\Local\Programs\Python\Python311\Lib\site-packages\urllib3\connection.py", line 466, in getresponse
httplib_response = super().getresponse()
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Brian\AppData\Local\Programs\Python\Python311\Lib\http\client.py", line 1378, in getresponse
response.begin()
File "C:\Users\Brian\AppData\Local\Programs\Python\Python311\Lib\http\client.py", line 318, in begin
version, status, reason = self._read_status()
^^^^^^^^^^^^^^^^^^^
File "C:\Users\Brian\AppData\Local\Programs\Python\Python311\Lib\http\client.py", line 287, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "c:\Users\Brian\Desktop\coding-practice\youtube\demo.py", line 54, in <module>
video = requests.post(url=url, headers=headers, cookies=cookies)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Brian\AppData\Local\Programs\Python\Python311\Lib\site-packages\requests\api.py", line 115, in post
return request("post", url, data=data, json=json, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Brian\AppData\Local\Programs\Python\Python311\Lib\site-packages\requests\api.py", line 59, in request
return session.request(method=method, url=url, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Brian\AppData\Local\Programs\Python\Python311\Lib\site-packages\requests\sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Brian\AppData\Local\Programs\Python\Python311\Lib\site-packages\requests\sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Brian\AppData\Local\Programs\Python\Python311\Lib\site-packages\requests\adapters.py", line 682, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
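I hope someone can tell me how to fix this, thanks. (If possible, I'd rather not use selenium, because I plan to deploy this to a server later and selenium is inconvenient there.)
(This is my first time asking a question, so it may be a bit disorganized; I hope you don't mind.)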
POST /videoplayback?expire=1724652730&ei=WsjLZqLaAqyNvcAP-P3HqQw&ip=220.130.216.120&id=o-ADgOY5ErqaKlaTHoqc4UdV8vRTNZDHiIG57JQSA1qJYf&source=youtube&requiressl=yes&xpc=EgVo2aDSNQ%3D%3D&mh=ru&mm=31%2C29&mn=sn-ipoxu-umbk%2Csn-un57snee&ms=au%2Crdu&mv=m&mvi=2&pl=24&ctier=A&pfa=5&initcwndbps=986250&hightc=yes&siu=1&spc=Mv1m9uTGGUanOfycH_wktspC2_HeA8d3gTzAfuv1AbITXDqlY6WSKi3DL-shLOU8TTlU6KH-Vg&svpuc=1&ns=b3R0trgOANLZimS9tV6BGP0Q&sabr=1&rqh=1&mt=1724630692&fvip=3&keepalive=yes&c=WEB&n=4ry5RPz8gxCZ6A&sparams=expire%2Cei%2Cip%2Cid%2Csource%2Crequiressl%2Cxpc%2Cctier%2Cpfa%2Chightc%2Csiu%2Cspc%2Csvpuc%2Cns%2Csabr%2Crqh&sig=AJfQdSswRgIhAM9fr4Lp8lDTLIpd0fgrW0vgO_5YYTrh7Fgc2znfYLz8AiEAo5W8FdfV0lClvQ2gkXSEv8-z3ctnucTbc3u67M2kEow%3D&lsparams=mh%2Cmm%2Cmn%2Cms%2Cmv%2Cmvi%2Cpl%2Cinitcwndbps&lsig=AGtxev0wRAIgO89CGLmgQXzcE1N_79mKra54ekJMdiiiVLSzjY6dU0QCIGy3kKQNykZ80-QvIiIWLxNv8uAVNkGjfsB_yFM_Rqr2&cpn=bKGSZJcyKECqVBBq&cver=2.20240823.01.00&rn=2 HTTP/1.1
Host: rr2---sn-ipoxu-umbk.googlevideo.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:129.0) Gecko/20100101 Firefox/129.0
Accept: */*
Accept-Language: zh-TW,zh;q=0.8,en-US;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate, br, zstd
Content-Length: 2196
Referer: https://www.youtube.com/
Origin: https://www.youtube.com
Sec-Fetch-Dest: empty
Sec-Fetch-Mode: cors
Sec-Fetch-Site: cross-site
Connection: keep-alive
Priority: u=4
Pragma: no-cache
Cache-Control: no-cache
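I've never seriously scraped it, but the request headers for a YouTube video should look like the ones above.
Why would Content-Length go in the cookie?
I'd also suggest using an existing library that someone else has already written.
Also, cookies copied from the browser should only be used for testing while you develop the scraper. Cookies can expire...
Once it's developed, you should go through the login mechanism to make proper requests. You could even use selenium to log in and grab the cookies, since login flows often involve page redirects, and then send the requests with requests once you have them.
selenium has a headless mode; once the browser is installed, launching it from code won't open a window.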
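I went back over the payload and looked up a lot of material, and you're right that the payload I sent was wrong (I used the value from the response to the first requests call). The missing response was probably also caused by the wrong content-length. The cookies probably aren't actually required, but on the other hand I'm missing the cpn and rn it needs (cpn may require reverse engineering its base.js directly, and rn is probably a sequence number, but I don't know how that sequence is determined).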
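One way to see exactly which query parameters differ is to compare the URL taken from `adaptiveFormats` with the one captured from the browser. The sketch below uses only the standard library; `CAPTURED_URL` is shortened from the browser capture earlier in this thread (most parameters omitted), and the helper names are hypothetical. Treating `expire` as a Unix timestamp is an assumption, not something confirmed here.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit

# Shortened from the captured browser request; most parameters are omitted.
CAPTURED_URL = (
    "https://rr2---sn-ipoxu-umbk.googlevideo.com/videoplayback"
    "?expire=1724652730&sabr=1&cpn=bKGSZJcyKECqVBBq&rn=2"
)


def query_params(url: str) -> dict:
    """Return the first value of every query parameter in url."""
    parsed = parse_qs(urlsplit(url).query)
    return {name: values[0] for name, values in parsed.items()}


def missing_params(candidate_url: str, reference_url: str) -> list:
    """Names present in the reference URL's query but absent from the candidate's."""
    have = query_params(candidate_url)
    return sorted(name for name in query_params(reference_url) if name not in have)


def expiry_utc(url: str):
    """Interpret `expire` as a Unix timestamp (assumption); None if it is absent."""
    expire = query_params(url).get("expire")
    if expire is None:
        return None
    return datetime.fromtimestamp(int(expire), tz=timezone.utc)
```

Calling `missing_params(url_from_adaptive_formats, CAPTURED_URL)` would list the names, such as `cpn` and `rn`, that the browser sends and the scraped URL lacks.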
Seeing content-length in the cookie is what made me suspicious, because normally it goes in the headers.
Uh... I did put it in the headers!
headers['Content-length'] = ans['streamingData']['adaptiveFormats'][-2]['contentLength']
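The problem is probably that the content-length value is wrong; it shouldn't be taken straight from the first response. I compared them, and the length in the response is at least 10 times the real one, so content-length can probably be left out.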
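The size mismatch makes sense if `contentLength` in `adaptiveFormats` is the size of the media file rather than anything about the request. The sketch below works under that assumption and also picks a format by type instead of by position (`[-2]`). The `mimeType` field and the sample entries are assumptions made for illustration; only `url` and `contentLength` appear in the code in this thread.

```python
def pick_format(adaptive_formats: list, mime_prefix: str):
    """Return the first entry with a url whose mimeType starts with mime_prefix."""
    for fmt in adaptive_formats:
        if fmt.get("mimeType", "").startswith(mime_prefix) and "url" in fmt:
            return fmt
    return None


def media_size_bytes(fmt: dict) -> int:
    # contentLength is a string in the JSON, so convert it before doing arithmetic.
    return int(fmt["contentLength"])


# Made-up sample shaped like ans['streamingData']['adaptiveFormats'].
sample_formats = [
    {"mimeType": 'video/webm; codecs="vp9"', "url": "https://example.invalid/v",
     "contentLength": "3456789"},
    {"mimeType": 'audio/webm; codecs="opus"', "url": "https://example.invalid/a",
     "contentLength": "234567"},
]

audio = pick_format(sample_formats, "audio/")
print(media_size_bytes(audio))  # size of the media to download, not a request header value
```

Under this reading, the value is useful for checking that a finished download is complete, and it should not be copied into the request's Content-Length header.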
"but after I added Content-length to the cookie, it keeps raising the following error"
That's from your original post above.
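Content-length in the request headers is the length of the body carried by the HTTP POST request; it should be the size of the data sent along with that POST above.
You can see that some of the parameters are hashed, so the parameters are probably a fixed size.
What you send is the request, and the video Google sends back is the response. I don't know whether you can leave it out; it may well be that the server sees the size doesn't match the request format and drops you right away.
The server checks this, so my usual approach is to include everything the browser sends, no matter what.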
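To make the Content-Length point concrete, here is a minimal sketch that serializes a bare HTTP/1.1 POST by hand and then compares the declared length with the bytes that actually follow the headers. Both helper names are hypothetical, and the 2196-byte body is a placeholder sized to match the Content-Length in the browser capture, not the real payload.

```python
def build_post_request(host: str, path: str, body: bytes) -> bytes:
    """Serialize a minimal HTTP/1.1 POST; Content-Length is the body's byte count."""
    head = (
        f"POST {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    )
    return head.encode("ascii") + body


def declared_and_actual_length(raw: bytes) -> tuple:
    """Return (Content-Length header value, number of body bytes actually present)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    declared = 0
    for line in head.split(b"\r\n")[1:]:  # skip the request line
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            declared = int(value.strip())
    return declared, len(body)


raw = build_post_request("example.invalid", "/videoplayback", b"\x00" * 2196)
print(declared_and_actual_length(raw))
```

The query string is part of the request line, so it does not count toward Content-Length. When a body is passed through `data=`, requests fills in this header itself, so setting it manually is normally unnecessary. A plausible reading of the traceback, though not confirmed in this thread, is that the server was told to expect a body as large as the whole media file, received none, and closed the connection.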
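If you're really interested, go read the source of an existing package. They've already written it, which is faster than the two of us guessing here.
Uh... I did think about reading the yt-dlp source directly, but after digging for ages I couldn't find where it does the scraping @_@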
https://github.com/yt-dlp/yt-dlp/blob/master/yt_dlp/extractor/youtube.py
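This is the most likely place, though I'd guess what gets sent is generated automatically; I didn't see the settings above in it.
After all, this package appears to check your browser first for stored cookies and use them if present; apart from the parts that need manual input, it's probably all automated.
Oh! OK, thanks, I'll take another look. (My goodness, it's messy in there too!)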