python - Web scraping with Python sometimes returns results and sometimes results in HTTP 429

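I'm trying to scrape the video from a Reddit page using Python and Beautiful Soup. The code below sometimes returns results, and sometimes returns nothing when I re-run it. I'm not sure where I'm going wrong. Can someone help? I'm new to Python, so please bear with me.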

import requests
from bs4 import BeautifulSoup


page = requests.get('https://www.reddit.com/r/FortNiteBR/comments/afjbbp/just_trying_to_revive_my_buddy_and_then_he_got/')

soup = BeautifulSoup(page.text, 'html.parser')

source_tags = soup.find_all('source')

print(source_tags)

Best answer

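If you add print(page) right after your page = requests.get('https:/.........'), you will see that the request succeeds with <Response [200]>

But if you run it again quickly, you get <Response [429]>

"The HTTP 429 Too Many Requests response status code indicates the user has sent too many requests in a given amount of time ("rate limiting")." (Quoted from the source linked in the original answer.)

Also, if you look at the HTML that comes back with the 429 response, you will see: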

<h1>whoa there, pardner!</h1>
<p>we're sorry, but you appear to be a bot and we've seen too many requests
from you lately. we enforce a hard speed limit on requests that appear to come
from bots to prevent abuse.</p>
<p>if you are not a bot but are spoofing one via your browser's user agent
string: please change your user agent string to avoid seeing this message
again.</p>
<p>please wait 6 second(s) and try again.</p>
<p>as a reminder to developers, we recommend that clients make no
    more than <a href="http://github.com/reddit/reddit/wiki/API">one
    request every two seconds</a> to avoid seeing this message.</p>
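The message above tells the client how long to wait before trying again. As a rough sketch (not part of the original answer), that hint could be read out of the page and used to retry. The names seconds_to_wait and get_with_retry are hypothetical, the 2-second default is taken from the "one request every two seconds" recommendation, and fetch is assumed to be any callable that returns an object with status_code and text attributes, such as a requests response:

```python
import re
import time


def seconds_to_wait(html, default=2.0):
    """Return the wait hinted at in the block page, or `default` if absent."""
    match = re.search(r"please wait (\d+) second", html)
    return float(match.group(1)) if match else default


def get_with_retry(fetch, max_attempts=3, sleep=time.sleep):
    """Call fetch() until it returns a non-429 response or attempts run out."""
    response = fetch()
    for _ in range(max_attempts - 1):
        if response.status_code != 429:
            break
        # Wait as long as the page asks before trying again.
        sleep(seconds_to_wait(response.text))
        response = fetch()
    return response
```

With requests, this might be called as get_with_retry(lambda: requests.get(url, headers=headers)). The last response is returned even if it is still a 429, so the caller should check status_code before parsing.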

To avoid the 429, add a User-Agent header to the request:

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"}

page = requests.get('https://www.reddit.com/r/FortNiteBR/comments/afjbbp/just_trying_to_revive_my_buddy_and_then_he_got/', headers=headers)

Full code:

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"}

page = requests.get('https://www.reddit.com/r/FortNiteBR/comments/afjbbp/just_trying_to_revive_my_buddy_and_then_he_got/', headers=headers)
print(page)

soup = BeautifulSoup(page.text, 'html.parser')

source_tags = soup.find_all('source')

print(source_tags)

Output:

<Response [200]>
[<source src="https://v.redd.it/et9so1j0z6a21/HLSPlaylist.m3u8" type="application/vnd.apple.mpegURL"/>]

After waiting a second or two between runs, I re-ran this several times without any problems.
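If several pages are fetched from one script, the same pause can be built into the code. The following is a minimal sketch, assuming the two-second spacing that Reddit's message recommends; the Throttle class is a hypothetical helper, not something provided by requests:

```python
import time


class Throttle:
    """Block so that successive wait() calls are at least min_interval apart."""

    def __init__(self, min_interval=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous wait() call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Too soon since the last request: sleep off the difference.
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling throttle.wait() immediately before each requests.get(...) keeps the requests spaced out. The first call returns at once, and later calls sleep only for whatever part of the interval has not yet elapsed.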

A similar question about this topic can be found on Stack Overflow: https://stackoverflow.com/questions/54180671/
