python - Web scraping with Python sometimes returns results and sometimes fails with HTTP 429

Tags: python html python-3.x beautifulsoup

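I'm trying to scrape the video from a Reddit page using Python and Beautiful Soup. The code below sometimes returns a result, and sometimes returns nothing when I re-run it. I'm not sure where I'm going wrong. Can someone help? I'm new to Python, so please bear with me.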

import requests
from bs4 import BeautifulSoup


page = requests.get('https://www.reddit.com/r/FortNiteBR/comments/afjbbp/just_trying_to_revive_my_buddy_and_then_he_got/')

soup = BeautifulSoup(page.text, 'html.parser')

source_tags = soup.find_all('source')

print(source_tags)

Best answer

If you add print(page) right after your page = requests.get('https:/.........'), you will see that the request succeeds with <Response [200]>

But if you run it again right away, you will get <Response [429]>

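"The HTTP 429 Too Many Requests response status code indicates the user has sent too many requests in a given amount of time ("rate limiting")." (Source: MDN Web Docs)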

Also, if you look at the HTML that comes back with the 429 response, you will see:

<h1>whoa there, pardner!</h1>
<p>we're sorry, but you appear to be a bot and we've seen too many requests
from you lately. we enforce a hard speed limit on requests that appear to come
from bots to prevent abuse.</p>
<p>if you are not a bot but are spoofing one via your browser's user agent
string: please change your user agent string to avoid seeing this message
again.</p>
<p>please wait 6 second(s) and try again.</p>
<p>as a reminder to developers, we recommend that clients make no
    more than <a href="http://github.com/reddit/reddit/wiki/API">one
    request every two seconds</a> to avoid seeing this message.</p>
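
The reminder at the end of that message (no more than one request every two seconds) can be followed by spacing requests out on the client side. The following is only a minimal sketch of that idea, not something from the original answer: the Throttle class and its clock and sleep parameters are hypothetical names introduced here for illustration, and the 2-second default simply mirrors the message above.

```python
import time


class Throttle:
    """Hypothetical helper: keeps at least min_interval seconds between calls."""

    def __init__(self, min_interval=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None        # time of the previous call, None before the first one

    def wait(self):
        """Block until min_interval has passed since the previous wait() call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

To use it, create one Throttle() and call its wait() method immediately before each requests.get(...) call, so that consecutive requests are at least two seconds apart.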

To avoid the 429, send a browser-like User-Agent header with the request:

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"}

page = requests.get('https://www.reddit.com/r/FortNiteBR/comments/afjbbp/just_trying_to_revive_my_buddy_and_then_he_got/', headers=headers)

Full code:

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"}

page = requests.get('https://www.reddit.com/r/FortNiteBR/comments/afjbbp/just_trying_to_revive_my_buddy_and_then_he_got/', headers=headers)
print(page)

soup = BeautifulSoup(page.text, 'html.parser')

source_tags = soup.find_all('source')

print(source_tags)

Output:

<Response [200]>
[<source src="https://v.redd.it/et9so1j0z6a21/HLSPlaylist.m3u8" type="application/vnd.apple.mpegURL"/>]

After waiting a second or two between runs, I re-ran it several times without any problems.
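
If a 429 still shows up now and then, one option is to wait and retry instead of parsing the error page. The sketch below is an illustration under stated assumptions, not part of the original answer: get_with_retry and its parameters are hypothetical names, session can be any object with a requests-style get() method (for example requests.Session()), and it assumes the server may send a Retry-After header with a number of seconds, which is not guaranteed.

```python
import time


def get_with_retry(session, url, headers=None, max_attempts=4,
                   default_delay=2.0, sleep=time.sleep):
    """Hypothetical helper: retry a GET while the server answers 429.

    Returns the last response received, which may still be a 429
    if every attempt was rate limited.
    """
    for attempt in range(1, max_attempts + 1):
        response = session.get(url, headers=headers, timeout=10)
        if response.status_code != 429:
            return response
        if attempt == max_attempts:
            break
        # Assumption: Retry-After, when present, holds whole seconds.
        # Any other value (missing, or an HTTP date) falls back to a
        # delay that grows with the attempt number.
        retry_after = response.headers.get("Retry-After", "")
        delay = float(retry_after) if retry_after.isdigit() else default_delay * attempt
        sleep(delay)
    return response
```

With the code above, the request line would become page = get_with_retry(requests.Session(), url, headers=headers), where url is the Reddit URL, followed by a check that page.status_code is 200 before passing page.text to BeautifulSoup.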

A similar question on this topic can be found on Stack Overflow: https://stackoverflow.com/questions/54180671/
