python - I am trying to web scrape http://angel.co/bloomfire

I am trying to scrape data from the website https://angel.co/bloomfire

import requests
from bs4 import BeautifulSoup

res = requests.get('https://angel.co/pen-io')
soup = BeautifulSoup(res.content, 'html.parser')
print(soup.prettify())

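This prints a page whose title tag is "Page not found - 404 - AngelList". In a web browser the site works fine, but its page source differs from the output of my Python script. I also tried selenium with phantomjs, but it shows the same content.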

Best answer

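It looks like angel.co responds with an HTTP 404 depending on the User-Agent that is sent, and it appears to block the default requests user agent (possibly depending on the version). This is probably meant to keep bots out.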

Some output from my ipython session follows. I am using requests/2.17.3.

Using the default Python requests user agent

In [37]: rsp = requests.get('https://angel.co/bloom')
In [38]: rsp.status_code
Out[38]: 404
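
To confirm which agent string requests attaches by default, a small check along these lines may help. It is only a sketch: the request is prepared but never sent, so nothing reaches angel.co, and the exact version in the string depends on the installed requests release.

```python
import requests

# The default agent string has the form 'python-requests/<version>'.
default_agent = requests.utils.default_user_agent()
print(default_agent)

# Prepare (but do not send) a request to inspect the headers a Session would attach.
session = requests.Session()
prepared = session.prepare_request(requests.Request('GET', 'https://angel.co/bloom'))
print(prepared.headers['User-Agent'])
```

Both lines should print the same string, which is the value the server sees when no User-Agent header is set explicitly.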

Using a Mozilla-compatible user agent

In [39]: rsp = requests.get('https://angel.co/bloom', headers={'User-Agent': 'Mozilla/5.0'})

In [40]: rsp.status_code
Out[40]: 200

rsp.content contains what you would expect to see from angel.co/bloom.

Using some random user agent

In [41]: rsp = requests.get('https://angel.co/bloom', headers={'User-Agent': 'birryree angel scraper'})

In [42]: rsp.status_code
Out[42]: 200


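So you should set a User-Agent to get around whatever filtering or blocking angel.co applies to the various default agents.

If you are going to do a lot of scraping, I suggest being a good citizen and setting an agent string that lets them contact you if your scraping causes problems, for example: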

requests.get('https://angel.co/bloom',
             headers={'User-Agent': 'Mozilla/5.0 (compatible; http://yoursite.com)'})
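
For a larger scraping job it may be more convenient to set the header once on a requests.Session so that every request reuses it. The following is a minimal sketch under that assumption; make_session is a hypothetical helper name, and http://yoursite.com is a placeholder for your own contact URL.

```python
import requests

def make_session(contact_url):
    """Return a Session whose requests all carry a contact-bearing User-Agent.

    make_session is a hypothetical helper, not part of requests.
    """
    session = requests.Session()
    session.headers.update(
        {'User-Agent': 'Mozilla/5.0 (compatible; {})'.format(contact_url)}
    )
    return session

session = make_session('http://yoursite.com')  # placeholder contact URL
print(session.headers['User-Agent'])

# Every call on the session now sends the header above, e.g.:
# rsp = session.get('https://angel.co/bloom')
# rsp.raise_for_status()
```

The network calls are left commented out so the snippet runs without contacting the site; uncomment them when you are ready to send requests.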

Regarding python - I am trying to web scrape http://angel.co/bloomfire, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/46078340/
