I tried to scrape example.com, but the site started blocking me after about 100 pages.
How can I fix this?
Would AWS help avoid the blocking?
Best answer
See the guidance on the Scrapy FAQ page:
Avoiding getting banned

Some websites implement certain measures to prevent bots from crawling them, with varying degrees of sophistication. Getting around those measures can be difficult and tricky, and may sometimes require special infrastructure. Please consider contacting commercial support if in doubt.
Here are some tips to keep in mind when dealing with these kinds of sites:
- rotate your user agent from a pool of well-known ones from browsers (google around to get a list of them)
- disable cookies (see COOKIES_ENABLED) as some sites may use cookies to spot bot behaviour
- use download delays (2 or higher); see the DOWNLOAD_DELAY setting
- if possible, use Google cache to fetch pages, instead of hitting the sites directly
- use a pool of rotating IPs (for example, the free Tor project or paid services like ProxyMesh)
- use a highly distributed downloader that circumvents bans internally, so you can just focus on parsing clean pages (one example of such downloaders is Crawlera)
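The user-agent rotation, cookie, and delay tips above can be sketched roughly as follows (a minimal sketch: `COOKIES_ENABLED`, `DOWNLOAD_DELAY`, and `DOWNLOADER_MIDDLEWARES` are real Scrapy settings, but the middleware class name, settings path, and user-agent pool here are illustrative assumptions, not part of the FAQ):

```python
import random

# Assumption: a small illustrative pool; in practice use a larger list of
# current browser user-agent strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
]

class RotateUserAgentMiddleware:
    """Downloader middleware: pick a random User-Agent for every request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        return None  # returning None lets Scrapy continue handling the request

# In settings.py (the dotted middleware path depends on your project layout):
#
# COOKIES_ENABLED = False   # avoid cookie-based bot detection
# DOWNLOAD_DELAY = 2        # seconds to wait between requests
# DOWNLOADER_MIDDLEWARES = {
#     "myproject.middlewares.RotateUserAgentMiddleware": 400,
# }
```

Each request then goes out with a different browser-like User-Agent header, which makes simple fingerprinting of a single static agent less effective.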
If you are still unable to prevent your bot getting banned, consider contacting commercial support.
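The rotating-IP tip can be wired in the same way, since Scrapy reads a per-request proxy from `request.meta["proxy"]`. A minimal sketch, assuming you already run proxy endpoints (the addresses and class name below are placeholders, not real services):

```python
import random

# Assumption: placeholder proxy endpoints; replace with your own Tor or
# ProxyMesh-style proxy addresses.
PROXIES = [
    "http://127.0.0.1:8118",
    "http://proxy.example.net:31280",
]

class RotateProxyMiddleware:
    """Downloader middleware: route each request through a random proxy."""

    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(PROXIES)
        return None  # continue normal request processing
```

Enable it via `DOWNLOADER_MIDDLEWARES` in settings.py, just like the user-agent middleware; Scrapy's built-in HttpProxyMiddleware then honors the `proxy` meta key when downloading.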
Regarding "Python Scrapy - IP netmask", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/21437718/