python - Debugging while web scraping with Scrapy?

Tags: python scrapy

I am trying to scrape a website with Scrapy (download delay of 10 seconds + AUTOTHROTTLE_ENABLED = True + ROBOTSTXT_OBEY = True). When I run the command:

scrapy crawl myspider -o mydata.csv

I get several lines of output like the following, up until the 405:
[protego] DEBUG: Rule at line 4 without any user agent to enforce it on.

and:
[scrapy.spidermiddlewares.httperror] INFO: Ignoring response <405 https://www.funda.nl/koop/utrecht/>: HTTP status code is not handled or not allowed

Why can't Scrapy scrape this website?

Here are the dumped stats (in case they are useful):
2019-11-03 00:33:55 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 417,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 23402,
 'downloader/response_count': 2,
 'downloader/response_status_count/405': 2,
 'elapsed_time_seconds': 13.42723,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 11, 2, 23, 33, 55, 12777),
 'httperror/response_ignored_count': 1,
 'httperror/response_ignored_status_count/405': 1,
 'log_count/DEBUG': 61,
 'log_count/INFO': 11,
 'memusage/max': 51818496,
 'memusage/startup': 51818496,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/405': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2019, 11, 2, 23, 33, 41, 585547)}

Best answer

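You need to send a request that looks exactly like one from a real browser. The site answers Scrapy's default request with 405, so copy the headers your browser sends and pass them with the request: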

headers = {
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'DNT': '1',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36',
    'Sec-Fetch-User': '?1',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Mode': 'navigate',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9',
}

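# Yield this from inside the spider, e.g. in start_requests() or a callback,
# so that the browser-like headers are sent with the request.
yield scrapy.Request('https://www.funda.nl/koop/utrecht/', headers=headers)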
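
The stats in the question show that the robots.txt request was also answered with 405 ('robotstxt/response_status_count/405': 1), and headers passed to a single scrapy.Request do not cover that request. One possible alternative, sketched below as a settings.py fragment, is to set the browser-like values project-wide through Scrapy's USER_AGENT and DEFAULT_REQUEST_HEADERS settings. The header values are copied from the answer above, and the delay and throttle values mirror the question; whether the site accepts them is an assumption that has not been verified here.

```python
# settings.py (sketch): browser-like headers applied to every request the
# project sends, including the robots.txt fetch made when ROBOTSTXT_OBEY is on.
# Assumption: the target site accepts these values. They are copied from the
# answer above and may need refreshing from your own browser's dev tools.

ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 10           # seconds, as described in the question
AUTOTHROTTLE_ENABLED = True

# Replaces Scrapy's default "Scrapy/x.y (+https://scrapy.org)" user agent.
USER_AGENT = (
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36'
)

# Accept-Encoding is deliberately left out: Scrapy's HttpCompressionMiddleware
# sets it itself and only advertises encodings it is able to decode.
DEFAULT_REQUEST_HEADERS = {
    'Accept': (
        'text/html,application/xhtml+xml,application/xml;q=0.9,'
        'image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3'
    ),
    'Accept-Language': 'en-US,en;q=0.9',
    'Cache-Control': 'max-age=0',
    'DNT': '1',
    'Upgrade-Insecure-Requests': '1',
    'Sec-Fetch-User': '?1',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Mode': 'navigate',
}
```

With these settings in place, the spider can yield scrapy.Request('https://www.funda.nl/koop/utrecht/') without a headers argument. If the site still responds with 405, the block is probably based on something other than headers.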

For more on "python - Debugging while web scraping with Scrapy?", see the similar question on Stack Overflow: https://stackoverflow.com/questions/58676238/
