python - 999 response when trying to scrape LinkedIn with Scrapy

Tags: python web-scraping scrapy

I'm trying to extract some information from LinkedIn using the Scrapy framework. I know they are very strict with people who try to scrape their site, so I tried different user agents in settings.py. I also specified a high download delay, but it still seems to block me immediately.

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0) Gecko/20100101 Firefox/39.0'
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 2
REDIRECT_ENABLED = False
RETRY_ENABLED = False
DEPTH_LIMIT = 5
DOWNLOAD_TIMEOUT = 10
REACTOR_THREADPOOL_MAXSIZE = 20
CONCURRENT_REQUESTS_PER_DOMAIN = 2
COOKIES_ENABLED = False
HTTPCACHE_ENABLED = True

This is the error I get:

2017-03-20 19:11:29 [scrapy.core.engine] INFO: Spider opened
2017-03-20 19:11:29 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min),
scraped 0 items (at 0 items/min)
2017-03-20 19:11:29 [scrapy.extensions.telnet] DEBUG: Telnet console listening on
127.0.0.1:6023
2017-03-20 19:11:29 [scrapy.core.engine] DEBUG: Crawled (999) <GET
https://www.linkedin.com/directory/people-1/> (referer: None) ['cached']
2017-03-20 19:11:29 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response
<999 https://www.linkedin.com/directory/people-1/>: HTTP status code is not handled or 
not allowed
2017-03-20 19:11:29 [scrapy.core.engine] INFO: Closing spider (finished)
2017-03-20 19:11:29 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 282,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 2372,
'downloader/response_count': 1,
'downloader/response_status_count/999': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 3, 20, 17, 11, 29, 503000),
'httpcache/hit': 1,
'log_count/DEBUG': 2,
'log_count/INFO': 8,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2017, 3, 20, 17, 11, 29, 378000)}
2017-03-20 19:11:29 [scrapy.core.engine] INFO: Spider closed (finished)

The spider itself just prints the visited URLs.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class InfoSpider(CrawlSpider):
    name = "info"
    allowed_domains = ["www.linkedin.com"]
    start_urls = ['https://www.linkedin.com/directory/people-1/']
    rules = [
        Rule(LinkExtractor(
            allow=[r'.*']),
            callback='parse',
            follow=True)
    ]

    def parse(self, response):
        print(response.url)

Best answer

Pay close attention to the headers in your requests. LinkedIn requires the following headers in every request before it will serve a response.

headers = {
    "accept" : "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "accept-encoding" : "gzip, deflate, sdch, br",
    "accept-language" : "en-US,en;q=0.8,ms;q=0.6",
    "user-agent" : "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"
}
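One way to attach these headers to every outgoing request is to override `start_requests` in the spider. This is a sketch based on the spider from the question; whether LinkedIn actually accepts the request also depends on their bot detection, so no particular outcome is guaranteed:

```python
import scrapy

# Headers from the answer above; the user-agent here replaces the one
# set in settings.py for these requests.
HEADERS = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "accept-encoding": "gzip, deflate, sdch, br",
    "accept-language": "en-US,en;q=0.8,ms;q=0.6",
    "user-agent": ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                   "(KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"),
}

class InfoSpider(scrapy.Spider):
    name = "info"
    start_urls = ["https://www.linkedin.com/directory/people-1/"]

    def start_requests(self):
        # Explicitly pass the headers with each initial request.
        for url in self.start_urls:
            yield scrapy.Request(url, headers=HEADERS, callback=self.parse)

    def parse(self, response):
        print(response.url)
```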

You can refer to the documentation for more information.
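A settings-level alternative is Scrapy's `DEFAULT_REQUEST_HEADERS` setting, which merges the given headers into every request. A sketch for settings.py (note that this setting does not control the user agent; that is still governed by `USER_AGENT`):

```python
# settings.py — sketch: send the headers LinkedIn expects with every request.
USER_AGENT = ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36")

DEFAULT_REQUEST_HEADERS = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "accept-encoding": "gzip, deflate, sdch, br",
    "accept-language": "en-US,en;q=0.8,ms;q=0.6",
}
```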

Regarding "python - 999 response when trying to scrape LinkedIn with Scrapy", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/42910269/
