python - Scrapy: get the websites that fail with "DNS lookup failed"

Tags: python web-scraping scrapy web-crawler scrapy-spider

I am trying to use Scrapy to collect all the links that point to sites failing with a "DNS lookup failed" error.

The problem is that every site that responds without an error gets printed from the parse_obj method, but when a URL ends in a DNS lookup failure, the parse_obj callback is never invoked.

I want to collect every domain that produces the "DNS lookup failed" error. How can I do that?

Log:

2016-03-08 12:55:12 [scrapy] INFO: Spider opened
2016-03-08 12:55:12 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-03-08 12:55:12 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-03-08 12:55:12 [scrapy] DEBUG: Crawled (200) <GET http://domain.com> (referer: None)
2016-03-08 12:55:12 [scrapy] DEBUG: Retrying <GET http://expired-domain.com/> (failed 1 times): DNS lookup failed: address 'expired-domain.com' not found: [Errno 11001] getaddrinfo failed.

Code:

from urlparse import urlparse  # Python 3: from urllib.parse import urlparse

from scrapy import Item, Field
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor


class MyItem(Item):
    url = Field()


class someSpider(CrawlSpider):
    name = 'Crawler'
    start_urls = ['http://domain.com']
    rules = (Rule(LxmlLinkExtractor(allow=()), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        item = MyItem()
        item['url'] = []
        # print the scheme://host/ of every link found on successfully crawled pages
        for link in LxmlLinkExtractor(allow=()).extract_links(response):
            parsed_uri = urlparse(link.url)
            url = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
            print(url)
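
For context on why parse_obj never fires: Scrapy only calls a request's callback for successful responses, while download-level errors such as DNS failures are routed to the request's errback. Plain scrapy.Request objects accept an errback argument; a minimal sketch of that pattern (the spider name, URL and method names are placeholders):

import scrapy
from twisted.internet.error import DNSLookupError


class SingleRequestSpider(scrapy.Spider):
    name = 'single_request_example'

    def start_requests(self):
        # errback can be set on hand-built Requests; CrawlSpider rules
        # do not expose it, which is the problem addressed in the answer below
        yield scrapy.Request('http://expired-domain.com/',
                             callback=self.parse_ok,
                             errback=self.on_error)

    def parse_ok(self, response):
        self.logger.info('got %s', response.url)

    def on_error(self, failure):
        if failure.check(DNSLookupError):
            self.logger.error('DNS lookup failed for %s', failure.request.url)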

Best answer

CrawlSpider rules do not allow passing errbacks (which is a shame).

Here is a variation of another answer of mine, adapted to catch DNS errors:

# -*- coding: utf-8 -*-
import random

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError


class HttpbinSpider(CrawlSpider):
    name = "httpbin"

    # this will generate test links so that we can see CrawlSpider in action
    start_urls = (
        'https://httpbin.org/links/10/0',
    )
    rules = (
        Rule(LinkExtractor(),
             callback='parse_page',
             # hook to be called when this Rule generates a Request
             process_request='add_errback'),
    )

    # disable retries so failed requests are not retried in this example spider
    custom_settings = {
        'RETRY_ENABLED': False
    }

    # method to be called for each Request generated by the Rules above,
    # here, adding an errback to catch all sorts of errors
    def add_errback(self, request):
        self.logger.debug("add_errback: patching %r" % request)

        # this is a hack to trigger a DNS error randomly
        rn = random.randint(0, 2)
        if rn == 1:
            newurl = request.url.replace('httpbin.org', 'httpbin.organisation')
            self.logger.debug("add_errback: patching url to %s" % newurl)
            return request.replace(url=newurl,
                                   errback=self.errback_httpbin)

        # this is the general case: adding errback to all requests
        return request.replace(errback=self.errback_httpbin)

    def parse_page(self, response):
        self.logger.info("parse_page: %r" % response)

    def errback_httpbin(self, failure):
        # log all errback failures,
        # in case you want to do something special for some errors,
        # you may need the failure's type
        self.logger.error(repr(failure))

        if failure.check(HttpError):
            # you can get the response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)

        elif failure.check(TimeoutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)

This is what you get on the console:

$ scrapy crawl httpbin
2016-03-08 15:16:30 [scrapy] INFO: Scrapy 1.0.5 started (bot: httpbinlinks)
2016-03-08 15:16:30 [scrapy] INFO: Optional features available: ssl, http11
2016-03-08 15:16:30 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'httpbinlinks.spiders', 'SPIDER_MODULES': ['httpbinlinks.spiders'], 'BOT_NAME': 'httpbinlinks'}
2016-03-08 15:16:30 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-03-08 15:16:30 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-03-08 15:16:30 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-03-08 15:16:30 [scrapy] INFO: Enabled item pipelines: 
2016-03-08 15:16:30 [scrapy] INFO: Spider opened
2016-03-08 15:16:30 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-03-08 15:16:30 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-03-08 15:16:30 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/links/10/0> (referer: None)
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching <GET https://httpbin.org/links/10/1>
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching <GET https://httpbin.org/links/10/2>
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching <GET https://httpbin.org/links/10/3>
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching <GET https://httpbin.org/links/10/4>
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching <GET https://httpbin.org/links/10/5>
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching url to https://httpbin.organisation/links/10/5
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching <GET https://httpbin.org/links/10/6>
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching <GET https://httpbin.org/links/10/7>
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching <GET https://httpbin.org/links/10/8>
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching <GET https://httpbin.org/links/10/9>
2016-03-08 15:16:31 [httpbin] DEBUG: add_errback: patching url to https://httpbin.organisation/links/10/9
2016-03-08 15:16:31 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/links/10/8> (referer: https://httpbin.org/links/10/0)
2016-03-08 15:16:31 [httpbin] ERROR: <twisted.python.failure.Failure twisted.internet.error.DNSLookupError: DNS lookup failed: address 'httpbin.organisation' not found: [Errno -5] No address associated with hostname.>
2016-03-08 15:16:31 [httpbin] ERROR: DNSLookupError on https://httpbin.organisation/links/10/5
2016-03-08 15:16:31 [httpbin] ERROR: <twisted.python.failure.Failure twisted.internet.error.DNSLookupError: DNS lookup failed: address 'httpbin.organisation' not found: [Errno -5] No address associated with hostname.>
2016-03-08 15:16:31 [httpbin] ERROR: DNSLookupError on https://httpbin.organisation/links/10/9
2016-03-08 15:16:31 [httpbin] INFO: parse_page: <200 https://httpbin.org/links/10/8>
2016-03-08 15:16:31 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/links/10/7> (referer: https://httpbin.org/links/10/0)
2016-03-08 15:16:31 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/links/10/6> (referer: https://httpbin.org/links/10/0)
2016-03-08 15:16:31 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/links/10/3> (referer: https://httpbin.org/links/10/0)
2016-03-08 15:16:31 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/links/10/4> (referer: https://httpbin.org/links/10/0)
2016-03-08 15:16:31 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/links/10/1> (referer: https://httpbin.org/links/10/0)
2016-03-08 15:16:31 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/links/10/2> (referer: https://httpbin.org/links/10/0)
2016-03-08 15:16:31 [httpbin] INFO: parse_page: <200 https://httpbin.org/links/10/7>
2016-03-08 15:16:31 [httpbin] INFO: parse_page: <200 https://httpbin.org/links/10/6>
2016-03-08 15:16:31 [httpbin] INFO: parse_page: <200 https://httpbin.org/links/10/3>
2016-03-08 15:16:31 [httpbin] INFO: parse_page: <200 https://httpbin.org/links/10/4>
2016-03-08 15:16:31 [httpbin] INFO: parse_page: <200 https://httpbin.org/links/10/1>
2016-03-08 15:16:31 [httpbin] INFO: parse_page: <200 https://httpbin.org/links/10/2>
2016-03-08 15:16:31 [scrapy] INFO: Closing spider (finished)
2016-03-08 15:16:31 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 2,
 'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 2,
 'downloader/request_bytes': 2577,
 'downloader/request_count': 10,
 'downloader/request_method_count/GET': 10,
 'downloader/response_bytes': 3968,
 'downloader/response_count': 8,
 'downloader/response_status_count/200': 8,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 3, 8, 14, 16, 31, 761515),
 'log_count/DEBUG': 20,
 'log_count/ERROR': 4,
 'log_count/INFO': 14,
 'request_depth_max': 1,
 'response_received_count': 8,
 'scheduler/dequeued': 10,
 'scheduler/dequeued/memory': 10,
 'scheduler/enqueued': 10,
 'scheduler/enqueued/memory': 10,
 'start_time': datetime.datetime(2016, 3, 8, 14, 16, 30, 427657)}
2016-03-08 15:16:31 [scrapy] INFO: Spider closed (finished)
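
If the goal is to collect the failing domains rather than just log them (as the question asks), one option is to have the errback record each host that fails DNS resolution and report the set when the spider closes. A minimal sketch along those lines, reusing the spider from the answer above (the subclass name and the failed_domains attribute are illustrative additions, not part of the original answer):

from urlparse import urlparse  # Python 3: from urllib.parse import urlparse

from twisted.internet.error import DNSLookupError


class DnsCollectingSpider(HttpbinSpider):
    name = "httpbin_dns"

    def __init__(self, *args, **kwargs):
        super(DnsCollectingSpider, self).__init__(*args, **kwargs)
        self.failed_domains = set()

    def errback_httpbin(self, failure):
        if failure.check(DNSLookupError):
            # keep only the host part of the URL that failed to resolve
            self.failed_domains.add(urlparse(failure.request.url).netloc)
            self.logger.error('DNSLookupError on %s', failure.request.url)

    def closed(self, reason):
        # called automatically when the crawl finishes
        self.logger.info('domains with DNS lookup failures: %s',
                         ', '.join(sorted(self.failed_domains)))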

Regarding python - Scrapy: get the websites that fail with "DNS lookup failed", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/35866873/
