python - "File (code: 302): Error downloading file" in Scrapy's files pipeline

Tags: python, scrapy

I am trying to crawl with the following spider:

import scrapy
from apkmirror.items import ApkmirrorItem


class ApkmirrorScraperSpider(scrapy.Spider):
    name = "apkmirror-scraper"
    allowed_domains = ["apkmirror.com"]

    custom_settings = {'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36'}

    start_urls = ['https://www.apkmirror.com/apk/google-inc/youtube/youtube-12-19-56-release/youtube-12-19-56-android-apk-download/']

    def parse(self, response):
        item = ApkmirrorItem()
        download_page_url = response.urljoin("download/")       # We assume that the 'actual' download page follows this naming convention. (This could also be extracted using response.css('.downloadButton').xpath('.//@href')).
        request = scrapy.Request(download_page_url, callback=self.parse_download_page)
        request.meta['item'] = item
        yield request

    def parse_download_page(self, response):
        '''Get the alternative download link from the 'actual' download page.'''
        item = response.meta['item']
        download_relative_url = response.xpath('//*[contains(text(), "Your download will start immediately.")]/a/@href').extract_first()
        download_url = response.urljoin(download_relative_url)
        item['file_urls'] = [download_url]
        yield item
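(As an aside, `response.urljoin(...)` resolves relative URLs using the same rules as the standard library's `urllib.parse.urljoin`, so the naming-convention assumption in `parse` can be checked outside Scrapy with a quick sketch like this:)

```python
from urllib.parse import urljoin

# The page URL ends with a slash, so the relative path "download/" is
# simply appended to it rather than replacing the last path segment.
page = ("https://www.apkmirror.com/apk/google-inc/youtube/"
        "youtube-12-19-56-release/youtube-12-19-56-android-apk-download/")
print(urljoin(page, "download/"))
# → .../youtube-12-19-56-android-apk-download/download/
```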

where items.py is

import scrapy

class ApkmirrorItem(scrapy.Item):
    file_urls = scrapy.Field()
    files = scrapy.Field()

and settings.py includes the activation of the files pipeline:

ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1
}

FILES_STORE = '/tmp/apkmirror_test/files'

However, I get a WARNING due to a 302 redirect, as shown in the log:

2017-05-23 12:13:51 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: apkmirror)
2017-05-23 12:13:51 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'apkmirror', 'NEWSPIDER_MODULE': 'apkmirror.spiders', 'SPIDER_MODULES': ['apkmirror.spiders']}
2017-05-23 12:13:52 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2017-05-23 12:13:52 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-05-23 12:13:52 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-05-23 12:13:52 [scrapy.middleware] INFO: Enabled item pipelines:
['scrapy.pipelines.files.FilesPipeline']
2017-05-23 12:13:52 [scrapy.core.engine] INFO: Spider opened
2017-05-23 12:13:52 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-05-23 12:13:52 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-05-23 12:13:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.apkmirror.com/apk/google-inc/youtube/youtube-12-19-56-release/youtube-12-19-56-android-apk-download/> (referer: None)
2017-05-23 12:13:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.apkmirror.com/apk/google-inc/youtube/youtube-12-19-56-release/youtube-12-19-56-android-apk-download/download/> (referer: https://www.apkmirror.com/apk/google-inc/youtube/youtube-12-19-56-release/youtube-12-19-56-android-apk-download/)
2017-05-23 12:13:58 [scrapy.core.engine] DEBUG: Crawled (302) <GET https://www.apkmirror.com/wp-content/themes/APKMirror/download.php?id=215041> (referer: None)
2017-05-23 12:13:58 [scrapy.pipelines.files] WARNING: File (code: 302): Error downloading file from <GET https://www.apkmirror.com/wp-content/themes/APKMirror/download.php?id=215041> referred in <None>
2017-05-23 12:13:59 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.apkmirror.com/apk/google-inc/youtube/youtube-12-19-56-release/youtube-12-19-56-android-apk-download/download/>
{'file_urls': ['https://www.apkmirror.com/wp-content/themes/APKMirror/download.php?id=215041'],
 'files': []}
2017-05-23 12:13:59 [scrapy.core.engine] INFO: Closing spider (finished)
2017-05-23 12:13:59 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1336,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 62710,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 2,
 'downloader/response_status_count/302': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 5, 23, 12, 13, 59, 51739),
 'item_scraped_count': 1,
 'log_count/DEBUG': 5,
 'log_count/INFO': 7,
 'log_count/WARNING': 1,
 'memusage/max': 47157248,
 'memusage/startup': 47157248,
 'request_depth_max': 1,
 'response_received_count': 3,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2017, 5, 23, 12, 13, 52, 187141)}
2017-05-23 12:13:59 [scrapy.core.engine] INFO: Spider closed (finished)

and the file is not downloaded.

There appears to be an issue about this (https://github.com/scrapy/scrapy/issues/2004) that should have been fixed in Scrapy version 1.4.0. However, I'm fairly sure I'm running 1.4.0, yet I still get this error. How can I resolve it?

Additional information: I found it instructive to run the command

scrapy shell https://www.apkmirror.com/wp-content/themes/APKMirror/download.php?id=215041 -s USER_AGENT="Mozilla"

which produces the following log before launching the Scrapy shell:

2017-05-23 13:56:10 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
2017-05-23 13:56:10 [scrapy.utils.log] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0, 'USER_AGENT': 'Mozilla'}
2017-05-23 13:56:10 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole']
2017-05-23 13:56:10 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-05-23 13:56:10 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-05-23 13:56:10 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-05-23 13:56:10 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-05-23 13:56:10 [scrapy.core.engine] INFO: Spider opened
2017-05-23 13:56:11 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.apkmirror.com/wp-content/uploads/uploaded/591e9ab20113f/com.google.android.youtube_12.19.56-1219563340_minAPI21(armeabi-v7a)(480dpi)_apkmirror.com.apk> from <GET https://www.apkmirror.com/wp-content/themes/APKMirror/download.php?id=215041>
2017-05-23 13:56:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.apkmirror.com/wp-content/uploads/uploaded/591e9ab20113f/com.google.android.youtube_12.19.56-1219563340_minAPI21(armeabi-v7a)(480dpi)_apkmirror.com.apk> (referer: None)
2017-05-23 13:56:17 [traitlets] DEBUG: Using default logger
2017-05-23 13:56:17 [traitlets] DEBUG: Using default logger
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f67f9424438>
[s]   item       {}
[s]   request    <GET https://www.apkmirror.com/wp-content/themes/APKMirror/download.php?id=215041>
[s]   response   <200 https://www.apkmirror.com/wp-content/uploads/uploaded/591e9ab20113f/com.google.android.youtube_12.19.56-1219563340_minAPI21(armeabi-v7a)(480dpi)_apkmirror.com.apk>
[s]   settings   <scrapy.settings.Settings object at 0x7f67f0ae19b0>
[s]   spider     <DefaultSpider 'default' at 0x7f67f06ddbe0>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects 
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
In [1]:

The log shows that the given `download.php?id=215041` URL gets redirected to https://www.apkmirror.com/wp-content/uploads/uploaded/591e9ab20113f/com.google.android.youtube_12.19.56-1219563340_minAPI21(armeabi-v7a)(480dpi)_apkmirror.com.apk, which is the actual file I want to download. Perhaps the `file_urls` requests should be redirected in a similar way?

Best answer

According to the documentation (https://doc.scrapy.org/en/latest/topics/media-pipeline.html#allowing-redirections), you have to set

MEDIA_ALLOW_REDIRECTS = True

in settings.py. This worked for me.
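Putting it together, a minimal settings.py for this project would combine the pipeline settings already shown in the question with the new flag (paths and priorities are the ones from the question):

```python
# settings.py — enable the files pipeline and let it follow redirects
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}

FILES_STORE = '/tmp/apkmirror_test/files'

# Without this flag, redirects are not followed for media requests, so the
# pipeline receives the 302 response itself, rejects the non-200 status, and
# logs "File (code: 302): Error downloading file".
MEDIA_ALLOW_REDIRECTS = True
```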

Regarding python - "File (code: 302): Error downloading file" in Scrapy's files pipeline, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/44134908/

Related articles:

python - How to split a repeating string at specified positions other than at a particular character?

python - Great Expectations with Azure and Databricks

python - Unit testing a Flask app running under uwsgi

python - NetworkX: how to access the attributes of objects used as nodes

python - Proxy authentication with PhantomJS

python - Implementing scrapy rules by overriding the CrawlSpider __init__() method

python - Getting the column count of a 2D array

python - Preventing Scrapy from generating empty files when there are no results

python - InterfaceError: (sqlite3.InterfaceError) Error binding parameter 0

python - Correct way to collect data from multiple sources for a single item