python - Scrapy appends %0A to URLs, causing them to fail

Tags: python encoding scrapy http-status-code-404 python-requests

I'm just about at my wits' end with this. Basically I have a URL that seems to behave strangely. Specifically, it's this one:

https://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031

When I use requests, everything works fine:

import requests
test = requests.get("https://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031")
<Response [200]>

However, when I use Scrapy, the following line pops up:

Crawled (404) <GET https://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031%0A>

I even tried updating my user-agent string, to no avail. Part of me worries that the %0A in the URL encoding is to blame, but that seems strange, and I can't find any documentation on how to fix it.

For reference, this is how I send the requests, although I'm not sure it adds much information:

for url in review_urls:
    yield scrapy.Request(url, callback=self.get_review_urls)

It's worth noting that this is the exception rather than the rule. Most of the URLs work without a hitch, but these edge cases are not uncommon.

Best Answer

I don't think this is a Scrapy problem; I suspect something is wrong with your review_urls.

See this demonstration from the scrapy shell. Somehow your URL ends with a newline, and during URL encoding (docs here) that \n is converted to %0A. It seems you accidentally appended a newline to the URL, or the extracted URLs contain an extra trailing newline.
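You can see the conversion directly with Python's standard library percent-encoder (shown here with Python 3's urllib.parse as an illustration; Scrapy's own URL canonicalization behaves the same way for the newline character):

```python
from urllib.parse import quote

# A trailing newline, often left over from scraping link text,
# gets percent-encoded as %0A.
url = "https://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031\n"

# "safe" keeps the normal URL delimiters unescaped so only the
# stray newline is transformed.
encoded = quote(url, safe=":/?&=")
print(encoded)  # ...node=3010075031%0A
```

The %0A at the end of the encoded URL is exactly what shows up in the 404 log line.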

scrapy shell 'https://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031'
2015-08-02 05:48:56 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-08-02 05:48:56 [scrapy] INFO: Optional features available: ssl, http11
2015-08-02 05:48:56 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2015-08-02 05:48:56 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, CoreStats, SpiderState
2015-08-02 05:48:56 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-08-02 05:48:56 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-08-02 05:48:56 [scrapy] INFO: Enabled item pipelines: 
2015-08-02 05:48:56 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-08-02 05:48:56 [scrapy] INFO: Spider opened
2015-08-02 05:48:58 [scrapy] DEBUG: Redirecting (302) to <GET http://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031> from <GET https://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031>
2015-08-02 05:48:59 [scrapy] DEBUG: Crawled (200) <GET http://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fe36d76fbd0>
[s]   item       {}
[s]   request    <GET https://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031>
[s]   response   <200 http://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031>
[s]   settings   <scrapy.settings.Settings object at 0x7fe365b91c50>
[s]   spider     <DefaultSpider 'default' at 0x7fe36420d110>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
2015-08-02 05:48:59 [root] DEBUG: Using default logger
2015-08-02 05:48:59 [root] DEBUG: Using default logger

In [1]: url = 'https://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031\n'

In [2]: fetch(url)
2015-08-02 05:49:24 [scrapy] DEBUG: Crawled (404) <GET https://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031%0A> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fe36d76fbd0>
[s]   item       {}
[s]   request    <GET https://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031%0A>
[s]   response   <404 https://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031%0A>
[s]   settings   <scrapy.settings.Settings object at 0x7fe365b91c50>
[s]   spider     <DefaultSpider 'default' at 0x7fe36420d110>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

Calling strip() on the URL before you issue the request will give you the result you want, as shown below:

In [3]: fetch(url.strip())
2015-08-02 05:53:01 [scrapy] DEBUG: Redirecting (302) to <GET http://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031> from <GET https://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031>
2015-08-02 05:53:03 [scrapy] DEBUG: Crawled (200) <GET http://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fe36d76fbd0>
[s]   item       {}
[s]   request    <GET https://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031>
[s]   response   <200 http://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031>
[s]   settings   <scrapy.settings.Settings object at 0x7fe365b91c50>
[s]   spider     <DefaultSpider 'default' at 0x7fe36420d110>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
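The same fix applies inside the spider. A minimal sketch, assuming the URLs come in as a list of strings (clean_urls is a hypothetical helper name, not part of Scrapy):

```python
def clean_urls(urls):
    """Strip surrounding whitespace (including trailing newlines)
    from extracted URLs and drop any entries that end up empty."""
    return [u.strip() for u in urls if u.strip()]

raw = [
    "https://www.amazon.de/Instant-Video/b?ie=UTF8&node=3010075031\n",
    "  https://www.amazon.de/dp/B00EXAMPLE \n",   # hypothetical URL
]

for url in clean_urls(raw):
    print(url)   # no trailing newline, so no %0A in the request
```

In your spider loop that simply becomes `yield scrapy.Request(url.strip(), callback=self.get_review_urls)`.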

Regarding "python - Scrapy appends %0A to URLs, causing them to fail", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/31767333/
