python - Scrapy: Crawled 0 pages (works in scrapy shell but not with the scrapy crawl spider command)

Tags: python scrapy

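I'm having some trouble with Scrapy: it returns no results. I tried copying and pasting the spider below into the scrapy shell, and it works there. I'm really not sure what the problem is, but when I run it with "scrapy crawl rxomega", it doesn't work.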

from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from iherb.items import IherbItem

class RxomegaSpider(CrawlSpider):
    name = 'rxomega'
    allowed_domains = ['http://www.iherb.com/']
    start_urls = ['http://www.iherb.com/product-reviews/Natural-Factors-RxOmega-3-Factors-EPA-400-mg-DHA-200-mg-240-Softgels/4251/',
            'http://www.iherb.com/product-reviews/Now-Foods-Omega-3-Cardiovascular-Support-200-Softgels/323/']
    #rules = (
    #    Rule(SgmlLinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    #)

    def parse_item(self, response):
        print('hello')
        sel = Selector(response)
        sites = sel.xpath('//*[@id="mainContent"]/div[3]/div[2]/div')
        items = []
        for site in sites:
            i = IherbItem()
            i['review'] = site.xpath('div[5]/p/text()').extract()
            items.append(i)
        return items

The output I see is... scrapy crawl rxomega

2014-02-16 17:00:55-0800 [scrapy] INFO: Scrapy 0.22.0 started (bot: iherb)
2014-02-16 17:00:55-0800 [scrapy] INFO: Optional features available: ssl, http11, django
2014-02-16 17:00:55-0800 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'iherb.spiders', 'SPIDER_MODULES': ['iherb.spiders'], 'BOT_NAME': 'iherb'}
2014-02-16 17:00:55-0800 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-02-16 17:00:55-0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-02-16 17:00:55-0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-02-16 17:00:55-0800 [scrapy] INFO: Enabled item pipelines:
2014-02-16 17:00:55-0800 [rxomega] INFO: Spider opened
2014-02-16 17:00:55-0800 [rxomega] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-02-16 17:00:55-0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6026
2014-02-16 17:00:55-0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6083
2014-02-16 17:00:55-0800 [rxomega] DEBUG: Crawled (200) <GET http://www.iherb.com/product-reviews/Natural-Factors-RxOmega-3-Factors-EPA-400-mg-DHA-200-mg-240-Softgels/4251/> (referer: None)
2014-02-16 17:00:56-0800 [rxomega] DEBUG: Crawled (200) <GET http://www.iherb.com/product-reviews/Now-Foods-Omega-3-Cardiovascular-Support-200-Softgels/323/> (referer: None)
2014-02-16 17:00:56-0800 [rxomega] INFO: Closing spider (finished)
2014-02-16 17:00:56-0800 [rxomega] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 588,
     'downloader/request_count': 2,
     'downloader/request_method_count/GET': 2,
     'downloader/response_bytes': 37790,
     'downloader/response_count': 2,
     'downloader/response_status_count/200': 2,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2014, 2, 17, 1, 0, 56, 22065),
     'log_count/DEBUG': 4,
     'log_count/INFO': 7,
     'response_received_count': 2,
     'scheduler/dequeued': 2,
     'scheduler/dequeued/memory': 2,
     'scheduler/enqueued': 2,
     'scheduler/enqueued/memory': 2,
     'start_time': datetime.datetime(2014, 2, 17, 1, 0, 55, 256404)}
2014-02-16 17:00:56-0800 [rxomega] INFO: Spider closed (finished)

Best answer

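The genspider command created a CrawlSpider with a parse_item callback, but the tutorial uses Spider and parse. Both are for version 0.22. With the rules commented out, nothing ever routes a response to parse_item, because responses for start_urls go to the parse method by default. That is why both pages are fetched with status 200 while 0 items are scraped. Changing the code above to subclass Spider and renaming parse_item to parse makes it work.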

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/21819055/
