python - Scrapy, custom method not called

Tags: python scrapy

I ran into a problem while parsing a web page with Scrapy: my custom method is never called. The URL is http://www.duilian360.com/chunjie/117.html , and the code is:

import scrapy
from shufa.items import DuilianItem

class DuilianSpiderSpider(scrapy.Spider):
    name = 'duilian_spider'
    start_urls = [
        {"url": "http://www.duilian360.com/chunjie/117.html", "category_name": "春联", "group_name": "鼠年春联"},
    ]
    base_url = 'http://www.duilian360.com'

    def start_requests(self):
        for topic in self.start_urls:
            url = topic['url']
            yield scrapy.Request(url=url)

    def parse(self, response):
        div_list = response.xpath("//div[@class='contentF']/div[@class='content_l']/p")
        self.parse_paragraph(div_list)

    def parse_paragraph(self, div_list):
        for div in div_list:
            duilian_text_list = div.xpath('./text()').extract()
            for duilian_text in duilian_text_list:
                duilian_item = DuilianItem()
                duilian_item['category_id'] = 1
                duilian = duilian_text
                duilian_item['name'] = duilian
                duilian_item['desc'] = ''
                print('I reach here...')
                yield duilian_item

In the code above, the method parse_paragraph is never executed: the print statement produces no output, and even with a breakpoint set on the print line I cannot step into the method.
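The symptom can be reproduced in plain Python, outside Scrapy; this is a minimal sketch with made-up names, not the spider code itself. Because a function containing yield is a generator function, calling it only creates a generator object and its body does not run until the generator is iterated:

```python
def make_items():
    # Body of a generator function: runs only when the generator is iterated
    print('I reach here...')
    yield 'item'

result = make_items()  # creates a generator object; nothing is printed yet
print(list(result))    # iterating it finally runs the body and prints the line
```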

However, if I move all the code from parse_paragraph into the calling method parse, as below, everything works. Why?

# -*- coding: utf-8 -*-
import scrapy
from shufa.items import DuilianItem

class DuilianSpiderSpider(scrapy.Spider):
    name = 'duilian_spider'
    start_urls = [
        {"url": "http://www.duilian360.com/chunjie/117.html", "category_name": "春联", "group_name": "鼠年春联"},
    ]
    base_url = 'http://www.duilian360.com'

    def start_requests(self):
        for topic in self.start_urls:
            url = topic['url']
            yield scrapy.Request(url=url)

    def parse(self, response):
        div_list = response.xpath("//div[@class='contentF']/div[@class='content_l']/p")
        for div in div_list:
            duilian_text_list = div.xpath('./text()').extract()
            for duilian_text in duilian_text_list:
                duilian_item = DuilianItem()
                duilian_item['category_id'] = 1
                duilian = duilian_text
                duilian_item['name'] = duilian
                duilian_item['desc'] = ''
                print('I reach here...')
                yield duilian_item

    # def parse_paragraph(self, div_list):
    #     for div in div_list:
    #         duilian_text_list = div.xpath('./text()').extract()
    #         for duilian_text in duilian_text_list:
    #             duilian_item = DuilianItem()
    #             duilian_item['category_id'] = 1
    #             duilian = duilian_text
    #             duilian_item['name'] = duilian
    #             duilian_item['desc'] = ''
    #             print('I reach here...')
    #             yield duilian_item

My code has many custom methods like this, and I don't want to move all of their code into the calling method. That is not good practice.

Best Answer

I would use yield from instead of calling parse_paragraph directly. Because parse_paragraph contains yield, it is a generator function: calling it merely returns a generator object, and the items/requests it would produce never reach Scrapy unless the caller delegates to it.

    def parse(self, response):
        div_list = response.xpath("//div[@class='contentF']/div[@class='content_l']/p")
        yield from self.parse_paragraph(div_list)
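As a sketch outside Scrapy (with hypothetical helper names), the difference is that a bare call discards the generator while yield from forwards every item to the caller:

```python
def helper(data):
    # Generator helper, standing in for parse_paragraph
    for x in data:
        yield x * 2

def process_wrong(data):
    helper(data)              # generator created and silently discarded

def process_right(data):
    yield from helper(data)   # each yielded item is delegated to the caller

assert process_wrong([1, 2]) is None          # nothing ever ran inside helper
assert list(process_right([1, 2])) == [2, 4]  # items flow through as expected
```

The same effect can be had with `for item in self.parse_paragraph(div_list): yield item`; yield from is the more concise form.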

About "python - Scrapy, custom method not called": we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58776085/
