I'm having a problem parsing a web page with Scrapy: my custom method is never called by Scrapy. The URL is http://www.duilian360.com/chunjie/117.html, and the code is:
import scrapy
from shufa.items import DuilianItem


class DuilianSpiderSpider(scrapy.Spider):
    name = 'duilian_spider'
    start_urls = [
        {"url": "http://www.duilian360.com/chunjie/117.html", "category_name": "春联", "group_name": "鼠年春联"},
    ]
    base_url = 'http://www.duilian360.com'

    def start_requests(self):
        for topic in self.start_urls:
            url = topic['url']
            yield scrapy.Request(url=url)

    def parse(self, response):
        div_list = response.xpath("//div[@class='contentF']/div[@class='content_l']/p")
        self.parse_paragraph(div_list)

    def parse_paragraph(self, div_list):
        for div in div_list:
            duilian_text_list = div.xpath('./text()').extract()
            for duilian_text in duilian_text_list:
                duilian_item = DuilianItem()
                duilian_item['category_id'] = 1
                duilian = duilian_text
                duilian_item['name'] = duilian
                duilian_item['desc'] = ''
                print('I reach here...')
                yield duilian_item
In the code above, the method parse_paragraph is never executed: the print statement produces no output, and even with a breakpoint set on it I cannot step into the method. However, if I move all of the code from parse_paragraph into the calling method parse, as shown below, everything works fine. Why?
# -*- coding: utf-8 -*-
import scrapy
from shufa.items import DuilianItem


class DuilianSpiderSpider(scrapy.Spider):
    name = 'duilian_spider'
    start_urls = [
        {"url": "http://www.duilian360.com/chunjie/117.html", "category_name": "春联", "group_name": "鼠年春联"},
    ]
    base_url = 'http://www.duilian360.com'

    def start_requests(self):
        for topic in self.start_urls:
            url = topic['url']
            yield scrapy.Request(url=url)

    def parse(self, response):
        div_list = response.xpath("//div[@class='contentF']/div[@class='content_l']/p")
        for div in div_list:
            duilian_text_list = div.xpath('./text()').extract()
            for duilian_text in duilian_text_list:
                duilian_item = DuilianItem()
                duilian_item['category_id'] = 1
                duilian = duilian_text
                duilian_item['name'] = duilian
                duilian_item['desc'] = ''
                print('I reach here...')
                yield duilian_item

    # def parse_paragraph(self, div_list):
    #     for div in div_list:
    #         duilian_text_list = div.xpath('./text()').extract()
    #         for duilian_text in duilian_text_list:
    #             duilian_item = DuilianItem()
    #             duilian_item['category_id'] = 1
    #             duilian = duilian_text
    #             duilian_item['name'] = duilian
    #             duilian_item['desc'] = ''
    #             print('I reach here...')
    #             yield duilian_item
My code has many custom methods, and I don't want to move all of their code into the calling methods. That is not good practice.
Best answer
I would use yield from instead of calling parse_paragraph directly. Because parse_paragraph contains a yield statement, calling it only creates a generator object, which is then discarded without ever running the method body; yield from delegates to that generator, so the items/requests it produces are actually yielded from parse.
def parse(self, response):
    div_list = response.xpath("//div[@class='contentF']/div[@class='content_l']/p")
    yield from self.parse_paragraph(div_list)
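This behaviour is plain Python generator semantics, not anything Scrapy-specific. A minimal sketch (with hypothetical names `inner`, `broken_outer`, `fixed_outer`) of why a bare call to a generator function appears to do nothing:

```python
def inner():
    # This body only runs when the generator is actually iterated.
    print("inner body runs")
    yield 1
    yield 2


def broken_outer():
    # Mirrors the original parse(): creates a generator object and
    # immediately discards it, so inner's body never executes.
    inner()
    yield from ()  # yields nothing


def fixed_outer():
    # Mirrors the fix: delegate to the generator, so its items are
    # yielded from here and its body runs as they are consumed.
    yield from inner()


print(list(broken_outer()))  # [] — and "inner body runs" is never printed
print(list(fixed_outer()))   # [1, 2]
```

The same reasoning explains why the breakpoint inside parse_paragraph was never hit: the debugger can only stop there once the generator is iterated, which `yield from` (or looping over `self.parse_paragraph(div_list)` and re-yielding each item) makes happen.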
Regarding python - Scrapy, custom method not called, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58776085/