python - How to feed a spider with links crawled within the spider?

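I am writing a spider (a CrawlSpider) for an online store. Per the client's requirements, I need two rules: one that determines which pages contain items, and another that extracts the items.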

Each of my two rules already works on its own:

If I set start_urls = ["www.example.com/books.php", "www.example.com/movies.php"] and comment out the Rule and the code for parse_category, my parse_item extracts every item.
Conversely, if I set start_urls = "http://www.example.com" and comment out the Rule and the code for parse_item, parse_category returns every link that has items to extract, i.e. it returns www.example.com/books.php and www.example.com/movies.php.

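My problem is that I don't know how to combine the two parts, so that with start_urls = "http://www.example.com" parse_category extracts www.example.com/books.php and www.example.com/movies.php and feeds those links to parse_item, where I actually extract each item's data.

I need a way to do this rather than simply hard-coding start_urls = ["www.example.com/books.php", "www.example.com/movies.php"], because if a new category is added in the future (for example www.example.com/music.php), the spider would not detect it automatically and would have to be edited by hand. That is not a big deal, but the client does not want it.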

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# StoreCategory and StoreItem are the project's own Item classes.


class StoreSpider(CrawlSpider):
    name = "storyder"

    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]
    # start_urls = ["http://www.example.com/books.php", "http://www.example.com/movies.php"]

    rules = (
        Rule(LinkExtractor(), follow=True, callback="parse_category"),
        Rule(LinkExtractor(), follow=False, callback="parse_item"),
    )

    def parse_category(self, response):
        category = StoreCategory()
        # some code for determining whether the current page is a category or something else
        if is_category:  # placeholder for the result of the check above
            category["name"] = name
            category["url"] = response.url
        return category

    def parse_item(self, response):
        item = StoreItem()
        # some code for extracting the item's data
        return item

Best answer

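CrawlSpider rules do not work the way you want here, so you will need to implement the logic yourself. A rule with follow=True is meant to keep collecting links while the rules are followed, not to produce items, so pairing it with an item callback does not give you the chaining you are after. (Scrapy does accept a callback together with follow=True, but when several rules match the same link only the first one is applied, so with two identical LinkExtractor() rules parse_item is never called.) See the documentation on CrawlSpider rules.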
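One documented detail of CrawlSpider matters for the code in the question: when several rules match the same link, only the first rule in `rules` is applied. The toy model below is not Scrapy code. It is a small stdlib-only sketch of that "first match wins" behavior, using regular expressions in place of link extractors. The `item.php?id=1` URL and both patterns in the second call are made up for illustration.

```python
import re


# Toy stand-in for rule dispatch in a CrawlSpider; not Scrapy's actual code.
def assign_links(links, rules):
    """Map each link to the callback of the first rule whose pattern matches it."""
    assigned = {}
    for pattern, callback in rules:  # rules are tried in declaration order
        for link in links:
            if link not in assigned and re.search(pattern, link):
                assigned[link] = callback
    return assigned


links = [
    "http://www.example.com/books.php",
    "http://www.example.com/item.php?id=1",  # hypothetical item URL
]

# Two catch-all rules, like the two bare LinkExtractor() rules in the question:
# the first rule claims every link, so the second callback is never used.
same = assign_links(links, [(r".*", "parse_category"), (r".*", "parse_item")])

# Hypothetical distinct patterns: each rule now claims a different set of links.
distinct = assign_links(links, [(r"item\.php", "parse_item"), (r"\.php", "parse_category")])
```

In this model `same` sends every link to `parse_category`, while `distinct` routes the item URL to `parse_item`. Whether URL patterns are enough for the real store depends on how its category and item pages are addressed, which the question does not say.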

You could try something like this:

from scrapy import Request, Spider
from scrapy.linkextractors import LinkExtractor

# StoreCategory and StoreItem are the project's own Item classes.


class StoreSpider(Spider):  # plain Spider, since no rules are used and parse() is written by hand
    name = "storyder"

    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]
    # no rules

    def parse(self, response):
        # Entry point: schedule candidate category pages and item pages.
        category_le = LinkExtractor("something for categories")  # placeholder pattern
        for a in category_le.extract_links(response):
            yield Request(a.url, callback=self.parse_category)
        item_le = LinkExtractor("something for items")  # placeholder pattern
        for a in item_le.extract_links(response):
            yield Request(a.url, callback=self.parse_item)

    def parse_category(self, response):
        category = StoreCategory()
        # some code for determining whether the current page is a category or something else
        if is_category:  # placeholder for the result of the check above
            category["name"] = name
            category["url"] = response.url
            yield category
        # Search this page for further category and item links as well.
        yield from self.parse(response)

    def parse_item(self, response):
        item = StoreItem()
        # some code for extracting the item's data
        return item
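
To see the control flow of this approach without running a crawl, here is a Scrapy-free sketch over a hypothetical in-memory site. The `SITE` dictionary, its URLs, and the `crawl` function are all invented for illustration. The loop only mirrors the idea above: every non-item page contributes new links, category pages are recorded, and item pages are collected.

```python
from collections import deque

# Hypothetical in-memory site: page URL -> (kind of page, links found on it).
SITE = {
    "/": ("home", ["/books.php", "/movies.php"]),
    "/books.php": ("category", ["/item/1", "/item/2"]),
    "/movies.php": ("category", ["/item/3"]),
    "/item/1": ("item", []),
    "/item/2": ("item", []),
    "/item/3": ("item", []),
}


def crawl(start):
    """Breadth-first walk that records category pages and collects item pages."""
    seen, queue = {start}, deque([start])
    categories, items = [], []
    while queue:
        url = queue.popleft()
        kind, links = SITE[url]
        if kind == "item":
            items.append(url)  # the role parse_item plays
            continue
        if kind == "category":
            categories.append(url)  # the role parse_category plays
        for link in links:  # the role parse plays: schedule every unseen link
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return categories, items


categories, items = crawl("/")
```

Because categories are discovered from the home page rather than listed in advance, adding a `/music.php` entry to `SITE` and linking it from `"/"` would be picked up without touching `crawl`, which is the behavior the question asks for.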

Source: a similar question on Stack Overflow, "How to feed a spider with links crawled within the spider?": https://stackoverflow.com/questions/33469129/
