python - How to parse multiple sub-pages, merge/append, and pass back up to the parent?

Tags: python web-scraping scrapy

This is my first Scrapy project, and admittedly one of my first exercises with Python. I'm looking for a way to scrape multiple child pages, merge/append their content into a single value, and pass that data back/up to the original parent page. The number of child pages per parent is also variable: it may be as few as 1, but will never be 0 (which might help with error handling?). In addition, child pages can repeat and reappear, since they are not unique to a single parent. I've managed to pass parent-page metadata down to the corresponding child pages, but I'm stuck on accomplishing the reverse.

Here is an example page structure:

Top Level Domain
     - Pagination/Index Page #1 (parse recipe links)
          - Recipe #1 (select info & parse ingredient links)
               - Ingredient #1 (select info)
               - Ingredient #2 (select info)
               - Ingredient #3 (select info)
          - Recipe #2
               - Ingredient #1
          - Recipe #3
               - Ingredient #1
               - Ingredient #2
     - Pagination/Index Page #2
          - Recipe #N
               - Ingredient #N
               - ...
     - Pagination/Index Page #3
     - ... continued

The output I'm looking for (one per recipe) looks like this:

{
"recipe_title": "Gin & Tonic",
"recipe_posted_date": "May 2, 2019",
"recipe_url": "www.XYZ.com/gandt.html",
"recipe_instructions": "<block of text here>",
"recipe_ingredients": ["gin", "tonic water", "lime wedge"],
"recipe_calorie_total": "135 calories",
"recipe_calorie_list": ["60 calories", "70 calories", "5 calories"]
}

I'm extracting each ingredient's URL from the corresponding recipe page. I need to grab the calorie count from each ingredient page, merge it with the calorie counts of the other ingredients, and ideally produce a single item. Since a given ingredient is not exclusive to a single recipe, I also need to be able to revisit ingredient pages later in the crawl.

(Note: this isn't a real example, since calorie counts would obviously vary with the amount a recipe calls for.)

The code I've posted below gets me close to what I'm after, but I have to imagine there's a more elegant way to approach the problem. The posted code successfully passes a recipe's metadata down to the ingredient level, loops through the ingredients, and appends the calorie counts. Since the information is passed downward, I end up yielding at the ingredient level, creating many duplicates of each recipe (one per ingredient) until the last ingredient has been looped through. At this point I was considering adding an ingredient index number so that I could somehow keep, for each recipe URL, only the record with the highest ingredient index. Before going down that path, I thought I'd ask the pros here for guidance.

Spider code:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from recipe_scraper.items import RecipeItem

class RecipeSpider(CrawlSpider):
    name = 'Recipe'
    allowed_domains = ['www.example.com']
    start_urls = ['https://www.example.com/recipes/']
    rules = (
        Rule(
            LinkExtractor(
                allow=()
                ,restrict_css=('.pagination')
                ,unique=True
            )
            ,callback='parse_index_page'
            ,follow=True
        ),
    )

    def parse_index_page(self, response):
        print('Processing Index Page.. ' + response.url)
        index_url = response.url
        recipe_urls = response.css('.recipe > a::attr(href)').getall()
        for a in recipe_urls:
            request = scrapy.Request(a, callback=self.parse_recipe_page)
            request.meta['index_url'] = index_url
            yield request

    def parse_recipe_page(self, response):
        print('Processing Recipe Page.. ' + response.url)
        Recipe_url = response.url
        Recipe_title = response.css('.Recipe_title::text').extract()[0]
        Recipe_posted_date = response.css('.Recipe_posted_date::text').extract()[0]
        Recipe_instructions = response.css('.Recipe_instructions::text').extract()[0]
        Recipe_ingredients = response.xpath('//ul[@class="ingredients"]//li[@class="ingredient"]/a/text()').getall()
        Recipe_ingredient_urls = response.xpath('//ul[@class="ingredients"]//li[@class="ingredient"]/a/@href').getall()
        Recipe_calorie_list_append = []
        Recipe_calorie_list = []
        Recipe_calorie_total = []
        Recipe_item = RecipeItem()
        Recipe_item['index_url'] = response.meta["index_url"]
        Recipe_item['Recipe_url'] = Recipe_url
        Recipe_item['Recipe_title'] = Recipe_title
        Recipe_item['Recipe_posted_date'] = Recipe_posted_date
        Recipe_item['Recipe_instructions'] = Recipe_instructions
        Recipe_item['Recipe_ingredients'] = Recipe_ingredients
        Recipe_item['Recipe_ingredient_urls'] = Recipe_ingredient_urls
        Recipe_item['Recipe_ingredient_url_count'] = len(Recipe_ingredient_urls)
        Recipe_calorie_list.clear()
        Recipe_ingredient_url_index = 0
        while Recipe_ingredient_url_index < len(Recipe_ingredient_urls):
            ingredient_request = scrapy.Request(Recipe_ingredient_urls[Recipe_ingredient_url_index], callback=self.parse_ingredient_page, dont_filter=True)
            ingredient_request.meta['Recipe_item'] = Recipe_item
            ingredient_request.meta['Recipe_calorie_list'] = Recipe_calorie_list
            yield ingredient_request
            Recipe_calorie_list_append.append(Recipe_calorie_list)
            Recipe_ingredient_url_index += 1

    def parse_ingredient_page(self, response):
        print('Processing Ingredient Page.. ' + response.url)
        Recipe_item = response.meta['Recipe_item']
        Recipe_calorie_list = response.meta["Recipe_calorie_list"]
        ingredient_url = response.url
        ingredient_calorie_total = response.css('div.calorie::text').getall()
        Recipe_calorie_list.append(ingredient_calorie_total)
        Recipe_item['Recipe_calorie_list'] = Recipe_calorie_list
        yield Recipe_item
        Recipe_calorie_list.clear()

My less-than-ideal output actually looks like the following (note the calorie lists):

{
"recipe_title": "Gin & Tonic",
"recipe_posted_date": "May 2, 2019",
"recipe_url": "www.XYZ.com/gandt.html",
"recipe_instructions": "<block of text here>",
"recipe_ingredients": ["gin", "tonic water", "lime wedge"],
"recipe_calorie_total": [],
"recipe_calorie_list": ["60 calories"]
},
{
"recipe_title": "Gin & Tonic",
"recipe_posted_date": "May 2, 2019",
"recipe_url": "www.XYZ.com/gandt.html",
"recipe_instructions": "<block of text here>",
"recipe_ingredients": ["gin", "tonic water", "lime wedge"],
"recipe_calorie_total": [],
"recipe_calorie_list": ["60 calories", "70 calories"]
},
{
"recipe_title": "Gin & Tonic",
"recipe_posted_date": "May 2, 2019",
"recipe_url": "www.XYZ.com/gandt.html",
"recipe_instructions": "<block of text here>",
"recipe_ingredients": ["gin", "tonic water", "lime wedge"],
"recipe_calorie_total": [],
"recipe_calorie_list": ["60 calories", "70 calories", "5 calories"]
}

Best Answer

One solution is to scrape recipes and ingredients separately, as different items, and then do some post-processing once the crawl has finished, e.g. with regular Python, to merge the recipe and ingredient data as needed. This is the most efficient solution.
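
For instance, here is a minimal post-crawl merge sketch. It assumes the items were exported as two JSON-lines files (recipes.jl and ingredients.jl are hypothetical names, e.g. produced with scrapy crawl ... -o), that each recipe item carries the Recipe_ingredient_urls field from the question, and that each ingredient item carries hypothetical ingredient_url and ingredient_calories fields:

import json

# Hypothetical file names; assumes items were exported as JSON lines.
with open('ingredients.jl') as f:
    # Map each ingredient URL to its calorie count (scraped once per URL).
    calories_by_url = {
        item['ingredient_url']: item['ingredient_calories']
        for item in map(json.loads, f)
    }

with open('recipes.jl') as f:
    recipes = [json.loads(line) for line in f]

for recipe in recipes:
    # Resolve each recipe's ingredient URLs against the lookup table.
    recipe['recipe_calorie_list'] = [
        calories_by_url[url] for url in recipe['Recipe_ingredient_urls']
    ]

Since each ingredient page is scraped only once, this also avoids the repeated requests discussed at the end of this answer.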

Alternatively, instead of yielding requests for all of the ingredients at once, you could extract the ingredient URLs from the recipe response, yield a request for the first ingredient only, and store the remaining ingredient URLs in the new request's meta, along with the recipe item. When an ingredient response comes in, you parse the information you need into the recipe item and yield a new request for the next ingredient URL. Once there are no ingredient URLs left, you yield the completed recipe item.

For example:

from scrapy import Request

def _handle_next_ingredient(self, recipe, ingredient_urls):
    try:
        return Request(
            ingredient_urls.pop(),
            callback=self.parse_ingredient,
            meta={'recipe': recipe, 'ingredient_urls': ingredient_urls},
        )
    except IndexError:
        return recipe

def parse_recipe(self, response):
    recipe, ingredient_urls = {}, []
    # [Extract needed data into recipe and ingredient URLs into ingredient_urls]
    yield self._handle_next_ingredient(recipe, ingredient_urls)

def parse_ingredient(self, response):
    recipe = response.meta['recipe']
    # [Extend recipe with the information of this ingredient]
    yield self._handle_next_ingredient(recipe, response.meta['ingredient_urls'])
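
A side note on this sketch: ingredient_urls.pop() takes URLs from the end of the list, so ingredients are fetched in reverse order, and because each new request is only yielded once the previous ingredient response arrives, a recipe's ingredient pages are crawled sequentially rather than concurrently.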

Note, however, that if two or more recipes can share the same ingredient URL, you must add dont_filter=True to your requests, thereby repeating requests for the same ingredient. If ingredient URLs are not recipe-specific, seriously consider the first suggestion.
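
If you do stay with the chained-request approach, that dont_filter change is a single extra argument to the Request built in _handle_next_ingredient above:

return Request(
    ingredient_urls.pop(),
    callback=self.parse_ingredient,
    meta={'recipe': recipe, 'ingredient_urls': ingredient_urls},
    # Without this, Scrapy's duplicate filter would drop requests for
    # ingredient URLs already fetched for another recipe.
    dont_filter=True,
)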

Original question on Stack Overflow: https://stackoverflow.com/questions/55960550/
