python - Scrapy-playwright with multiple start_urls

Tags: python scrapy playwright scrapy-playwright

A similar question is discussed here, but I can't get my code to work. The goal is for scrapy-playwright to generate a request/response for every URL in start_urls and to parse each response the same way. The CSV of URLs is read into a list correctly, but start_requests does not generate any requests. See the commented code below.

import scrapy
import asyncio
from scrapy_playwright.page import PageMethod

class MySpider(scrapy.Spider):
    name = "Forum01"
    allowed_domains = ["example.com"]

    def start_requests(self):
        with open('FullLink.csv') as file:
            start_urls = [line.strip() for line in file]
        print(start_urls) # When the spider runs, the list of URLs is printed correctly
        
        for u in self.start_urls:    
            yield scrapy.Request(
                u,
                meta=dict(
                    playwright=True,
                    playwright_include_page=False,
                    playwright_page_methods=[
                        PageMethod("wait_for_selector", "div.modal-body > p")
                    ], # End of methods
                ), # End of meta
                callback=self.parse
            )

    async def parse(self, response): # Does not work either with sync or async
        for item in response.css('div.modal-content'):
            yield {
                'title': item.css('h1::text').get(),
                'info': item.css('.row+ p::text').get(),
            }   

Do you know how to feed the URLs to the spider correctly? Thanks!

Best Answer

You are trying to iterate over an empty sequence in your for loop, rather than the sequence you extracted from the csv file.

Unless explicitly overridden, self.start_urls always refers to the empty list created in the scrapy.Spider constructor. Removing the self. part of self.start_urls should fix your issue.
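For context, this is roughly the relevant part of Scrapy's Spider constructor; the sketch below paraphrases Scrapy's behavior rather than quoting its exact source:

class Spider:
    def __init__(self, name=None, **kwargs):
        ...
        # If the spider class does not define start_urls, fall back to an
        # empty list; this is what self.start_urls resolves to inside the
        # start_requests method above.
        if not hasattr(self, "start_urls"):
            self.start_urls = []

With that in mind, here is the corrected spider: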

import scrapy
import asyncio
from scrapy_playwright.page import PageMethod

class MySpider(scrapy.Spider):
    name = "Forum01"
    allowed_domains = ["example.com"]

    def start_requests(self):
        with open('FullLink.csv') as file:
            start_urls = [line.strip() for line in file]
        print(start_urls)  # Debug: confirm the URLs were read from the CSV

        for u in start_urls:  # <- changed from self.start_urls to the local start_urls
            yield scrapy.Request(
                u,
                meta=dict(
                    playwright=True,
                    playwright_include_page=False,
                    playwright_page_methods=[
                        PageMethod("wait_for_selector", "div.modal-body > p")
                    ], # End of methods
                ), # End of meta
                callback=self.parse
            )

    async def parse(self, response):  # Scrapy supports coroutine callbacks
        for item in response.css('div.modal-content'):
            yield {
                'title': item.css('h1::text').get(),
                'info': item.css('.row+ p::text').get(),
            }  
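If you would rather keep referring to self.start_urls, another option is to populate the instance attribute before start_requests runs, for example in __init__. This is a minimal sketch under the same assumptions as above (a FullLink.csv with one URL per line), not part of the original answer:

import scrapy
from scrapy_playwright.page import PageMethod

class MySpider(scrapy.Spider):
    name = "Forum01"
    allowed_domains = ["example.com"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Overwrite the empty list created by scrapy.Spider's constructor
        # so that self.start_urls actually holds the URLs from the CSV.
        with open("FullLink.csv") as file:
            self.start_urls = [line.strip() for line in file if line.strip()]

    def start_requests(self):
        for u in self.start_urls:  # now non-empty
            yield scrapy.Request(
                u,
                meta={
                    "playwright": True,
                    "playwright_include_page": False,
                    "playwright_page_methods": [
                        PageMethod("wait_for_selector", "div.modal-body > p")
                    ],
                },
                callback=self.parse,
            )

    # parse() stays the same as in the answer above.

Either way works; the key point is that the list you iterate over must be the one you actually filled from the CSV.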

Regarding python - Scrapy-playwright with multiple start_urls, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/77528403/
