python - cannot import name 'PageCoroutine' from 'scrapy_playwright.page'

Tags: python web-scraping scrapy

I am trying to use scrapy with playwright to scrape a dynamic web page. I have installed both scrapy and playwright, but when I run my spider I get this error:

ImportError: cannot import name 'PageCoroutine' from 'scrapy_playwright.page' (C:\Ali\DataCamp\Web Scraping in Python\Scrapy\venv\lib\site-packages\scrapy_playwright\page.py)

Here is my code (it is only test code):

import scrapy
from scrapy_playwright.page import PageCoroutine

class PwspiderSpider(scrapy.Spider):
    name = 'pwspider'
    
    def start_requests(self):
        yield scrapy.Request("https://shoppable-campaign-demo.netlify.app/#/", meta=dict(playwright=True, playwright_include_page=True, playwright_page_coroutine=[PageCoroutine('wait_for_selector', 'div#productListing')]))

    async def parse(self, response):
        yield {'text': response.text}

I have even added DOWNLOAD_HANDLERS and TWISTED_REACTOR to the settings file.
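For context, the scrapy-playwright README describes these two settings roughly as shown below. This is a sketch of a typical settings.py, not the asker's actual file, which is not shown in the question.

```python
# settings.py (sketch): route HTTP and HTTPS downloads through Playwright
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# scrapy-playwright needs Twisted's asyncio-based reactor
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```

These settings only enable the download handler. They do not affect which names scrapy_playwright.page exports, so they cannot fix the ImportError on their own.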

Accepted answer

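PageCoroutine was deprecated and has since been removed from scrapy-playwright, which is why the import fails. Use PageMethod instead, and pass it through the playwright_page_methods key in the request meta.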
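If you are not sure which of the two names your installed version exposes, a quick check along these lines may help. It is a minimal sketch: exported_names is a hypothetical helper written for this answer, not part of scrapy-playwright, and it returns an empty list when the module cannot be imported at all.

```python
import importlib


def exported_names(module_name, candidates):
    """Return the candidate names that the given module actually defines."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return []  # the module is not installed in this environment
    return [name for name in candidates if hasattr(module, name)]


# Shows which of the old and new class names are importable here.
print(exported_names("scrapy_playwright.page", ["PageCoroutine", "PageMethod"]))
```

If PageCoroutine is missing from the printed list, the import in the question cannot succeed on that install, whatever the settings file contains.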

Working code example:

import scrapy
from scrapy_playwright.page import PageMethod


class TestSpider(scrapy.Spider):
    name = "test"

    def start_requests(self):
        yield scrapy.Request(
            url="https://shoppable-campaign-demo.netlify.app/#/",
            callback=self.parse,
            meta={
                "playwright": True,
                # Wait until the product cards have rendered before the response is returned.
                "playwright_page_methods": [
                    PageMethod("wait_for_selector", ".card-body"),
                ],
            },
        )

    def parse(self, response):
        # Each product card keeps its title in an element with class "card-title".
        products = response.xpath('//*[@class="card-body"]')
        for product in products:
            yield {
                "title": product.xpath('.//*[@class="card-title"]/text()').get(),
            }

Output:

{'title': 'Oxford Loafers'}
2022-11-05 20:40:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://shoppable-campaign-demo.netlify.app/#/>
{'title': 'Ankle-length Slack'}
2022-11-05 20:40:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://shoppable-campaign-demo.netlify.app/#/>
{'title': 'White Baseball Cap'}
2022-11-05 20:40:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://shoppable-campaign-demo.netlify.app/#/>
{'title': 'Triangle Bikini Top'}
2022-11-05 20:40:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://shoppable-campaign-demo.netlify.app/#/>
{'title': 'Short Blazer'}
2022-11-05 20:40:40 [scrapy.core.engine] INFO: Closing spider (finished)
2022-11-05 20:40:40 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 235,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 39851,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 41.370211,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 11, 5, 14, 40, 40, 261151),
 'item_scraped_count': 5,
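The XPath logic in parse can also be tried offline, without launching a browser. The sketch below swaps Scrapy's selectors for the standard library's xml.etree.ElementTree, which supports only a subset of XPath, and runs against hypothetical markup modelled on the class names used above. The real page's HTML is not shown in this thread and may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup that mimics the card structure assumed above.
SAMPLE_HTML = """
<div id="productListing">
  <div class="card-body"><h5 class="card-title">Oxford Loafers</h5></div>
  <div class="card-body"><h5 class="card-title">Short Blazer</h5></div>
</div>
"""


def extract_titles(markup):
    """Collect the text of the card-title element inside every card-body element."""
    root = ET.fromstring(markup.strip())
    titles = []
    for product in root.findall(".//*[@class='card-body']"):
        title = product.find(".//*[@class='card-title']")
        titles.append(title.text if title is not None else None)
    return titles


print(extract_titles(SAMPLE_HTML))
```

ElementTree only parses well-formed XML, so this is useful for checking the selector logic on a small fragment rather than on a full real-world page.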

Source: the Stack Overflow question "python - cannot import name 'PageCoroutine' from 'scrapy_playwright.page'": https://stackoverflow.com/questions/74328303/
