I'm trying to scrape a dynamic web page with scrapy and playwright. I installed both scrapy and playwright, but when I run my spider I get this error:
ImportError: cannot import name 'PageCoroutine' from 'scrapy_playwright.page' (C:\Ali\DataCamp\Web Scraping in Python\Scrapy\venv\lib\site-packages\scrapy_playwright\page.py)
Here is my code (it is just test code):
import scrapy
from scrapy_playwright.page import PageCoroutine


class PwspiderSpider(scrapy.Spider):
    name = 'pwspider'

    def start_requests(self):
        yield scrapy.Request(
            "https://shoppable-campaign-demo.netlify.app/#/",
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page_coroutine=[PageCoroutine('wait_for_selector', 'div#productListing')],
            ),
        )

    async def parse(self, response):
        yield {'text': response.text}
I even added DOWNLOAD_HANDLERS and TWISTED_REACTOR to the settings file.
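For reference, here is a minimal sketch of those two settings as the scrapy-playwright README describes them. The asker's actual settings.py is not shown in the question, so treat this as an assumption about what it contains rather than a copy of it.

```python
# settings.py (sketch; the asker's real file is not shown in the question)

# Route HTTP and HTTPS requests through the scrapy-playwright download handler.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# scrapy-playwright needs Twisted's asyncio-based reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```

These settings only enable the handler. They do not affect which names scrapy_playwright.page exports, so they cannot resolve the ImportError on their own.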
Best answer
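PageCoroutine is deprecated and no longer available, which is why the import fails. Use PageMethod instead, passed through the playwright_page_methods meta key.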
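If it is unclear which of the two names an installed release provides, a quick check along these lines may help. This is only a sketch using the standard library; available_names is a hypothetical helper introduced here for illustration and is not part of scrapy-playwright.

```python
import importlib


def available_names(module_name, candidates):
    """Return the candidate names that the given module actually exposes.

    Returns an empty list if the module itself cannot be imported.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return []
    return [name for name in candidates if hasattr(module, name)]


if __name__ == "__main__":
    # Shows which of the two class names the installed package exports.
    print(available_names("scrapy_playwright.page", ["PageMethod", "PageCoroutine"]))
```

If the printed list contains only PageMethod, the installed release no longer ships PageCoroutine and the import has to be updated.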
Working code example:
import scrapy
from scrapy_playwright.page import PageMethod


class TestSpider(scrapy.Spider):
    name = "test"

    def start_requests(self):
        yield scrapy.Request(
            url="https://shoppable-campaign-demo.netlify.app/#/",
            callback=self.parse,
            meta={
                "playwright": True,
                # Wait until the product cards have rendered before returning the page.
                "playwright_page_methods": [
                    PageMethod("wait_for_selector", '.card-body'),
                ],
            },
        )

    def parse(self, response):
        products = response.xpath('//*[@class="card-body"]')
        for product in products:
            yield {
                'title': product.xpath('.//*[@class="card-title"]/text()').get()
            }
Output:
{'title': 'Oxford Loafers'}
2022-11-05 20:40:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://shoppable-campaign-demo.netlify.app/#/>
{'title': 'Ankle-length Slack'}
2022-11-05 20:40:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://shoppable-campaign-demo.netlify.app/#/>
{'title': 'White Baseball Cap'}
2022-11-05 20:40:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://shoppable-campaign-demo.netlify.app/#/>
{'title': 'Triangle Bikini Top'}
2022-11-05 20:40:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://shoppable-campaign-demo.netlify.app/#/>
{'title': 'Short Blazer'}
2022-11-05 20:40:40 [scrapy.core.engine] INFO: Closing spider (finished)
2022-11-05 20:40:40 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 235,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 39851,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 41.370211,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 11, 5, 14, 40, 40, 261151),
'item_scraped_count': 5,
Source: Stack Overflow, "python - Cannot import name 'PageCoroutine' from 'scrapy_playwright.page'": https://stackoverflow.com/questions/74328303/