python - Scrapy spider closes after the first request to start_urls

Tags: python web-scraping scrapy web-crawler

I am running this spider with the same structure as my other spiders, but for this particular website and this particular spider, it closes after the first request to the start URL. What could the problem be?

Terminal output:

...
2022-04-03 17:42:34 [scrapy.core.engine] INFO: Spider opened
2022-04-03 17:42:34 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-04-03 17:42:34 [spiderTopo] INFO: Spider opened: spiderTopo
2022-04-03 17:42:34 [spiderTopo] INFO: Spider opened: spiderTopo
2022-04-03 17:42:34 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-04-03 17:42:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.topocentras.lt/> (referer: None)
2022-04-03 17:42:34 [scrapy.core.engine] INFO: Closing spider (finished)
2022-04-03 17:42:34 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 299,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 43691,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 0.293075,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 4, 3, 14, 42, 34, 535151),
 'httpcompression/response_bytes': 267627,
 'httpcompression/response_count': 1,
 'log_count/DEBUG': 2,
 'log_count/INFO': 12,
 'memusage/max': 60579840,
 'memusage/startup': 60579840,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2022, 4, 3, 14, 42, 34, 242076)}
2022-04-03 17:42:34 [scrapy.core.engine] INFO: Spider closed (finished)

Spider code:

import scrapy
from pbl.items import PblSpider  # item class (defined in pbl/items.py)

base_url = 'https://www.topocentras.lt'

class PblItem(scrapy.Spider):
    name = 'spiderTopo'
    allowed_domains = ['topocentras.lt']
    start_urls = ['https://www.topocentras.lt/']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)  # keep the base Spider initialisation
        self.declare_xpath()

    def declare_xpath(self):
        # Absolute, position-based XPaths like these are brittle: they break as
        # soon as the page layout shifts at all (see the accepted answer below).
        self.getAllCategoriesXpath = '/html/body/div[1]/header[1]/nav/ul/li[1]/div/ul[1]/li/a/@href'
        self.getAllSubCategoriesXpath = '//*[@id="root"]/main/div/aside/div/ul/li/a/@href'
        self.getAllItemsXpath = '/html/body/div[1]/main/div/section/div[4]/div/article/div[1]/a/@href'
        self.TitleXpath = '/html/body/div[2]/main/div[1]/div[2]/div/article/h1/text()'
        self.ImageXpath = '/html/body/div[2]/main/div[1]/div[2]/div/article/div[2]/div[1]/div[1]/div[1]/div[2]/div[1]/img/@src'
        self.PriceXpath = '/html/body/div[2]/main/div[1]/div[2]/div/article/div[2]/div[3]/div[1]/div[2]/div/span/text()'
    
    def parse(self, response):
        for href in response.xpath(self.getAllCategoriesXpath):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_category, dont_filter=True)

    def parse_category(self, response):
        for href in response.xpath(self.getAllSubCategoriesXpath):
            url = response.urljoin(href.extract())
            print(response.body)  # debug leftover: prints the full page body on every iteration
            yield scrapy.Request(url, callback=self.parse_subcategory, dont_filter=True)

    def parse_subcategory(self, response):
        for href in response.xpath(self.getAllItemsXpath):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_main_item, dont_filter=True)

        #next_page = response.xpath('/html/body/main/section/div[1]/div/div[2]/div[1]/div/div[2]/div[3]/ul/li/a[@rel="next"]/@href').extract_first()
        #if next_page is not None:
        #    url = response.urljoin(next_page)
        #    yield scrapy.Request(url, callback=self.parse_category, dont_filter=True)
    
    def parse_main_item(self, response):
        item = PblSpider()

        # cleanText, parseText and listToStr are helper methods that are not
        # shown here; they are assumed to be defined elsewhere in the project.
        Title = response.xpath(self.TitleXpath).extract()
        Title = self.cleanText(self.parseText(self.listToStr(Title)))

        Link = response.url

        Image = response.xpath(self.ImageXpath).extract_first()

        Price = response.xpath(self.PriceXpath).extract()
        Price = self.cleanText(self.parseText(self.listToStr(Price)))

        # SubPriceXpath is never defined in declare_xpath(), so these two lines
        # would raise AttributeError once a product page is actually parsed:
        # sub_price = response.xpath(self.SubPriceXpath).extract()
        # sub_price = self.cleanText(self.parseText(self.listToStr(sub_price)))

        # Put each element into its item attribute.
        item['Title'] = Title
        #item['Category'] = Category
        item['Price'] = Price
        #item['Features'] = Features
        item['Image'] = Image
        item['Link'] = Link

        return item

I have already tried changing the user agent in the settings.py file, because that was the first problem I hit when using scrapy shell: the selectors kept returning empty lists.
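For reference, the override in settings.py was along these lines (the exact UA string here is only an example of a realistic browser string):

# settings.py -- example only; any realistic browser user-agent string works
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36'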

I also tried specifying the user agent on the command line when running the spider (e.g. scrapy crawl spiderTopo -s USER_AGENT="Mozilla/5.0 ...").

I also added the dont_filter=True option to the requests.
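A quick way to confirm that parse() is where everything stops is to log how many links the category XPath actually matches; a minimal debugging sketch (the log line is an addition for diagnosis, not part of the original spider):

def parse(self, response):
    links = response.xpath(self.getAllCategoriesXpath).getall()
    # If this logs 0, the XPath matched nothing, no requests get scheduled,
    # and Scrapy closes the spider with finish_reason 'finished'.
    self.logger.info('category links found: %d', len(links))
    for href in links:
        yield scrapy.Request(response.urljoin(href), callback=self.parse_category, dont_filter=True)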

Best answer

I think there is a problem with your XPath for the categories (getAllCategoriesXpath). I would suggest simplifying it; for example, to grab all the categories I would use:

self.getAllCategoriesXpath = '//a[@class="CatalogTree-linkButton-1uH"]/@href'
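A minimal sketch of how parse() would use that selector (assuming the CatalogTree-linkButton-1uH class is still present in the live markup; hashed suffixes in CSS-module class names like this one can change when the site is redeployed):

def parse(self, response):
    # Match category links by class instead of a long positional path
    for href in response.xpath('//a[@class="CatalogTree-linkButton-1uH"]/@href').getall():
        yield scrapy.Request(response.urljoin(href), callback=self.parse_category, dont_filter=True)

You can check the selector interactively first with scrapy shell 'https://www.topocentras.lt/' and then response.xpath('//a[@class="CatalogTree-linkButton-1uH"]/@href').getall(); if that returns a non-empty list, the spider will schedule the category requests instead of closing immediately.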

Regarding "python - Scrapy spider closes after the first request to start_urls", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/71727008/

Related articles:

web-scraping - IMPORTXML function in Google Sheets

python - Scrapy image pipeline warning: File (unknown-error): Error downloading image from <GET

python - How do I send JavaScript and Cookies enabled in Scrapy?

python - Parsing broken XML with numbers as tag names

python - File not found in an app hosted on pythonanywhere

python - Pandas dataframe: return the first word in a string column

python - How do you use the proxy support of the python aiohttp framework

Python POST request fails with [Errno 10054] An existing connection was forcibly closed by the remote host

html - Python and xpath: identify html tags with spaced attributes

python - Error when crawling with scrapy