python - Getting the same URL in Scrapy instead of the matching one

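I am using the following spider to scrape several fields from this site. The problem is that I get one URL applied to all 16 models on a page, then another URL, again applied to all 16 models. I just cannot pin down what is wrong with the url XPath. Can you point out where my url XPath is flawed? Thanks. P.S. The other fields work fine and match up. A missing price field indicates an out-of-stock model.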

class ZoomSpider(CrawlSpider):
    name = "zoom2"
    allowed_domains = ["zoomer.ge"]
    start_urls = [
        "http://zoomer.ge/index.php?cid=35&act=search&category=1&search_type=mobile"
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=(r"index.php\?cid=35&act=search&category=1&search_type=mobile&page=\d*",)),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        sel = Selector(response)
        titles = sel.xpath('//div[@class="productContainer"]/div[5]/div[@class="productListContainer"]')
        items = []
        for t in titles:
            item = ZoomerItem()
            url = sel.xpath('//div[@class="productListImage"]/a/@href').extract()
            item["brand"] = t.xpath('div[3]/text()').re('^([\w\-]+)')
            item["price"] = t.xpath('div[@class="productListPrice"]/div/text()').extract()
            item["model"] = t.xpath('div[3]/text()').re('\s+(.*)$')[0].strip()
            item["url"] = urljoin("http://zoomer.ge", url[0])

            items.append(item)

        return items


Best answer

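You need to use a relative XPath. Your current XPath is evaluated against the whole page (sel), so url[0] is always the first link on that page, and every item from the page gets the same URL. Use this instead: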

t.xpath('.//div[@class="productListImage"]/a/@href').extract()

Note the dot at the beginning. XPaths should be relative to a specific selector, which in your case is "t" in the for loop.
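To see the difference in isolation, here is a minimal sketch that uses the standard library's xml.etree.ElementTree rather than Scrapy, with made-up markup standing in for the product listing (the PAGE string, the hrefs, and the variable names are assumptions for illustration, not taken from the real site). ElementTree does not accept a leading "//" on an element, so the page-wide lookup is emulated by querying from the root on every iteration, which mirrors what sel.xpath('//...') does in the spider.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the product listing page.
PAGE = """
<div class="productContainer">
  <div class="productListContainer">
    <div class="productListImage"><a href="/phone-a">img</a></div>
    <div>BrandA Model 1</div>
  </div>
  <div class="productListContainer">
    <div class="productListImage"><a href="/phone-b">img</a></div>
    <div>BrandB Model 2</div>
  </div>
</div>
"""

root = ET.fromstring(PAGE)
containers = root.findall(".//div[@class='productListContainer']")

LINK = ".//div[@class='productListImage']/a"

# Buggy pattern: the query starts from the whole document on every iteration,
# so index 0 is always the first link on the page.
document_wide = [root.findall(LINK)[0].get("href") for _ in containers]

# Fixed pattern: the query starts from the current container `t`,
# so each item gets its own link.
per_container = [t.findall(LINK)[0].get("href") for t in containers]

print(document_wide)
print(per_container)
```

In the spider itself the equivalent change is only the selector call: query from t with a leading ".//" instead of from sel with a leading "//".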

This is a very common mistake; it is described in the Scrapy docs.

Regarding "python - Getting the same URL in Scrapy instead of the matching one", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/23437729/
