python - How to use multiple nested span CSS selectors in Scrapy?

I am dealing with a tricky CSS selector problem involving multiple nested spans.

(A) Normally the HTML/CSS looks like this:

<div class="pricing">
    <strong>1 200 €</strong> 
</div>

(B) But some sections look like this:

<div class="pricing">
    <strong>
        <span class="promotion">
            <span class="promo-price">1 100 €</span>
        </span>
        <span class="strike">
            <span>1 200€</span>
        </span>
    </strong>
    <div class="new">New supplier</div>
</div>

(C) Like this:

<div class="pricing">
    <strong>3 400 €</strong> 
    <span>/ best:  4500.00 €</span>
</div>

(D) And like this:

<div class="pricing">
    <strong>4 900 €</strong> 
    <span class="netto">+ taxes</span> 
    <span>/ best:  4900.00 €</span>
</div>


Using the following kind of Scrapy CSS selector:

response.css("div.pricing strong ::text").extract()
# ['2 500 €', '\n    ', '\n    ', '1 100 €', '\n    ', '\n    ', '1 200€', '3 999 €',...]

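This shows that the problematic <span ...> parts of the markup above add whitespace to the selected text. So I tried to ignore both the strike and promotion classes using various variants of :not(), like this: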

response.css("div.pricing strong:not([class*='promotion']):not([class*='strike'])::text").extract()
# <same result as above>

I can also get the promo-price alone with:

response.css("div.pricing  .promo-price::text").extract()
# ['1 100 €']


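At this point I don't know how to:

get all the (A) prices
get all the (B) promo-prices (only)
get the result without the introduced white space (as shown above)
do all of the above in (preferably) one CSS selector or line

Q: How can I do this in the simplest way?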


Note: I have already seen similar questions, but they did not help much with my case.


Update:

I could not get this done following @boltclock's directions and ended up with an ugly hack like this:

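# Direct text of <strong>; empty after strip() when the price sits in nested spans (B)
adPrice = aditem.css("div.pricing strong::text").extract_first(default='').strip()
if adPrice == '':
    # Fall back to the promotional price
    adPrice = aditem.css("div.pricing span.promo-price::text").extract_first()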

So, if anyone has a better or more elegant solution...
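One possible alternative to the fallback above, offered only as a sketch: keep the selector with the space (`strong ::text`), which returns every descendant text node, and pick the first node that is not blank. The helper name `first_price` is hypothetical, and the approach assumes the promotional price always precedes the struck-through price in document order, as it does in (B).

```python
def first_price(texts):
    """Return the first non-blank text node, stripped, or None if there is none."""
    for text in texts:
        stripped = text.strip()
        if stripped:
            return stripped
    return None


# Text nodes as they might come back for a (B) item and for an (A) item.
print(first_price(['\n    ', '\n    ', '1 100 €', '\n    ', '\n    ', '1 200€']))
print(first_price(['1 200 €']))
```

Inside the spider this would presumably be called per item as `first_price(aditem.css("div.pricing strong ::text").extract())`.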

Best answer

Hmm.

Does div.new only appear after a strong that contains all that complexity (B), and never after a strong that just holds a price (A)?

If so:

get all the (A) prices

result without the introduced white space (as shown above)

response.css("div.pricing strong:only-child::text").extract()

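Note the omitted space before ::text, which ensures that you only get text located directly in the strong; see the usage guidance at the end of my answer to this question.

:only-child ensures that the strong does not match when div.new is present. If the absence of div.new implies (A), you never have to worry about (B).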
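The difference the space makes can be illustrated outside Scrapy. The sketch below uses the standard library's `xml.etree.ElementTree` as a stand-in, not Scrapy's own machinery: `itertext()` walks all descendant text nodes (comparable to `strong ::text`), while the element's `.text` plus the `.tail` of each child are the text nodes directly inside it (comparable to `strong::text`). The markup is a condensed version of (B).

```python
import xml.etree.ElementTree as ET

strong = ET.fromstring(
    "<strong>\n"
    '    <span class="promotion"><span class="promo-price">1 100 €</span></span>\n'
    '    <span class="strike"><span>1 200€</span></span>\n'
    "</strong>"
)

# Comparable to "strong ::text" (with space): text of strong and all descendants.
descendant_text = list(strong.itertext())

# Comparable to "strong::text" (no space): only text nodes directly inside strong.
direct_text = [strong.text or ""] + [child.tail or "" for child in strong]

print(descendant_text)
print(direct_text)
```

For (B) the direct text nodes are whitespace only, which is why a plain `strong::text` lookup on such an item comes back blank.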

get all the (B) promo-prices (only)

response.css("div.pricing .promo-price::text").extract()

all of the above in (preferably) one CSS selector or line

At this point, grouping the two selectors above should be a simple matter:

response.css("div.pricing strong:only-child::text, div.pricing .promo-price::text").extract()

If div.new is not relevant, this will be hard to do with a CSS selector, because there is no other way to tell (A) and (B) apart. XPath, on the other hand, can do it easily:

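# XPath 1.0, which Scrapy's selectors use, does not allow a parenthesized step inside a path, so the two paths are joined with a union
response.xpath("//div[@class='pricing']/strong[not(./span)]/text() | //div[@class='pricing']//span[@class='promo-price']/text()").extract()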
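To check the selection rule without a live Scrapy response, here is a minimal sketch using only the standard library. It swaps in `xml.etree.ElementTree`, which supports just a subset of XPath, so the "strong without a span child" test is done in Python. The function name `extract_prices` and the `<root>` wrapper are assumptions made for illustration; the markup is (A) through (D) from the question, lightly condensed.

```python
import xml.etree.ElementTree as ET

# Snippets (A)-(D) from the question, wrapped in one root so they parse as XML.
HTML = """<root>
<div class="pricing"><strong>1 200 €</strong></div>
<div class="pricing">
    <strong>
        <span class="promotion"><span class="promo-price">1 100 €</span></span>
        <span class="strike"><span>1 200€</span></span>
    </strong>
    <div class="new">New supplier</div>
</div>
<div class="pricing"><strong>3 400 €</strong> <span>/ best:  4500.00 €</span></div>
<div class="pricing">
    <strong>4 900 €</strong>
    <span class="netto">+ taxes</span>
    <span>/ best:  4900.00 €</span>
</div>
</root>"""


def extract_prices(markup):
    """Return one price per div.pricing: the promo price if present, else the plain price."""
    prices = []
    for div in ET.fromstring(markup).iterfind(".//div[@class='pricing']"):
        strong = div.find("strong")
        if strong is None:
            continue
        promo = strong.find(".//span[@class='promo-price']")
        # Same idea as strong[not(./span)]: no span child means a plain price.
        node = strong if strong.find("span") is None else promo
        if node is not None and node.text:
            prices.append(node.text.strip())
    return prices


print(extract_prices(HTML))
```

Under this rule (C) and (D) also yield their prices, because their strong has no span child. With `strong:only-child` they would not match, since the strong there has sibling span elements.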

Regarding "python - How to use multiple nested span CSS selectors in Scrapy?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/52496078/
