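python - How to select all the text inside an element in Scrapy when that element contains other elements?

I have a page containing a section like the one below. It is basically a question inside the main p tag, but whenever a superscript appears, it breaks my code.

The text I want to get is: "For Cosine Rule of any triangle ABC, b2 is equal to"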

<p><span class="mcq_srt">MCQ.</span>For Cosine Rule of any triangle ABC, b<sup>2</sup> is equal to</p>
    <ol>
        <li>a<sup>2</sup> - c<sup>2</sup> + 2ab cos A</li>
        <li>a<sup>3</sup> + c<sup>3</sup> - 3ab cos A</li>
        <li>a<sup>2</sup> + c<sup>2</sup> - 2ac cos B</li>
        <li>a<sup>2</sup> - c<sup>2</sup> 4bc cos A</li>
    </ol>

When I select on the p, I lose the 2 that should be a superscript. I also get two separate strings in the list, which messes things up when I try to store the answers.

 response.css('p::text') > ["For Cosine Rule of any triangle ABC, b", "is equal to"]

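I could try selecting the superscripts with

response.css('p sup::text')

and then merge them back in by checking whether a string starts with a lowercase letter, but that falls apart when I have many questions. This is what I do in my parse method: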

    questions = [x for x in questions if x not in [' ']] #The list I get usually has a bunch of ' ' in them
    question_sup = response.css('p sup::text').extract()
    answer_sup = response.css('li sup::text').extract()
    all_choices = response.css('li::text')[:-2].extract() #for choice
    all_answer = response.css('.dsplyans::text').extract() #for answer

    if len(question_sup) is not 0:
        count=-1
        for question in questions:
            if question[1].isupper() is False or question[0] in [',', '.']: #[1] because there is a space at the starting
                questions[count]+=question_sup.pop(0)+question
                del questions[count+1]

            count+=1

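The approach above fails a lot, and since I am crawling many pages I don't know how to debug it. I keep getting an error about popping from an empty list, which I suspect is caused by the merging logic above. Any help would be appreciated!

Best answer

If you select every text node inside p (those of p itself and of all its descendants), you get a list of text nodes in document order, so you can simply join the list with ''. For example: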

>>> from scrapy.selector import Selector
>>> p = Selector(text='<p>For Cosine Rule of any triangle ABC, b<sup>2</sup> is equal to</p>')
>>> t = p.css('p::text, p *::text')  # Give me the text in <p>, plus the text of all of its descendants
>>> ''.join(t.extract())
'For Cosine Rule of any triangle ABC, b2 is equal to'

Of course, this loses the superscript notation. If you need to preserve it, you can do something like this:

>>> from scrapy.selector import Selector
>>> p = Selector(text='<p>For Cosine Rule of any triangle ABC, b<sup>2</sup> is equal to</p>')
>>> t = p.css('p::text, p *')
>>> result = []
>>> for e in t:
...     if isinstance(e.root, str):  # text nodes come back as plain strings
...         result.append(e.root)
...     elif e.root.tag == 'sup':
...         result.append('^' + e.root.text)  # Assuming there can't be more nested elements
...     # handle other tags like sub
...
>>> ''.join(result)
'For Cosine Rule of any triangle ABC, b^2 is equal to'
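
If tags can be nested, or if sub should be handled as well, a small recursive walk over the element tree is one way to generalize the loop above. The sketch below is an illustration, not part of the original answer: the `MARKERS` table and the `render` function are hypothetical names, and `_` for subscripts is an arbitrary choice. It uses the standard library's `xml.etree.ElementTree` to stay self-contained; the lxml elements that Scrapy exposes through `.root` offer the same `.tag`, `.text` and `.tail` attributes, so the function should carry over, though that has not been checked here.

```python
import xml.etree.ElementTree as ET

# Hypothetical plain-text notation for the tags we care about.
MARKERS = {"sup": "^", "sub": "_"}


def render(elem):
    """Return the text of elem and its descendants, prefixing sup/sub content."""
    parts = [MARKERS.get(elem.tag, ""), elem.text or ""]
    for child in elem:
        parts.append(render(child))      # the child's own content, recursively
        parts.append(child.tail or "")   # text that follows the child's closing tag
    return "".join(parts)


p = ET.fromstring(
    "<p>For Cosine Rule of any triangle ABC, b<sup>2</sup> is equal to</p>"
)
question_text = render(p)
print(question_text)

# Nested and mixed tags are handled by the same function.
li = ET.fromstring("<li>H<sub>2</sub>O and x<sup>n<sub>i</sub></sup></li>")
nested_text = render(li)
print(nested_text)
```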

Regarding "python - How to select all the text inside an element in Scrapy when that element contains other elements?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/44071248/
