python - Extract all text from arbitrarily nested HTML

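I am using Scrapy to extract the text of news articles from news sites. I am assuming that all the text inside <p> tags is the actual article. (This is not necessarily a safe assumption, but it is what I am working with.) To find all the <p> tags, Scrapy lets me use a CSS selector like this: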

response.css("p::text")

The problem is that some news sites like to put a lot of markup inside their articles, like this:

<p>
    Senator <a href="/people/senator_whats_their_name">What&#39s-their-name</a> is <em>furious</em> about politics!
</p>

Is there a CSS selector, or some other simple method in Scrapy, that extracts the text and strips all the formatting, producing something like this?

Senator What's-their-name is furious about politics!

The catch is that these tags can, in theory, be nested arbitrarily:

<p>
    <span class="some-annoying-markup"><a href="who cares"><em>Wow this link must be important </em></a></span>
<p>

I would still want to extract the text

Wow this link must be important

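I know this is a very naive way to extract content from an HTML page, but that is beyond the scope of this question. If there is an easier way to accomplish this, I am open to suggestions, but what I have found on this topic looks far more complicated than what I have described here, so I am only interested in solving the problem as presented.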

Best answer

In [7]: sel = Selector(text='''<p>
   ...:     Senator <a href="/people/senator_whats_their_name">What&#39s-their-name</a> is <em>furious</em> about politics!
   ...: </p>''')

In [9]: sel.xpath('normalize-space(//p)').extract_first()
Out[9]: "Senator What's-their-name is furious about politics!"

Or:

In [10]: sel = Selector(text='''<p>
    ...:     <span class="some-annoying-markup"><a href="who cares"><em>Wow this link must be important </em></a></span>
    ...: <p>''')

In [11]: sel.xpath('normalize-space(//p)').extract_first()
Out[11]: 'Wow this link must be important'

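XPath's string function concatenates all the text under a tag. normalize-space applies it implicitly to its argument, so when //p matches several elements only the first one is used.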

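normalize-space strips leading and trailing whitespace from the resulting string and collapses each internal run of whitespace into a single space.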
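As a rough illustration of what those two functions compute, here is a standard-library sketch. It is not how Scrapy or lxml implement XPath; it only mimics the result. It assumes the snippets are well-formed XML (closing </p> tags, and the entity written as &#39; with its semicolon), and the helper name normalized_text is invented for this example.

```python
import re
import xml.etree.ElementTree as ET

# Whitespace characters recognised by XPath 1.0: space, tab, CR, LF
XPATH_WS = " \t\r\n"


def normalized_text(element):
    """Approximate normalize-space(string(element)) for an ElementTree element."""
    # string(): concatenate every descendant text node in document order
    raw = "".join(element.itertext())
    # normalize-space(): collapse inner whitespace runs, then trim both ends
    return re.sub("[ \t\r\n]+", " ", raw).strip(XPATH_WS)


first = '''<p>
    Senator <a href="/people/senator_whats_their_name">What&#39;s-their-name</a> is <em>furious</em> about politics!
</p>'''

nested = '''<p>
    <span class="some-annoying-markup"><a href="who cares"><em>Wow this link must be important </em></a></span>
</p>'''

first_text = normalized_text(ET.fromstring(first))
nested_text = normalized_text(ET.fromstring(nested))

print(first_text)
print(nested_text)
```

Real-world pages are rarely well-formed XML, which is one reason the XPath expression evaluated through Scrapy's HTML parser is the more practical choice.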

Regarding python - Extract all text from arbitrarily nested HTML, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/42073902/
