python - Using the XPath text() function

I have some HTML files that contain the following:

<div>Chapter 1. <span>Contents of chapter N1.</span> </div>
<div>Chapter 2. <span>Contents of chapter N2.</span> </div>

I am trying to extract the text inside these tags using the XPath expression '//text()':

from io import StringIO
from lxml import etree

parser = etree.HTMLParser()
tree = etree.parse(StringIO(html), parser)
text = list(set(tree.xpath('//text()')))
text = " ".join(text)

It works fine, except that I want to change the order in which the text is extracted. Right now I get the following result:

Contents of chapter N1. Contents of chapter N2. Chapter 2. Chapter 1.

But the result I want is:

Chapter 1. Contents of chapter N1. Chapter 2. Contents of chapter N2.

Is there a better way to do this than recursively processing every tag from the top of the document to the bottom?

Best answer

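Are you sure that string(/) doesn't give you the answer you want? It is not quite the same thing for a document such as <p><i>Hello</i>!</p>: it gives you "Hello!" rather than "Hello !", but in most cases I think that is what you want.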
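As a rough sketch of the suggestion above, assuming lxml is installed and using the sample HTML from the question: '//text()' already returns text nodes in document order, so the scrambled output most likely comes from the set() call, not from XPath itself. The variable names (ordered, whole, normalized, hello) are illustrative only.

```python
from io import StringIO
from lxml import etree

# Sample markup taken from the question.
html = (
    "<div>Chapter 1. <span>Contents of chapter N1.</span> </div>\n"
    "<div>Chapter 2. <span>Contents of chapter N2.</span> </div>"
)

tree = etree.parse(StringIO(html), etree.HTMLParser())

# Option 1: keep '//text()' but drop set(), which discards document order.
# Whitespace-only nodes are skipped before joining.
nodes = tree.xpath('//text()')
ordered = " ".join(t.strip() for t in nodes if t.strip())

# Option 2: string(/) concatenates every text node in document order,
# with no separator added between nodes.
whole = tree.xpath('string(/)')
normalized = " ".join(whole.split())  # collapse runs of whitespace

# The difference the answer mentions: joining text nodes with " "
# puts a space between "Hello" and "!", while string(/) does not.
hello = etree.parse(StringIO("<p><i>Hello</i>!</p>"), etree.HTMLParser())
hello_string = hello.xpath('string(/)')
hello_joined = " ".join(hello.xpath('//text()'))

print(ordered)
print(normalized)
print(hello_string, "|", hello_joined)
```

If set() was there to remove duplicate text nodes, list(dict.fromkeys(nodes)) would be one way to deduplicate while keeping the first-seen order.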

This question about Python and the XPath text() function matches a similar question found on Stack Overflow: https://stackoverflow.com/questions/17471026/