python - Combining multiple tags with lxml

I have an HTML file that looks like this:

...
<p>
    <strong>This is </strong>
    <strong>a lin</strong>
    <strong>e which I want to </strong>
    <strong>join.</strong>
</p>
<p>
    2.
    <strong>But do not </strong>
    <strong>touch this</strong>
    <em>Maybe some other tags as well.</em>
    bla bla blah...
</p>
...

What I need is this: if all the tags in a "p" block are "strong", combine them into a single tag, i.e.

<p>
    <strong>This is a line which I want to join.</strong>
</p>

The other block should be left untouched, since it contains other things.

Any suggestions? I am using lxml.

Update:

Here is what I have tried so far:

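for p in self.tree.xpath('//body/p'):
    # p.text is the text before the first child (p.tail would be the text after </p>)
    if p.text is None:  # no text before first element
        children = list(p)
        for child in children:
            if len(children) == 1 or child.tag != 'strong' or child.tail is not None:
                break
        else:
            # the loop ended without a break: every child is a bare <strong>
            etree.strip_tags(p, 'strong')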

With this code I was able to strip the strong tags from the part I care about, which gives:

<p>
      This is a line which I want to join.  
</p>

So now I just need a way to put the tag back...
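
One possible way to get there in lxml without stripping and re-adding tags is to move all the text into the first <strong> and drop the others. The following is only a minimal sketch under stated assumptions: merge_strong_runs and SAMPLE are illustrative names that do not come from the question, the input is assumed to be well-formed enough for etree.fromstring, and whitespace-only text between tags is treated as "no text".

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the small API subset used here also exists in the stdlib.
    import xml.etree.ElementTree as etree

# Illustrative input modelled on the two paragraphs in the question.
SAMPLE = """<body>
<p>
    <strong>This is </strong>
    <strong>a lin</strong>
    <strong>e which I want to </strong>
    <strong>join.</strong>
</p>
<p>
    2.
    <strong>But do not </strong>
    <strong>touch this</strong>
    <em>Maybe some other tags as well.</em>
    bla bla blah...
</p>
</body>"""


def merge_strong_runs(root):
    """Merge the children of every <p> that holds nothing but <strong> tags."""
    for p in list(root.iter('p')):
        children = list(p)
        # Skip paragraphs with fewer than two children, or with real text
        # sitting directly inside <p> before the first child.
        if len(children) < 2 or (p.text or '').strip():
            continue
        # Every child must be a leaf <strong> followed only by whitespace.
        if any(c.tag != 'strong' or len(c) or (c.tail or '').strip()
               for c in children):
            continue
        first = children[0]
        first.text = ''.join(c.text or '' for c in children)
        # Keep the whitespace that preceded </p> so the layout is preserved.
        first.tail = children[-1].tail
        for extra in children[1:]:
            p.remove(extra)  # removes the element together with its tail


root = etree.fromstring(SAMPLE)
merge_strong_runs(root)
print(etree.tostring(root, encoding='unicode'))
```

The second paragraph is skipped because of the leading "2." text and the <em> child, so only the first paragraph is rewritten.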

Best answer

I was able to do this using bs4 (BeautifulSoup):

from bs4 import BeautifulSoup as bs

html = """<p>
<strong>This is </strong>
<strong>a lin</strong>
<strong>e which I want to </strong>
<strong>join.</strong>
</p>
<p>
<strong>But do not </strong>
<strong>touch this</strong>
</p>"""

# name the parser explicitly; 'lxml' wraps the fragment in <html><body>
soup = bs(html, 'lxml')
# note that I use the 0th <p> block ...[0],
# so make the appropriate change in your code
# drop the newlines between the tags and keep the rest of the text
s = soup.find_all('p')[0].text.replace('\n', '')
s = '<p><strong>' + s + '</strong></p>'
print(s)  # prints: <p><strong>This is a line which I want to join.</strong></p>

Then use replace_with():

p_tag = soup.p
p_tag.replace_with(bs(s, 'html.parser'))
print(soup)

This prints:

<html><body><p><strong>This is a line which I want to join.</strong></p>
<p>
<strong>But do not </strong>
<strong>touch this</strong>
</p></body></html>
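
The snippet above hard-codes the first <p>. As a rough sketch of how the same idea might be applied to every <p> while leaving mixed paragraphs alone, something like the following could work. The names join_strong_only and SAMPLE are illustrative and not part of the answer, and html.parser is chosen here only so that no <html><body> wrapper is added.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Illustrative input modelled on the two paragraphs in the question.
SAMPLE = """<p>
    <strong>This is </strong>
    <strong>a lin</strong>
    <strong>e which I want to </strong>
    <strong>join.</strong>
</p>
<p>
    2.
    <strong>But do not </strong>
    <strong>touch this</strong>
    <em>Maybe some other tags as well.</em>
    bla bla blah...
</p>"""


def join_strong_only(soup):
    """Collapse each <p> whose only children are <strong> tags into one tag."""
    for p in soup.find_all('p'):
        tags = [c for c in p.children if isinstance(c, Tag)]
        # Text (or comments) sitting directly inside <p>, outside any tag.
        loose_text = ''.join(c for c in p.children if not isinstance(c, Tag))
        if len(tags) < 2 or loose_text.strip():
            continue
        if any(t.name != 'strong' for t in tags):
            continue
        merged = soup.new_tag('strong')
        # get_text() flattens any markup nested inside the <strong> tags.
        merged.string = ''.join(t.get_text() for t in tags)
        p.clear()
        p.append(merged)


soup = BeautifulSoup(SAMPLE, 'html.parser')
join_strong_only(soup)
paras = soup.find_all('p')
print(paras[0])
```

Building the replacement with new_tag() rather than by string concatenation lets BeautifulSoup take care of escaping the text.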

For more on python - combining multiple tags with lxml, see the similar question on Stack Overflow: https://stackoverflow.com/questions/30836928/
