I have an HTML file that looks like this:
...
<p>
<strong>This is </strong>
<strong>a lin</strong>
<strong>e which I want to </strong>
<strong>join.</strong>
</p>
<p>
2.
<strong>But do not </strong>
<strong>touch this</strong>
<em>Maybe some other tags as well.</em>
bla bla blah...
</p>
...
What I need is: if all the tags inside a "p" block are "strong", combine them into a single tag, i.e.
<p>
<strong>This is a line which I want to join.</strong>
</p>
The other block should be left untouched, because it contains other things.
Any suggestions? I am using lxml.
Update:
Here is what I have tried so far:
for p in self.tree.xpath('//body/p'):
    if p.tail is None: #no text before first element
        children = p.getchildren()
        for child in children:
            if len(children)==1 or child.tag!='strong' or child.tail is not None:
                break
        else:
            etree.strip_tags(p,'strong')
With this code I was able to strip the strong tags from the desired block, giving:
<p>
This is a line which I want to join.
</p>
So now I just need a way to put the tag back...
Best answer
I was able to do this using bs4 (BeautifulSoup):
from bs4 import BeautifulSoup as bs

html = """<p>
<strong>This is </strong>
<strong>a lin</strong>
<strong>e which I want to </strong>
<strong>join.</strong>
</p>
<p>
<strong>But do not </strong>
<strong>touch this</strong>
</p>"""
# the parser is named explicitly; 'lxml' wraps the fragment in <html><body>
soup = bs(html, 'lxml')
s = ''
# note that I use the 0th <p> block ...[0],
# so make the appropriate change in your code
for t in soup.find_all('p')[0].text:
    s = s + t.strip('\n')  # drop the newlines between the tags
s = '<p><strong>' + s + '</strong></p>'
print(s)  # prints: <p><strong>This is a line which I want to join.</strong></p>
Then use replace_with():
p_tag = soup.p
p_tag.replace_with(bs(s, 'html.parser'))
print(soup)
This prints:
<html><body><p><strong>This is a line which I want to join.</strong></p>
<p>
<strong>But do not </strong>
<strong>touch this</strong>
</p></body></html>
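Since the question asked about lxml and for every "p" block rather than only the first, here is a minimal sketch of the same idea done directly with lxml. The function name merge_strong_runs is hypothetical, and the rules it applies (at least two children, every child a plain "strong" tag, no non-whitespace text directly inside the "p") are my reading of the question, not something the original poster confirmed.

```python
from lxml import etree, html

def merge_strong_runs(root):
    """Collapse each <p> that holds nothing but <strong> tags into one <strong>."""
    for p in root.xpath('//p'):  # xpath() returns a list, so editing the tree is safe
        children = list(p)
        # Need at least two children, all plain <strong> tags with no nested elements.
        if len(children) < 2:
            continue
        if any(c.tag != 'strong' or len(c) for c in children):
            continue
        # Any non-whitespace text sitting directly inside <p> means "do not touch".
        loose_text = [p.text] + [c.tail for c in children]
        if any(t and t.strip() for t in loose_text):
            continue
        merged = ''.join(c.text or '' for c in children)
        for c in children:
            p.remove(c)  # lxml removes the element together with its tail text
        p.text = '\n'
        strong = etree.SubElement(p, 'strong')
        strong.text = merged
        strong.tail = '\n'
    return root

doc = """<html><body>
<p>
<strong>This is </strong>
<strong>a lin</strong>
<strong>e which I want to </strong>
<strong>join.</strong>
</p>
<p>
2.
<strong>But do not </strong>
<strong>touch this</strong>
<em>Maybe some other tags as well.</em>
bla bla blah...
</p>
</body></html>"""

root = merge_strong_runs(html.fromstring(doc))
print(html.tostring(root, encoding='unicode'))
```

The second block is skipped for two reasons: it contains an "em" tag, and it has loose text ("2." and "bla bla blah...") directly inside the "p". Building a fresh "strong" element avoids the strip-then-rewrap step from the question's attempt.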
Source: "python - Combining multiple tags with lxml" on Stack Overflow: https://stackoverflow.com/questions/30836928/