python - BeautifulSoup: extract tags and remove text

I am trying to use BeautifulSoup to extract the HTML tags and remove the text. Take this HTML as an example:

html_page = """
<html>
<body>
<table>
<tr class=tb1><td>Lorem Ipsum dolor Sit amet</td></tr>
<tr class=tb1><td>Consectetuer adipiscing elit</td></tr>
<tr><td>Aliquam Tincidunt mauris eu Risus</td></tr>
<tr><td>Vestibulum Auctor Dapibus neque</td></tr>
</table>
</body>
</html>
"""

The desired result is:

<html>
<body>
<table>
<tr><td></td></tr>
<tr><td></td></tr>
<tr><td></td></tr>
<tr><td></td></tr>
</table>
</body>
</html>

This is what I have so far:

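from bs4 import BeautifulSoup

def get_tags(soup):
    copy_soup = soup  # note: same object, not an actual copy
    for tag in copy_soup.find_all(True):
        tag.attrs = {}  # removes attributes of a tag
        tag.string = ''  # replaces everything inside the tag, child tags included

    return copy_soup

# The parser is not shown in the question; 'html.parser' is assumed here.
soup = BeautifulSoup(html_page, 'html.parser')
print(get_tags(soup))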

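Using tag.attrs = {} removes all tag attributes. But when I try tag.string or tag.clear(), I am left with only <html></html>. I understand that what probably happens is that, on the first iteration, tag.string or tag.clear() removes everything inside the html tag.

I am not sure how to solve this. Maybe recursively remove the text in the children first? Or am I missing a simpler approach?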
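That suspicion can be checked with a small sketch. It assumes the built-in 'html.parser' and a shortened sample document, neither of which comes from the code above; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical sample; written without whitespace between tags.
soup = BeautifulSoup('<html><body><p class="x">text</p></body></html>', 'html.parser')

tags = soup.find_all(True)  # html, body, p: collected before anything is modified
first = tags[0]             # the <html> tag comes first in document order

# Assigning .string clears the tag and then appends the new string,
# so every child of <html> is detached in this single step.
first.string = ''

result = str(soup)
print(result)
```

The later iterations still run, but they only touch tags that are no longer attached to the document.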

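Best answer

Actually, I was able to remove the text by recursively updating each tag's children. You can also update their attributes in the same recursion.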

from bs4 import BeautifulSoup
from bs4.element import NavigableString

def delete_displayed_text(element):
    """
    delete displayed text from beautiful soup tag element object recursively
    :param element: beautiful soup tag element object
    :return: beautiful soup tag element object
    """
    new_children = []
    for child in element.contents:
        if not isinstance(child, NavigableString):
            new_children.append(delete_displayed_text(child))
    element.contents = new_children
    return element

if __name__ == '__main__':
    html_code_sample = '<div class="hello">I am not supposed to be displayed<a>me neither</a></div>'
    soup = BeautifulSoup(html_code_sample, 'html.parser')
    soup = delete_displayed_text(soup)
    cleaned_soup = BeautifulSoup(str(soup), 'html.parser')
    print(cleaned_soup.getText())
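
As a possible alternative to rebuilding `contents`, the text nodes can be detached one by one with `extract()`, and the attributes cleared in a second pass. This is only a sketch under a few assumptions: the function name `strip_text_and_attrs` and the shortened sample are made up for illustration, the parser is 'html.parser', and whitespace-only strings are kept on purpose so the line breaks between tags survive.

```python
from bs4 import BeautifulSoup

def strip_text_and_attrs(soup):
    """Remove visible text and all attributes, leaving only the tag skeleton."""
    # find_all(string=True) collects the text nodes up front,
    # so detaching them afterwards does not disturb the search.
    for text_node in soup.find_all(string=True):
        if text_node.strip():  # skip whitespace-only nodes to keep the layout
            text_node.extract()
    for tag in soup.find_all(True):
        tag.attrs = {}
    return soup

# Hypothetical two-row sample modelled on the table in the question.
sample = '<table><tr class="tb1"><td>Lorem Ipsum</td></tr>\n<tr><td>Vestibulum</td></tr></table>'
skeleton = str(strip_text_and_attrs(BeautifulSoup(sample, 'html.parser')))
print(skeleton)
```

Because only the strings are removed, the tags keep their parent and sibling links, so the tree should not need to be re-parsed afterwards. Note that `find_all(string=True)` also matches comments and similar nodes, which may or may not be wanted.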

Regarding "python - BeautifulSoup: extract tags and remove text", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/41641040/
