python - Remove all style, scripts, and html tags from an html page

Tags: python html beautifulsoup

Here is what I have so far:

from bs4 import BeautifulSoup

def cleanme(html):
    soup = BeautifulSoup(html) # create a new bs4 object from the html data loaded
    for script in soup(["script"]): 
        script.extract()
    text = soup.get_text()
    return text
testhtml = "<!DOCTYPE HTML>\n<head>\n<title>THIS IS AN EXAMPLE </title><style>.call {font-family:Arial;}</style><script>getit</script><body>I need this text captured<h1>And this</h1></body>"

cleaned = cleanme(testhtml)
print(cleaned)

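I have been struggling to remove the script content.

Best answer

It looks like you almost have it. You also need to remove the html tags and the css styling code. Here is my solution (I updated the function):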

from bs4 import BeautifulSoup

def cleanMe(html):
    soup = BeautifulSoup(html, "html.parser")  # parse the html with Python's built-in parser
    for tag in soup(["script", "style"]):  # remove all javascript and stylesheet code
        tag.extract()
    # get the remaining text
    text = soup.get_text()
    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # split each line on double spaces so every phrase gets its own line
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return text
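
To see what the whitespace cleanup at the end of the function does on its own, here is a minimal sketch that applies the same three steps to a plain string, without BeautifulSoup. The helper name `normalize_whitespace` and the sample string are made up for illustration; they are not part of the original answer.

```python
def normalize_whitespace(text):
    """Apply the same cleanup steps as the end of cleanMe to a plain string."""
    # strip leading and trailing whitespace from every line
    lines = (line.strip() for line in text.splitlines())
    # split each line on double spaces, so widely spaced phrases
    # end up as separate chunks
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop empty chunks and join the rest with newlines
    return "\n".join(chunk for chunk in chunks if chunk)


# hypothetical sample, similar in shape to text extracted from a page
sample = "\n\n  THIS IS AN EXAMPLE  \n\nI need this text captured   And this\n"
print(normalize_whitespace(sample))
# THIS IS AN EXAMPLE
# I need this text captured
# And this
```

Note that the split is on two consecutive spaces, so single spaces inside a phrase are kept as they are.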

Regarding python - Remove all style, scripts, and html tags from an html page, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/30565404/
