python - HTML parsing problem with the BeautifulSoup library

I am using the BS library to parse HTML. My task is to remove everything between the head tags. So if I have <head> A lot of Crap! </head>, the result should be <head></head>. Here is the code:

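from bs4 import BeautifulSoup

raw_html = "entire_web_document_as_string"  # placeholder for the full HTML document
soup = BeautifulSoup(raw_html, "html.parser")
head = soup.head
# clear() empties <head> in place; unwrap() would drop the tag and keep its children
head.clear()
print(head)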

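This works fine. But I want the changes to show up in raw_html, the string that holds the entire HTML document. How do I reflect these changes in the original string rather than only in head? Could you share a code snippet?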

Best answer

You are essentially asking how to export an HTML string from BS's soup object.

You can do it like this:

# Python 2.7
modified_raw_html = unicode(soup)

# Python3
modified_raw_html = str(soup)
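
To tie the question and the answer together, here is a minimal end-to-end sketch for Python 3. The sample document and the choice of the built-in "html.parser" are assumptions made for illustration; they are not taken from the original post. Note that `clear()` is the call that empties a tag while keeping the tag itself, whereas `unwrap()` removes the tag and leaves its children in the tree, so serializing the soup after `unwrap()` would not give an empty `<head></head>`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document standing in for the real page.
raw_html = "<html><head><title>Old title</title></head><body><p>Hello</p></body></html>"

soup = BeautifulSoup(raw_html, "html.parser")
soup.head.clear()  # remove every child of <head>, keep the tag itself

# Python strings are immutable, so rebind the name to the serialized tree.
raw_html = str(soup)
print(raw_html)  # <html><head></head><body><p>Hello</p></body></html>
```

The original string is never modified in place; the soup is a separate tree, and `str(soup)` produces a new string that can replace the old one.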

This question and answer come from a similar question on Stack Overflow: https://stackoverflow.com/questions/27671267/
