python - How do I concatenate two html file bodies with BeautifulSoup?

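I need to concatenate the bodies of two html files into one html file, with a bit of arbitrary html as a separator in between. I have code that used to do this, but it stopped working when I upgraded from Xubuntu 11.10 (or was it 11.04?) to 12.10, probably because of a BeautifulSoup update (I'm currently using 3.2.1; I don't know which version I had before) or a vim update (I use vim to auto-generate the html files from plain-text ones). This is the stripped-down version of the code: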

from BeautifulSoup import BeautifulSoup
soup_original_1 = BeautifulSoup(''.join(open('test1.html')))
soup_original_2 = BeautifulSoup(''.join(open('test2.html')))
contents_1 = soup_original_1.body.renderContents()
contents_2 = soup_original_2.body.renderContents()
contents_both = contents_1 + "\n<b>SEPARATOR\n</b>" + contents_2
soup_new = BeautifulSoup(''.join(open('test1.html')))
while len(soup_new.body.contents):
    soup_new.body.contents[0].extract()
soup_new.body.insert(0, contents_both)

The bodies of the two input files for the test case are very simple: contents_1 is '\n<pre>\nFile 1\n</pre>\n' and contents_2 is '\n<pre>\nFile 2\n</pre>\n'.

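I would like soup_new.body.renderContents() to be the concatenation of those two with the separator text in between, but instead every < is changed into &lt; and so on. The desired result is '\n<pre>\nFile 1\n</pre>\n\n<b>SEPARATOR\n</b>\n<pre>\nFile 2\n</pre>\n', which is what I got before the OS update. The current result is '\n&lt;pre&gt;\nFile 1\n&lt;/pre&gt;\n\n&lt;b&gt;SEPARATOR\n&lt;/b&gt;\n&lt;pre&gt;\nFile 2\n&lt;/pre&gt;\n', which is pretty useless.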

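How do I make BeautifulSoup stop turning < into &lt; (and so on) when I insert html as a string into a soup object's body? Or should I be doing this in an entirely different way? (This is my only experience with BeautifulSoup, and with most other html parsing, so I'm guessing this may well be the case.)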

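The html files are auto-generated from plain-text files with vim (the real cases I use are obviously more complicated and involve custom syntax highlighting, which is why I'm doing it this way at all). The full test1.html file looks like this; test2.html is identical except for the contents and the title.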

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<title>~/programs/lab_notebook_and_printing/concatenate-html_problem_2013/test1.txt.html</title>
<meta name="Generator" content="Vim/7.3" />
<meta name="plugin-version" content="vim7.3_v10" />
<meta name="syntax" content="none" />
<meta name="settings" content="ignore_folding,use_css,pre_wrap,expand_tabs,ignore_conceal" />
<style type="text/css">
pre { white-space: pre-wrap; font-family: monospace; color: #000000; background-color: #ffffff; white-space: pre-wrap; word-wrap: break-word }
body { font-family: monospace; color: #000000; background-color: #ffffff; font-size: 0.875em }
</style>
</head>
<body>
<pre>
File 1
</pre>
</body>
</html>

Best answer

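Reading HTML as text just to insert it back into HTML, and fighting the encoding and decoding in both directions, creates a lot of extra work that is very difficult to get right.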
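To see where the escaping comes from, here is a minimal sketch. It assumes Beautiful Soup 4 (the `bs4` package) and Python's built-in `html.parser`, neither of which is what the question's code uses. A plain `str` appended to a tag is stored as a text node rather than parsed as markup, so its angle brackets are escaped when the tree is serialized.

```python
from bs4 import BeautifulSoup

# Assumption: Beautiful Soup 4 with the standard-library parser.
soup = BeautifulSoup("<html><body></body></html>", "html.parser")

# A plain str is stored as a text node, not parsed as markup.
soup.body.append("<b>SEPARATOR</b>")

escaped = soup.body.decode_contents()
print(escaped)  # the angle brackets come out as &lt; and &gt;

# No <b> element was created in the tree.
print(soup.body.find("b"))
```

Treating markup as text in this way is what produces the `&lt;` output described in the question.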

The easy thing to do is just not do that. You want to insert everything in the body of test2 after everything in the body of test1, right? So just do that:

# list() copies the children first: append() moves each node out of soup_original_2
for element in list(soup_original_2.body.contents):
    soup_original_1.body.append(element)

To append a separator first, just do the same thing with the separator:

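# new_tag() is part of Beautiful Soup 4 (the bs4 package)
b = soup_original_1.new_tag('b')
b.append('SEPARATOR')
soup_original_1.body.append(b)
# list() copies the children first: append() moves each node out of soup_original_2
for element in list(soup_original_2.body.contents):
    soup_original_1.body.append(element)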

That's it.

See the "Modifying the tree" section of the documentation for a tutorial that covers all of this.
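Putting the pieces together, the following is a sketch of the whole operation rather than a definitive implementation. It assumes Beautiful Soup 4 and the built-in `html.parser`. The helper name `merge_bodies` and the inline test documents are made up for illustration. Because the question asks for arbitrary html as the separator, the separator is parsed as its own fragment instead of being built with `new_tag`.

```python
from bs4 import BeautifulSoup


def merge_bodies(html_1, html_2, separator_html):
    """Append the separator and the body contents of html_2 to the body of html_1."""
    soup_1 = BeautifulSoup(html_1, "html.parser")
    soup_2 = BeautifulSoup(html_2, "html.parser")
    separator = BeautifulSoup(separator_html, "html.parser")

    # Copy both child lists before moving anything: append() detaches a node
    # from its old tree, which would change a list that is being iterated.
    nodes = list(separator.contents) + list(soup_2.body.contents)
    for node in nodes:
        soup_1.body.append(node)
    return soup_1


# Hypothetical stand-ins for test1.html and test2.html.
doc_1 = "<html><head><title>one</title></head><body>\n<pre>\nFile 1\n</pre>\n</body></html>"
doc_2 = "<html><head><title>two</title></head><body>\n<pre>\nFile 2\n</pre>\n</body></html>"

merged = merge_bodies(doc_1, doc_2, "\n<b>SEPARATOR\n</b>")
result = merged.body.decode_contents()
print(result)
```

The head of the first document (title, style rules, meta tags) is kept unchanged, which matches the question's approach of reusing test1.html as the shell. With real files, the strings would come from reading test1.html and test2.html.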

This question and answer come from a similar question on Stack Overflow: https://stackoverflow.com/questions/20132458/
