python - BeautifulSoup returns spaced-out text after parsing

Tags: python html parsing web-scraping beautifulsoup

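I am scraping a local HTML document. However, when I parse it with Beautiful Soup, the HTML comes back in an ugly format (see the image below) that I cannot work with.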

The simple code I am using is:

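import os
from bs4 import BeautifulSoup

path = 'alerts/myfile.htm'
# Open the local file and hand it to BeautifulSoup's built-in parser
with open(os.path.abspath(path)) as file:
    parser = BeautifulSoup(file, 'html.parser')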

This is driving me crazy. Has anyone run into the same problem? Thanks.

Best answer

It looks like the original file is encoded as UTF-16.
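To illustrate why UTF-16 input can show up as "spaced" text, here is a minimal sketch. It assumes the bytes end up being read with a single-byte codec; `latin-1` is chosen here purely for illustration and is not taken from the question. Each ASCII character in UTF-16LE is followed by a zero byte, and those zero bytes survive as NUL characters between the letters.

```python
# Encode a small HTML snippet as UTF-16LE, the encoding the answer suspects.
snippet = "<div>hello</div>"
raw = snippet.encode("utf-16-le")

# Misread the bytes with a single-byte codec (latin-1 is an assumption here).
misread = raw.decode("latin-1")

print(raw[:8])            # b'<\x00d\x00i\x00v\x00'
print(len(snippet), len(misread))

# Removing the NUL characters recovers the original markup.
print(misread.replace("\x00", "") == snippet)
```

Many terminals and viewers render the NUL characters as blanks, which would match the spaced-out output described in the question.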

For whatever reason, BeautifulSoup(..., from_encoding='utf-16le') does not handle this case, but you can work around it by reading and decoding the file yourself before passing it to BS.

See the transcript below, in which I create a UTF-16LE HTML file, dump its contents, try passing it directly to BS4, and finally apply the workaround described above.

$ echo '<html><div>hello</div></html>' | iconv -f utf-8 -t utf-16le > y.html
$ file y.html
$ xxd y.html
00000000: 3c00 6800 7400 6d00 6c00 3e00 3c00 6400  <.h.t.m.l.>.<.d.
00000010: 6900 7600 3e00 6800 6500 6c00 6c00 6f00  i.v.>.h.e.l.l.o.
00000020: 3c00 2f00 6400 6900 7600 3e00 3c00 2f00  <./.d.i.v.>.<./.
00000030: 6800 7400 6d00 6c00 3e00 0a00            h.t.m.l.>...
$ python
>>> import bs4
>>> s = bs4.BeautifulSoup(open('y.html'))
&lt;html&gt;&lt;div&gt;hello&lt;/div&gt;&lt;/html&gt;
>>> s = bs4.BeautifulSoup(open('y.html'), from_encoding='utf-16le')
&lt;html&gt;&lt;div&gt;hello&lt;/div&gt;&lt;/html&gt;
>>> s = bs4.BeautifulSoup(open('y.html'), 'html.parser', from_encoding='utf-16le')
&lt;html&gt;&lt;div&gt;hello&lt;/div&gt;&lt;/html&gt;
>>> d = open('y.html', 'rb').read().decode('utf-16le')
>>> d
'<html><div>hello</div></html>\n'
>>> s = bs4.BeautifulSoup(d)
>>> s
<html><div>hello</div></html>
>>>
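The workaround in the transcript can be wrapped in a small helper, sketched below. The function name `read_html_text` and its `default_encoding` parameter are hypothetical, introduced only for this example. A plausible explanation for the failed `from_encoding` attempts (not stated in the answer) is that the file was opened in text mode, so BeautifulSoup received an already-decoded string and ignored the hint. Reading the file in binary mode avoids that.

```python
import codecs


def read_html_text(path, default_encoding="utf-16-le"):
    """Read a file as bytes and decode it to str before parsing.

    Hypothetical helper: the name and the default encoding are
    assumptions for illustration, not part of the original answer.
    """
    with open(path, "rb") as fh:
        raw = fh.read()
    # If a UTF-16 byte-order mark is present, the generic "utf-16"
    # codec uses it to pick the byte order and strips it.
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return raw.decode("utf-16")
    # No BOM: fall back to the caller-supplied encoding.
    return raw.decode(default_encoding)
```

The returned string can then be passed to the parser as in the transcript, for example `bs4.BeautifulSoup(read_html_text('y.html'), 'html.parser')`. On Python 3, `open('y.html', encoding='utf-16-le')` is another way to get correctly decoded text.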

Source: the question "python - BeautifulSoup returns spaced-out text after parsing" on Stack Overflow: https://stackoverflow.com/questions/54925083/
