python - Reading a downloaded HTML file with Pandas

As the title says, I tried using read_html but got the following error:

In [17]:temp = pd.read_html('C:/age0.html',flavor='lxml')
  File "<string>", line unknown
XMLSyntaxError: htmlParseStartTag: misplaced <html> tag, line 65, column 6

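What am I doing wrong?

Update 01

The HTML contains some JavaScript at the top, followed by an HTML table. In R I handled it by parsing the HTML with the XML package, which gave me a data frame. I want to do the same in Python. Should I run it through something else, such as BeautifulSoup, before handing it to pandas?

Best answer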

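I think you're on the right track with an HTML parser like Beautiful Soup. pandas.read_html() is meant to read HTML tables rather than whole HTML pages, and here the lxml parser is choking on the markup around the table (the misplaced <html> tag at line 65), so pull the table out first.

You'll want to do something like this: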

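from io import StringIO

from bs4 import BeautifulSoup
import pandas as pd

# Parse the whole page leniently, then keep only the first <table> element
with open('C:/age0.html', 'r') as f:
    table = BeautifulSoup(f.read(), 'html.parser').find('table')

# read_html takes markup text (or a path, URL, or file-like object), not a
# BeautifulSoup object, and it returns a list of DataFrames
df = pd.read_html(StringIO(str(table)))[0]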
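
If read_html still rejects the isolated table, one fallback is to collect the cells by hand and build the DataFrame directly. The following is only a sketch under assumptions: SAMPLE_HTML is a made-up stand-in for the downloaded file (a script block followed by a table), table_to_frame is a hypothetical helper name, and the first row of the table is assumed to be the header.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stand-in for the downloaded file: a script block, then a table.
SAMPLE_HTML = """
<script>var x = 1;</script>
<html><body>
<table>
  <tr><th>age</th><th>count</th></tr>
  <tr><td>0</td><td>12</td></tr>
  <tr><td>1</td><td>15</td></tr>
</table>
</body></html>
"""


def table_to_frame(html):
    """Build a DataFrame from the first <table>, treating its first row as the header."""
    soup = BeautifulSoup(html, "html.parser")  # stdlib-backed, lenient parser
    table = soup.find("table")
    if table is None:
        raise ValueError("no <table> element found")
    # One list of cell texts per <tr>; header and data cells are both collected.
    rows = [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]
    header, body = rows[0], rows[1:]
    frame = pd.DataFrame(body, columns=header)
    # Cells arrive as text; convert the columns that are entirely numeric.
    for col in frame.columns:
        try:
            frame[col] = pd.to_numeric(frame[col])
        except (ValueError, TypeError):
            pass  # leave non-numeric columns as strings
    return frame


df = table_to_frame(SAMPLE_HTML)
print(df)
```

This skips read_html entirely, so it only needs bs4 and pandas, but it does not handle nested tables or cells with rowspan/colspan; for those, read_html on the extracted table is the better choice.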

Regarding "python - Reading a downloaded HTML file with Pandas", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/25056120/
