python - Mismatched tag error when parsing XML?

I am writing a script that downloads the HTML document at http://example.com/ and tries to parse it as XML like this:

import urllib.request
import xml.etree.ElementTree

with urllib.request.urlopen("http://example.com/") as f:
    tree = xml.etree.ElementTree.parse(f)

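However, I keep getting a ParseError: mismatched tag error, reportedly at line 1, column 2781. I downloaded the file manually (Ctrl+S in the browser) and inspected it, but that position falls in the middle of a string, nowhere near EOF. There are also several lines before the actual 2781st character, which may be throwing off my calculation of the exact position. So I tried downloading the response and writing it to a file to parse afterwards: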

# Save the raw response to disk, then parse the saved copy
with urllib.request.urlopen("http://example.com/") as response:
    with open("test.html", "wb") as f:
        f.write(response.read())
with open("test.html", "r") as html:
    tree = xml.etree.ElementTree.parse(html)

I still get the same mismatched tag error at the same column, but this time I opened the downloaded HTML, and the only thing near column 2781 is:

;</script></head><body class

Column 2781 falls exactly on the first "h" of </head>, so what could be going wrong here? Am I missing something?
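To see what the parser is actually complaining about, it can help to print the text around the reported position instead of counting characters by hand. The sketch below is only an illustration: the snippet in source is made up (it is not the real page), and show_parse_error is a hypothetical helper name. It relies on ParseError.position, which holds a (line, column) tuple. One plausible cause, which the question does not confirm, is an HTML void element such as <meta> that is never closed: that is legal HTML, but XML then sees </head> as a mismatch.

```python
import xml.etree.ElementTree as ET

# Hypothetical HTML snippet (not the real page): <meta> is never closed,
# which is fine in HTML but not well-formed XML.
source = '<html><head><meta charset="utf-8"></head><body class="x"></body></html>'


def show_parse_error(text, width=20):
    """Return (message, context) for the first XML error in text, or None."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        line, column = err.position  # line is 1-based, column is 0-based
        bad_line = text.splitlines()[line - 1]
        # Slice a window of characters around the reported column
        context = bad_line[max(0, column - width):column + width]
        return str(err), context
    return None


result = show_parse_error(source)
print(result)
```

The context window shows the closing tag the parser stopped at; the element that was left open is somewhere before it.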

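Edit:

I have been digging deeper and tried parsing the XML with another parser (minidom this time), but I still get exactly the same error at the same line. What could the problem be? It happens no matter how I download the file (urllib, curl, wget, or even Ctrl+S in the browser), and the result is always identical.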
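Getting the same error from minidom is expected, because in CPython both xml.etree.ElementTree and xml.dom.minidom are built on the same expat parser. The following minimal sketch compares the expat error codes the two report for the same input. The snippet in source is a made-up example with an unclosed <br>, not the page from the question.

```python
import xml.dom.minidom
import xml.etree.ElementTree as ET
from xml.parsers import expat

# Hypothetical snippet with an unclosed <br>: valid HTML, not well-formed XML.
source = "<html><body><p>line one<br>line two</p></body></html>"

try:
    ET.fromstring(source)
    et_code = None
except ET.ParseError as err:
    et_code = err.code  # numeric expat error code

try:
    xml.dom.minidom.parseString(source)
    dom_code = None
except expat.ExpatError as err:
    dom_code = err.code  # same numbering, since minidom also uses expat

# Both parsers fail with the same code; ErrorString turns it into text
print(et_code, dom_code, expat.ErrorString(et_code))
```

Switching between XML parsers therefore does not help when the input is HTML rather than well-formed XML.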

Edit 2:

Here is what I have tried so far.

This is a sample XML document I took from the API documentation and saved as text.html:

<html>
    <head>
        <title>Example page</title>
    </head>
    <body>
        <p>Moved to <a href="http://example.org/">example.org</a>
        or <a href="http://example.com/">example.com</a>.</p>
    </body>
</html>

I tried:

# urlopen() rejects a bare file name (no URL scheme), so open the local file directly
with open("text.html", "rb") as f:
    tree = xml.etree.ElementTree.parse(f)

and it worked. Then:

# urlopen() rejects a bare file name (no URL scheme), so open the local file directly
with open("text.html", "rb") as f:
    tree = xml.etree.ElementTree.fromstring(f.read())

That works too, but:

with urllib.request.urlopen("http://example.com/") as f:
    xml.etree.ElementTree.parse(f)

does not work. I also tried:

with urllib.request.urlopen("http://example.com/") as f:
    xml.etree.ElementTree.fromstring(f.read())

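and that does not work either. What could the problem be? As far as I can tell the document has no mismatched tags, but maybe it is too large? It is only 95.2 KB.

Best answer

You can use bs4 to parse this page. Like this: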

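import urllib.request

import bs4

url = 'http://boards.4chan.org/wsg/thread/629672/i-just-lost-my-marauder-on-eve-i-need-a-ylyl'
# Optional: send the request through an HTTP proxy (drop the ProxyHandler if you have none)
proxies = {'http': 'http://www-proxy.ericsson.se:8080'}
opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxies))
with opener.open(url) as f:
    info = f.read()
# Name the parser explicitly; html.parser ships with Python and accepts non-XML HTML
soup = bs4.BeautifulSoup(info, 'html.parser')
print(soup.a)  # first <a> element in the document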

Output:

<a href="/a/" title="Anime &amp; Manga">a</a>

You can download bs4 from this link (it can also be installed with pip install beautifulsoup4).
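If adding a third-party package is not an option, the standard library's html.parser module is also tolerant of HTML that is not well-formed XML. The sketch below is an assumption-laden illustration rather than part of the answer: LinkCollector is a hypothetical class name and page is a made-up snippet. It collects the href of every <a> tag even though <meta> and <br> are never closed.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Made-up page: unclosed <meta> and <br> would break an XML parser
page = (
    '<html><head><meta charset="utf-8"></head><body>'
    '<a href="/a/" title="Anime &amp; Manga">a</a><br>'
    '<a href="/b/">b</a></body></html>'
)

collector = LinkCollector()
collector.feed(page)
collector.close()
print(collector.links)
```

This is event-based rather than tree-based, so it suits simple extraction; for navigating the document structure, bs4 as shown above is more convenient.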

Regarding "python - Mismatched tag error when parsing XML?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/30146397/
