I'm writing a simple script to grab the large gray table from this page (the URL is in the code below).
Here is my code:
import urllib2
from lxml import etree
html = urllib2.urlopen("http://www.afi.com/100years/movies10.aspx").read()
root = etree.XML(html)
But I get an error on the last statement:
Traceback (most recent call last):
  File "D:\Workspace\afi100\afi100.py", line 13, in <module>
    root = etree.XML(html)
  File "lxml.etree.pyx", line 2720, in lxml.etree.XML (src/lxml/lxml.etree.c:52577)
  File "parser.pxi", line 1556, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:79602)
  File "parser.pxi", line 1435, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:78449)
  File "parser.pxi", line 943, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:75099)
  File "parser.pxi", line 547, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:71467)
  File "parser.pxi", line 628, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:72340)
  File "parser.pxi", line 568, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:71683)
XMLSyntaxError: Space required after the Public Identifier, line 3, column 59
Any idea how to fix this error?
Thanks.
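Accepted answer
You are trying to parse HTML with an XML parser. etree.XML requires well-formed XML, and real-world HTML pages rarely are, so use lxml's HTML parser instead.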
import urllib2
from lxml import etree
# Open the page and pass the file-like response straight to lxml.
ufile = urllib2.urlopen("http://www.afi.com/100years/movies10.aspx")
# HTMLParser tolerates markup that is not well-formed XML.
root = etree.parse(ufile, etree.HTMLParser())
print etree.tostring(root)
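To see the difference between the two parsers without hitting the network, here is a minimal sketch written for Python 3. The markup in SAMPLE_HTML and the helper names xml_parse_fails and table_rows are made up for illustration; they are not the real AFI page or part of lxml. The sample contains an unclosed <br>, which is fine in HTML but not well-formed XML.

```python
from lxml import etree

# Hypothetical sample markup standing in for the downloaded page.
# The unclosed <br> is valid HTML but not well-formed XML.
SAMPLE_HTML = b"""<html><body>
<br>
<table>
<tr><td>1</td><td>Film A</td></tr>
<tr><td>2</td><td>Film B</td></tr>
</table>
</body></html>"""


def xml_parse_fails(markup):
    """Return True if the strict XML parser rejects the markup."""
    try:
        etree.XML(markup)
    except etree.XMLSyntaxError:
        return True
    return False


def table_rows(markup):
    """Parse with the lenient HTML parser and return each table row's cell text."""
    root = etree.HTML(markup)
    return [
        [(td.text or "").strip() for td in tr.xpath("./td")]
        for tr in root.xpath("//table//tr")
    ]


if __name__ == "__main__":
    print(xml_parse_fails(SAMPLE_HTML))
    print(table_rows(SAMPLE_HTML))
```

On Python 3, urllib2 no longer exists; the bytes returned by urllib.request.urlopen(url).read() can be passed to table_rows in the same way. The XPath for the real page would likely need to be narrower than "//table//tr", depending on how that page is laid out.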
This question, "python - Parsing HTML: lxml error in Python", comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/4371004/