python - Problem parsing an XHTML page with Python

Tags: python xml xhtml python-3.x

Hello, I am trying to parse a page written in XHTML with Python, but I get this error:

**xml.parsers.expat.ExpatError: unbound prefix: line 6, column 0**

[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1] mod_wsgi (pid=9156): Exception occurred processing WSGI script '/home/hidura/webapps/karinapp/Suite/Gate.py'.
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1] Traceback (most recent call last):
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1]   File "/home/hidura/webapps/karinapp/Suite/Gate.py", line 32, in application
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1]     response = assistant(buildReq.extrctEnv(environ, location))#Here the assistant takes the parameters and begins the work
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1]   File "/home/hidura/webapps/karinapp/Suite/wsgi/Utilities/Assistant/Assistant.py", line 114, in __init__
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1]     self.websearch()#Finding the web.
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1]   File "/home/hidura/webapps/karinapp/Suite/wsgi/Utilities/Assistant/Assistant.py", line 364, in websearch
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1]     websource = self.manage.string2parse(result[0][1])#Transforming the web page into a tree.
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1]   File "/home/hidura/webapps/karinapp/Suite/wsgi/Writer/tagsmanip.py", line 56, in string2parse
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1]     self.doc = parseString(newData)
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1]   File "/usr/local/lib/python3.1/xml/dom/minidom.py", line 1937, in parseString
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1]     return expatbuilder.parseString(string)
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1]   File "/usr/local/lib/python3.1/xml/dom/expatbuilder.py", line 940, in parseString
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1]     return builder.parseString(string)
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1]   File "/usr/local/lib/python3.1/xml/dom/expatbuilder.py", line 223, in parseString
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1]     parser.Parse(string, True)
[Fri Mar 25 09:58:21 2011] [error] [client 127.0.0.1] xml.parsers.expat.ExpatError: unbound prefix: line 6, column 0

This is the code of the page:

<HTML xmlns:fb="http://www.facebook.com/2008/fbml"><HEAD><TITLE id="ttl">KarinApp(Karina application web maker)</TITLE><LINK id="css_front_1" type="text/css" href="http://www.karinapp.com/modules/front/css/main.css" rel="stylesheet"/><SCRIPT type="text/javascript" id="jQuery-front" src="/modules/general/scripts/jQuery.js"><!--empty--></SCRIPT><SCRIPT type="text/javascript" id="gnrlScrpt" src="/modules/general/scripts/general.js"><!--empty--></SCRIPT><SCRIPT type="text/javascript" id="ctchScrpt" src="/modules/general/scripts/Catcher.js"><!--empty--></SCRIPT><SCRIPT type="text/javascript" id="pdloadScr" src="/modules/general/scripts/loadPage.js"><!--empty--></SCRIPT><SCRIPT type="text/javascript" id="pdLoader">window.onload = function(){postLoad();
        }
function __init__(){main();}</SCRIPT><LINK id="link1" href="/modules/front/css/jquery-ui-1.8.10.custom.css" type="text/css" rel="stylesheet"/><SCRIPT id="script5" src="/modules/front/scripts/ui/jquery.ui.core.js"><!--empty--></SCRIPT><SCRIPT id="script6" src="/modules/front/scripts/ui/jquery.ui.widget.js"><!--empty--></SCRIPT><SCRIPT id="script8" src="/modules/front/scripts/ui/jquery.ui.button.js"><!--empty--></SCRIPT><SCRIPT id="script10" src="/modules/front/scripts/main.js"><!--empty--></SCRIPT><SCRIPT id="script9"><!--empty--></SCRIPT><SCRIPT id="script11" type="text/javascript" src="http://connect.facebook.net/en_US/all.js#appId=150388711687556&amp;amp;xfbml=1"><!--empty--></SCRIPT></HEAD><BODY id="body"><IMG id="logo" father="@body" src="/modules/front/image/logo.png"/><DIV id="comments" father="@body"><!--Comment--><DIV id="fbK" father="@comments"><IFRAME src="http://www.facebook.com/plugins/likebox.php?href=http%3A%2F%2Fwww.facebook.com%2Fpages%2FKarinapp%2F150388711687556&amp;width=295&amp;colorscheme=light&amp;show_faces=false&amp;stream=true&amp;header=false&amp;height=300" scrolling="no" frameborder="1" style="border:none; overflow:hidden; width:295px; height:300px;" allowtransparency="false">&amp;lt;!--empty--&amp;gt;</IFRAME>

<LIKE-BOX href="http://www.facebook.com/pages/Karinapp/150388711687556" width="295" show_faces="false" stream="true" header="false"><!--empty--></LIKE-BOX></DIV></DIV><DIV id="head" father="@body"><!--Comment--></DIV><A id="fb" father="@body" href="http://www.facebook.com/karinapp#!/pages/Karinapp/150388711687556" border="0"><IMG src="/modules/front/image/fb.png" father="@fb"/></A><A id="tw" father="@body" href="http://www.twitter.com/#!/karinappm" border="0"><IMG src="/modules/front/image/tw.png" father="@tw"/></A><DIV id="div4" father="@body"><DIV id="fb-root"><!--empty--></DIV>
<FB:LOGIN-BUTTON xmlns:fb="http://www.facebook.com/2008/fbml" show-faces="true" width="250" max-rows="1"/></DIV></BODY></HTML>

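Thanks in advance!

Best answer

The problem is that the document declares the namespace prefix as fb, but the tag is written as FB:LOGIN-BUTTON. XML is case-sensitive, so expat treats FB as a different prefix that has never been bound to a namespace. The XHTML specification states that all HTML element and attribute names must be lowercase, precisely because XML is case-sensitive.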
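
The case-sensitivity point can be reproduced with a much smaller document. This is only a minimal sketch: the `root` element and the `try_parse` helper are made up for illustration, and only the namespace URI comes from the question.

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError

FBML_NS = "http://www.facebook.com/2008/fbml"


def try_parse(markup):
    """Illustrative helper: return None on success, else the expat error text."""
    try:
        parseString(markup)
        return None
    except ExpatError as exc:
        return str(exc)


# The prefix is declared as "fb" but used as "FB"; expat sees two different prefixes.
mismatched = '<root xmlns:fb="%s"><FB:login-button/></root>' % FBML_NS
# The same document with the prefix spelled exactly as declared.
matched = '<root xmlns:fb="%s"><fb:login-button/></root>' % FBML_NS

print(try_parse(mismatched))  # an "unbound prefix" message
print(try_parse(matched))     # None, the document parses
```

Only the spelling of the prefix differs between the two strings, which suggests the uppercase FB is what expat rejects in the original page.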

I tried your document with the lxml XML parser, and it converted the prefix to lowercase automatically. Perhaps you could switch to a different parser:

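import lxml.etree
# Read the file as bytes so lxml can work out the encoding itself.
with open('fb.xhtml', 'rb') as f:
    data = f.read()
tree = lxml.etree.fromstring(data)
# Map the prefix used in the XPath expression to the FBML namespace URI.
ns_map = {'fb': 'http://www.facebook.com/2008/fbml'}
# print() is a function in Python 3, which the question is tagged with.
print(tree.xpath('.//fb:LOGIN-BUTTON', namespaces=ns_map))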

Output:

[<Element {http://www.facebook.com/2008/fbml}LOGIN-BUTTON at 1011fa260>]
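
If switching parsers is not an option, another possible route is to normalise the prefix before handing the markup to minidom. What follows is only a rough sketch under stated assumptions: `lowercase_fb_prefix` is a hypothetical helper, not part of any library, and its regular expression blindly rewrites every `<FB:` or `</FB:` it finds, so it assumes that text never appears inside scripts or comments. The `page` string is a cut-down stand-in for the real page.

```python
import re
from xml.dom.minidom import parseString

FBML_NS = "http://www.facebook.com/2008/fbml"


def lowercase_fb_prefix(markup):
    """Hypothetical helper: rewrite <FB:...> and </FB:...> to the declared 'fb' prefix."""
    return re.sub(r"(</?)FB:", r"\1fb:", markup)


# Cut-down stand-in for the page in the question (an assumption, not the full page).
page = (
    '<HTML xmlns:fb="%s"><BODY>'
    '<FB:LOGIN-BUTTON show-faces="true" width="250"/>'
    '</BODY></HTML>' % FBML_NS
)

doc = parseString(lowercase_fb_prefix(page))
# Look the element up by namespace URI and local name rather than by prefix.
buttons = doc.getElementsByTagNameNS(FBML_NS, "LOGIN-BUTTON")
print(len(buttons), buttons[0].getAttribute("show-faces"))
```

Fixing the generator that emits the uppercase prefix would be cleaner than patching the string afterwards, if that code is under your control.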

Source: this question on Stack Overflow, https://stackoverflow.com/questions/5434288/