python - Beautiful Soup turns S&P into S&P; and AT&T into AT&T;?

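I'm using BeautifulSoup 4 (4.3.2) to parse some rather messy HTML documents, and I've run into a problem: company names such as S&P (Standard and Poor's), M&S (Marks and Spencer) and AT&T come out as S&P;, M&S; and AT&T;. It apparently tries to complete the &[A-Z]+ pattern into an HTML entity, but without consulting the HTML entity lookup table, since &P; is not an HTML entity.

How do I stop it from doing this, or do I just have to match the invalid entities with a regex and change them back?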

>>> import bs4
>>> soup = bs4.BeautifulSoup('AT&T announces new plans')
>>> soup.text
u'AT&T; announces new plans'

>>> import bs4
>>> soup = bs4.BeautifulSoup('AT&TOP announces new plans')
>>> soup.text
u'AT&TOP; announces new plans'

I have tried the above on OS X 10.8.5 with Python 2.7.5 and on Scientific Linux 6 with Python 2.7.5.

Best answer

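This appears to be a bug (or a feature) in the way BeautifulSoup4 handles unknown HTML entity references. As Ignacio says in the comments above, it is probably better to pre-process the input and replace each bare "&" with its HTML entity ("&amp;").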
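The pre-processing step could be sketched roughly as follows. This is a minimal sketch, not part of the original answer: the helper name escape_bare_ampersands is hypothetical, it is written for Python 3 (html.entities), and it assumes that any named reference missing from the standard library's HTML 4 table name2codepoint should be treated as literal text.

```python
import re
from html.entities import name2codepoint

# "&", optionally followed by something shaped like a complete reference:
# a decimal ("#38;"), hexadecimal ("#x26;") or named ("amp;") reference.
_AMP = re.compile(r"&(#[0-9]+;|#[xX][0-9a-fA-F]+;|[A-Za-z][A-Za-z0-9]*;)?")


def escape_bare_ampersands(markup):
    """Hypothetical helper: escape every '&' that does not start a known reference."""
    def fix(match):
        ref = match.group(1)
        if ref is None:
            # A lone "&" (as in "AT&T"): escape it.
            return "&amp;"
        if ref.startswith("#"):
            # Numeric references are left untouched.
            return match.group(0)
        if ref[:-1] in name2codepoint:
            # Known named entity such as "&amp;" or "&nbsp;": leave untouched.
            return match.group(0)
        # Unknown name such as "&P;": escape only the ampersand.
        return "&amp;" + ref
    return _AMP.sub(fix, markup)


print(escape_bare_ampersands("AT&T announces new plans"))
```

The escaped string can then be passed to bs4.BeautifulSoup in place of the raw markup, so the parser only ever sees well-formed references.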

But if for some reason you don't want to do that, the only way I could find to fix the problem is to monkey-patch the code. This script works for me (Python 2.7.3 on Mac OS X):

import bs4

def my_handle_entityref(self, name):
    # Look the name up in Beautiful Soup's own entity table.
    character = bs4.dammit.EntitySubstitution.HTML_ENTITY_TO_CHARACTER.get(name)
    if character is not None:
        data = character
    else:
        # The original code mishandles unknown entities by appending ";":
        #     data = "&%s;" % name
        # Emit the text as it appeared in the input instead.
        data = "&%s" % name
    self.handle_data(data)

# Replace the parser's handler with the patched version.
bs4.builder._htmlparser.BeautifulSoupHTMLParser.handle_entityref = my_handle_entityref
soup = bs4.BeautifulSoup('AT&T announces new plans')
print(soup.text)
soup = bs4.BeautifulSoup('AT&TOP announces new plans')
print(soup.text)

It produces the output:

AT&T announces new plans
AT&TOP announces new plans
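If patching the parser is not an option, the regex fallback mentioned in the question could look roughly like this. It is only a sketch under stated assumptions: the helper name strip_bogus_semicolons is hypothetical, it targets Python 3, and it assumes that any "&name;" left in the extracted text whose name is absent from the standard library's name2codepoint table was produced by the behaviour described above. Text that legitimately contained such a sequence would be altered as well.

```python
import re
from html.entities import name2codepoint

# Anything shaped like a named entity reference, e.g. "&T;" or "&TOP;".
_BOGUS_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def strip_bogus_semicolons(text):
    """Hypothetical helper: drop the ';' after '&name' when name is not a known entity."""
    def fix(match):
        name = match.group(1)
        if name in name2codepoint:
            # Known entity name: leave the text as it is.
            return match.group(0)
        # Unknown name: assume the ';' was added by the parser and remove it.
        return "&" + name
    return _BOGUS_REF.sub(fix, text)


print(strip_bogus_semicolons("AT&T; announces new plans"))
```

This runs on the text after parsing (for example on soup.text), so unlike the monkey patch it leaves Beautiful Soup itself unmodified.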

You can see the problematic method here:

http://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/builder/_htmlparser.py#L81

And the offending line here:

http://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/builder/_htmlparser.py#L86

This question and its answer come from Stack Overflow: https://stackoverflow.com/questions/20612689/
