python - html2text : How to parse urls containing special characters?

I am trying to use Aaron Swartz's Python html2text library (on Python 2.7). I have not managed to use html2text() on a string containing a link whose URL has special characters. For example:

# -*- coding: utf-8 -*-
import html2text
# The non-ASCII character is inside the URL
s = u'Link <a href="https://en.wikipedia.org/wiki/Málaga">here</a>!'
text = html2text.html2text(s)  # named "text" to avoid shadowing the built-in str

fails with the error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 31: ordinal not in range(128)

whereas:

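# -*- coding: utf-8 -*-
import html2text
# The non-ASCII characters are only in the link text
s = u'<a href="https://en.wikipedia.org/wiki/Malaga">héré</a>!'
text = html2text.html2text(s)  # named "text" to avoid shadowing the built-in str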

(with special characters, but only in the link text and not in the URL) works fine.

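I am surely missing something about encoding, but I could not find anything in the documentation. Is there a way to tell html2text to use a non-ASCII encoder in its URL parser?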

Best answer

You can use urllib.quote to percent-encode the non-ASCII characters (urllib.parse.quote in Python 3). Characters listed in the safe parameter are not encoded.

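# -*- coding: utf-8 -*-
# Python 2 code; the coding line is needed because the source contains a non-ASCII literal
import html2text
from urllib import quote

s = 'Link <a href="https://en.wikipedia.org/wiki/Málaga">here</a>!'
# Leave the HTML syntax characters untouched and percent-encode everything else
q = quote(s, safe=' <>="/:!')
s = html2text.html2text(q)

print q
print s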

Link <a href="https://en.wikipedia.org/wiki/M%C3%A1laga">here</a>!
Link [here](https://en.wikipedia.org/wiki/M%C3%A1laga)!
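
Because quote() is applied to the whole HTML string above, any non-ASCII character in the link text (such as "héré") would be percent-encoded as well. A possible variation, sketched here in Python 3 with only the standard library, is to encode just the href values before handing the string to html2text. The helper name quote_hrefs, the regular expression, and the chosen set of safe characters are illustrative assumptions, not part of html2text, and the pattern only handles double-quoted href attributes.

```python
import re
from urllib.parse import quote

# Hypothetical pattern: matches href="..." with double quotes only.
HREF_RE = re.compile(r'(href=")([^"]*)(")')


def quote_hrefs(html):
    """Percent-encode non-ASCII characters inside double-quoted href values."""
    def _encode(match):
        prefix, url, suffix = match.groups()
        # Keep common URL delimiters and existing %-escapes as they are.
        return prefix + quote(url, safe="/:?=&#%+@;,~") + suffix
    return HREF_RE.sub(_encode, html)


quoted = quote_hrefs('Link <a href="https://en.wikipedia.org/wiki/Málaga">héré</a>!')
print(quoted)
# Link <a href="https://en.wikipedia.org/wiki/M%C3%A1laga">héré</a>!
```

The resulting string keeps the link text readable and can then be passed to html2text.html2text() as in the answer's example.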


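You cannot have unicode characters in the href because it is formatted as a (byte) string. The error comes from html2text.HTML2Text.close, line 163: outtext = nochr.join(self.outtextlist), where nochr is unicode('') and self.outtextlist is a list of the parts of the tag: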

[u'Link ', '[', u'h\xe9r\xe9', '](https://en.wikipedia.org/wiki/Mlaga)', u'!', '\n', '']

As you can see, the item containing the href is not a unicode string.
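
To illustrate the kind of failure involved: Python 2 implicitly encodes a unicode value with the ASCII codec whenever it has to fit into a byte string. Python 3 removed that implicit step, but performing the encoding explicitly raises the same kind of error. This is only a minimal sketch of the mechanism, not html2text's own code.

```python
url = "https://en.wikipedia.org/wiki/Málaga"

failure = None
try:
    # Explicit version of the conversion Python 2 performs implicitly
    url.encode("ascii")
except UnicodeEncodeError as exc:
    # Codec name, index of the offending character, and the character itself
    failure = (exc.encoding, exc.start, exc.object[exc.start])

print(failure)
# ('ascii', 31, 'á')
```

The position, 31, matches the one reported in the question's traceback, which suggests that it is the URL alone that is being encoded when the error occurs.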

That is because in html2text.HTML2Text.handle_tag, in the function link_url, line 440, the url is formatted as a string: ']({url}{title})'.format(url=escape_md(url), title=title).
If you change the template to unicode (u']({url}{title})'), you get a unicode string in self.outtextlist:

[u'Link ', '[', u'h\xe9r\xe9', u'](https://en.wikipedia.org/wiki/Ml\xe1ga)', u'!', '\n','']

and the output for u'Link <a href="https://en.wikipedia.org/wiki/Mlága">héré</a>!' will be:

Link [héré](https://en.wikipedia.org/wiki/Mlága)!

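However, I do not recommend modifying the original code. A possible solution is to subclass HTML2Text and override link_url, but the problem is that link_url is a local function, so you would have to override the whole handle_tag method.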

Regarding python - html2text : How to parse urls containing special characters?, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/47704164/
