python - How to get the raw characters with Python?

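I'm building a personal RSS reader with lxml's etree, but I'm having trouble converting text back to its original characters. I expect to see "World Cup 2014: With Júlio César’s Help":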

from lxml import etree

url = 'http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml'
xml = etree.parse(url)
for x in xml.findall('.//item'):
    text = x.find('.//description').text
    print text
    # 'World Cup 2014: With J\xfalio C\xe9sar\u2019s Help'
    text = text.encode('utf-8')
    print text
    # 'World Cup 2014: With J\xfalio C\xe9sar\u2019s Help'
    text = text.decode('utf-8')
    # Error: 'UnicodeEncodeError: 'ascii' codec can't encode character....'

I've read Python's Unicode HOWTO and Joel's Unicode intro, but I must be missing something.

Edit: almost there, many thanks to unutbu... I just need help converting the \u2019:

content = 'World Cup 2014: With J\xfalio C\xe9sar\u2019s Help'
html = LH.fromstring(content)
text = html.text_content()
print text
print(repr(text))
print text.encode('utf-8')

##RESULTS##
World Cup 2014: With Júlio César\u2019s Help
u'World Cup 2014: With J\xfalio C\xe9sar\\u2019s Help'
World Cup 2014: With Júlio César\u2019s Help

Best answer

Just before the UnicodeEncodeError, I believe text was a unicode object:

text = u'World Cup 2014: With J\xfalio C\xe9sar\u2019s Help'
text = text.decode('utf-8')

This reproduces the error message:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xfa' in position 22: ordinal not in range(128)


In Python2, lxml sometimes returns str for text, and sometimes unicode. If you run this script, you'll see this unfortunate behaviour:

import lxml.etree as ET
import urllib2

url = 'http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml'
xml = ET.parse(urllib2.urlopen(url))
for x in xml.findall('.//item'):
    text = x.find('.//description').text
    print(type(text))

This prints:

<type 'str'>
<type 'str'>
<type 'str'>
<type 'unicode'>
<type 'str'>
<type 'unicode'>
...

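However, it returns str only when the text consists entirely of plain ASCII values (that is, byte values between 0 and 127).

Although in general you should not encode a str, encoding a str made up only of byte values in the 0-127 (ASCII) range with utf-8 leaves the str unchanged.

So you can in fact treat str and unicode the same way by encoding both with utf-8, as though text were always unicode.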
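A minimal sketch of the same idea for Python 3, where no implicit conversion exists and the type check has to be written out. This is an illustration rather than the answer's own code: the helper name to_utf8 is hypothetical, and it assumes any bytes it receives are already ASCII or UTF-8.

```python
def to_utf8(value):
    """Return UTF-8 bytes for either bytes or str input (hypothetical helper)."""
    if isinstance(value, bytes):
        # Assumption: the bytes are already ASCII/UTF-8, so pass them through.
        return value
    return value.encode('utf-8')


ascii_text = 'World Cup 2014'
accented = 'J\xfalio C\xe9sar\u2019s'

# For pure-ASCII text, the ASCII and UTF-8 encodings are byte-for-byte identical.
same_bytes = ascii_text.encode('ascii') == ascii_text.encode('utf-8')

encoded_ascii = to_utf8(b'World Cup 2014')
encoded_accented = to_utf8(accented)
```

Either input type comes out as UTF-8 bytes, so the calling code no longer needs to care which one lxml returned.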

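Since text is actually HTML, below I use lxml.html to reduce the HTML to its plain text content. The result may also be either str or unicode. That object, text, is then encoded before printing: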

import lxml.etree as ET
import lxml.html as LH
import urllib2

url = 'http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml'
xml = ET.parse(urllib2.urlopen(url))
for x in xml.findall('.//item'):
    content = x.find('.//description').text
    html = LH.fromstring(content)
    text = html.text_content()
    print(text.encode('utf-8'))

Note that in Python3, lxml always returns unicode (the Python3 str type), so purity of thought is restored.

How the UnicodeEncodeError happens:

text = u'World Cup 2014: With J\xfalio C\xe9sar\u2019s Help'
text = text.decode('utf-8')
# Error: 'UnicodeEncodeError: 'ascii' codec can't encode character....'

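First, note that this is a UnicodeEncodeError even though you asked Python to decode text. Also note that the error message says Python is trying to use the ascii codec.

This is a classic sign that the problem is related to Python2's automatic conversion between str and unicode.

Suppose text is unicode. If you call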

text.decode('utf-8')

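then you are asking Python to do a no-no: decode unicode. Rather than refusing, Python2 first silently encodes the unicode with the ascii codec and then decodes the result with utf-8. This automatic conversion between str and unicode is meant to make it convenient to work with str and unicode whose values all lie in the ASCII range, but it leads to mental unclarity: it encourages programmers to forget the difference between str and unicode, and it works only some of the time, namely when the values are in the ASCII range. When the values fall outside the ASCII range you get an error, which is what you are experiencing.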
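The implicit step can be written out by hand. The sketch below targets Python 3 (an assumption, since the code in this answer is Python 2) and performs explicitly what Python2 does silently when decode is called on a unicode object: encode with ascii first, then decode with utf-8.

```python
text = 'World Cup 2014: With J\xfalio C\xe9sar\u2019s Help'

error = None
try:
    # Hand-written equivalent of Python2's implicit behaviour for
    # unicode.decode('utf-8'): an ascii encode happens before the decode.
    text.encode('ascii').decode('utf-8')
except UnicodeEncodeError as exc:
    error = exc

# The exception records the codec and the index of the first offending character.
print(error.encoding, error.start, repr(text[error.start]))
```

The encode step fails before the utf-8 decode is ever reached, which is why a decode call ends up reporting an encode error.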

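In Python3 there is no automatic conversion between bytes and str (or, in Python2 parlance, str and unicode, respectively). Python simply raises an error when you try to encode bytes or decode a str. Mental clarity is restored, at the cost of forcing the programmer to pay attention to types. But as this question shows, that cost is unavoidable even with Python2.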
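A small Python 3 sketch of that strictness. In Python 3, str has no decode method and bytes has no encode method, so the mistaken call fails immediately (with an AttributeError) regardless of content, and mixing the two types raises a TypeError. The variable names are illustrative only.

```python
text = 'World Cup 2014'        # pure ASCII, which Python2 would have let through
data = text.encode('utf-8')    # explicit conversion from str to bytes

# Neither type offers the "wrong direction" method any more.
str_has_decode = hasattr(text, 'decode')
bytes_has_encode = hasattr(data, 'encode')

try:
    text + data                # no implicit str/bytes conversion
    mixed_ok = True
except TypeError:
    mixed_ok = False
```

Even for ASCII-only input the types cannot be mixed, so the mistake surfaces on the first run instead of on the first accented character.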

Source: a similar question on Stack Overflow, "python - How to get the raw characters with Python?": https://stackoverflow.com/questions/24497802/
