python - urllib2, Google App Engine, and Unicode issues

Hey guys, I'm just learning Google App Engine, so I'm running into a bunch of problems...

My current predicament is this. I have a datastore model:

class Website(db.Model):
    web_address = db.StringProperty()
    company_name = db.StringProperty()
    content = db.TextProperty()
    div_section = db.StringProperty()
    local_links = db.StringProperty()
    absolute_links = db.BooleanProperty()
    date_updated = db.DateTimeProperty()

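The problem I'm running into is with the content property.

I'm using db.TextProperty() because I need to store web page contents larger than 500 bytes.

The problem is that the output of urllib2's readlines() is formatted as unicode. When it is put into a TextProperty(), it gets converted to ASCII. Some of the characters are above 128, so it throws a UnicodeDecodeError.

Is there a simple way around this? For the most part, I don't care about those characters...

My error is: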

Traceback (most recent call last):
  File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/webapp/__init__.py", line 511, in __call__
    handler.get(*groups)
  File "/base/data/home/apps/game-job-finder/1.346504560470727679/main.py", line 61, in get
    x.content = website_data_joined
  File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/db/__init__.py", line 542, in __set__
    value = self.validate(value)
  File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/db/__init__.py", line 2407, in validate
    value = self.data_type(value)
  File "/base/python_runtime/python_lib/versions/1/google/appengine/api/datastore_types.py", line 1006, in __new__
    return super(Text, cls).__new__(cls, arg, encoding)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2124: ordinal not in range(128)

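Best answer

It looks like the lines returned from readlines are not unicode strings but byte strings (i.e. str instances that may contain non-ASCII characters). These bytes are the raw data received in the HTTP response body, and they represent different strings depending on the encoding used. They need to be "decoded" before they can be treated as text (bytes != characters).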
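To make the bytes-versus-characters distinction concrete, here is a minimal sketch that needs neither App Engine nor a network connection. The sample byte string is made up for illustration; it simply contains the same 0xc2 byte that appears in the traceback.

```python
# Hypothetical sample: raw bytes as they might arrive in an HTTP response body.
# \xc3\xa9 is UTF-8 for e-acute, \xc2\xa9 is UTF-8 for the copyright sign.
raw = b'caf\xc3\xa9 \xc2\xa9 2010'

# Decoding with the right encoding turns the bytes into a unicode string.
text = raw.decode('utf-8')

# Decoding as ASCII fails on the first byte >= 0x80, which is the same
# kind of failure the datastore reports in the traceback.
try:
    raw.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
```

Note that the decoded text is shorter than the raw data: several bytes can make up a single character.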

If the encoding is UTF-8, this code should work properly:

import urllib2
from google.appengine.ext import db

f = urllib2.urlopen('http://www.google.com')
website = Website()
website.content = db.Text(f.read(), encoding='utf-8-sig')    # 'sig' deals with BOM if present

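Note that the actual encoding varies from site to site (sometimes even from page to page). The encoding used should be included in the Content-Type header of the HTTP response (see this question for how to get it), but if it is not, it may be included in a meta tag in the HTML header (in which case extracting it properly is a lot trickier):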

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

Note that some sites do not specify an encoding, or specify the wrong one.
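The lookup order described above (Content-Type header first, then the meta tag, then a guess) could be sketched roughly as follows. The helper names charset_from_content_type, charset_from_meta and decode_page are hypothetical, the regex is a simplification that will miss unusual markup, and the UTF-8 fallback is an assumption rather than a rule.

```python
import re

def charset_from_content_type(content_type):
    """Return the charset named in a Content-Type value, or None."""
    if not content_type:
        return None
    match = re.search(r'charset\s*=\s*["\']?([\w.:-]+)', content_type, re.I)
    return match.group(1).lower() if match else None

def charset_from_meta(html_bytes):
    """Look for a charset declaration near the start of the raw HTML."""
    head = html_bytes[:2048]  # meta tags normally sit near the top of the page
    match = re.search(br'charset\s*=\s*["\']?([\w.:-]+)', head, re.I)
    return match.group(1).decode('ascii').lower() if match else None

def decode_page(html_bytes, content_type=None, fallback='utf-8'):
    """Decode a page using the header, then the meta tag, then a fallback."""
    encoding = (charset_from_content_type(content_type)
                or charset_from_meta(html_bytes)
                or fallback)
    try:
        # 'replace' keeps going when the declared encoding turns out to be wrong.
        return html_bytes.decode(encoding, 'replace')
    except LookupError:
        # The site declared an encoding name that Python does not know.
        return html_bytes.decode(fallback, 'replace')
```

With urllib2, the header value would typically come from something like f.info().getheader('Content-Type'), and the result of decode_page is a unicode string that can be passed to db.Text without an encoding argument.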

If you really don't care about any characters besides ASCII, you can just ignore them and be done with it:

import urllib2
from google.appengine.ext import db

f = urllib2.urlopen('http://www.google.com')
website = Website()
content = unicode(f.read(), errors='ignore')    # Ignore characters that cause errors
website.content = db.Text(content)    # Don't need to specify an encoding since content is already a unicode string
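Before settling on 'ignore', it may help to see what it actually does to the data. The sketch below uses a made-up byte string and compares 'ignore' with the 'replace' error handler. Both are standard codec error handlers.

```python
# Hypothetical sample: \xc2\xa3 is the UTF-8 encoding of the pound sign.
raw = b'Price: \xc2\xa3100'

# 'ignore' silently drops every byte that is not valid ASCII.
ignored = raw.decode('ascii', 'ignore')

# 'replace' substitutes U+FFFD for each bad byte, so the loss stays visible.
replaced = raw.decode('ascii', 'replace')
```

Here ignored loses the currency symbol entirely, which changes the meaning of the text, so 'ignore' is only safe when the dropped characters really do not matter.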

A similar question about python - urllib2, Google App Engine, and Unicode issues can be found on Stack Overflow: https://stackoverflow.com/questions/4290125/
