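unicode - What is the proper way to URL-encode Unicode characters?

I know of the non-standard %uxxxx scheme, but that doesn't seem like a wise choice, since the scheme has been rejected by the W3C.

Some interesting examples:

The heart character. If I type the following into my browser: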

http://www.google.com/search?q=♥

and then copy and paste it, I see this URL

http://www.google.com/search?q=%E2%99%A5

which makes it look like Firefox (or Safari) is doing this.

urllib.quote_plus(x.encode("latin-1"))
'%E2%99%A5'

That makes sense, except for things that can't be encoded in Latin-1, like the triple-dot (ellipsis) character.

…

If I type the URL

http://www.google.com/search?q=…

into my browser and then copy and paste it, I get

http://www.google.com/search?q=%E2%80%A6

back. This seems to be the result of doing

urllib.quote_plus(x.encode("utf-8"))

which makes sense, since … can't be encoded in Latin-1.

But then it's not clear to me how the browser knows whether to decode with UTF-8 or Latin-1.

Since this seems to be ambiguous:

In [67]: u"…".encode('utf-8').decode('latin-1')
Out[67]: u'\xc3\xa2\xc2\x80\xc2\xa6'

works, so I don't know how the browser figures out whether to decode that with UTF-8 or Latin-1.
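
The same ambiguity can be reproduced in Python 3. This is only a sketch, assuming `urllib.parse.unquote_to_bytes` as a modern stand-in for the Python 2 calls above: the percent-decoded bytes are valid input for both decoders, so the bytes alone do not say which encoding was intended.

```python
from urllib.parse import unquote_to_bytes

# Percent-decoding yields raw bytes, with no encoding attached.
raw = unquote_to_bytes("%E2%80%A6")

# Both decodes succeed, but they produce different strings.
as_utf8 = raw.decode("utf-8")      # one character, U+2026 (the ellipsis)
as_latin1 = raw.decode("latin-1")  # three characters, U+00E2 U+0080 U+00A6

print(len(as_utf8), len(as_latin1))
print(as_utf8 == as_latin1)
```

Latin-1 maps every byte value to a character, so a Latin-1 decode never fails. Success under Latin-1 therefore says nothing about whether Latin-1 was the encoding actually used.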

What's the proper thing to do with the special characters I need to deal with?

Best answer

I would always encode in UTF-8. From the Wikipedia page on percent encoding:

The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values. This requirement was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.
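
The rule in that quote can be sketched by hand in Python 3. The helper name `percent_encode_utf8` is hypothetical and only for illustration. It leaves the RFC 3986 unreserved characters (letters, digits, `-`, `.`, `_`, `~`) untouched and turns everything else into UTF-8 bytes written as `%XX`.

```python
import string

# Unreserved set from RFC 3986: letters, digits, "-", ".", "_", "~".
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def percent_encode_utf8(text: str) -> str:
    """Hypothetical helper: percent-encode everything outside the unreserved set."""
    out = []
    for ch in text:
        if ch in UNRESERVED:
            # Unreserved characters are represented without translation.
            out.append(ch)
        else:
            # Everything else: convert to UTF-8 bytes, then write each byte as %XX.
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
    return "".join(out)


print(percent_encode_utf8("♥"))  # %E2%99%A5
print(percent_encode_utf8("…"))  # %E2%80%A6
```

Because this sketch also escapes reserved characters such as `/` and `?`, it suits a single component value (for example a query parameter), not a whole URL.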

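It seems that because there were other accepted ways of doing URL encoding in the past, browsers attempt several methods of decoding a URI, but if you're the one doing the encoding, you should use UTF-8.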
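
In Python 3, `urllib.parse.quote_plus` already encodes with UTF-8 by default, so the encoding side needs no extra arguments. On the decoding side, a tolerant reader might try strict UTF-8 first and fall back to Latin-1. The helper `decode_component` below is a hypothetical sketch of that idea, not a description of what any particular browser does.

```python
from urllib.parse import quote_plus, unquote_to_bytes


def decode_component(encoded: str) -> str:
    """Hypothetical helper: strict UTF-8 first, Latin-1 as a fallback."""
    # quote_plus writes spaces as "+", so undo that before percent-decoding.
    raw = unquote_to_bytes(encoded.replace("+", " "))
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: input that is not valid UTF-8 is treated as legacy Latin-1.
        return raw.decode("latin-1")


encoded = quote_plus("♥ …")  # UTF-8 is the default encoding
print(encoded)                     # %E2%99%A5+%E2%80%A6
print(decode_component(encoded))   # round-trips to the original text
print(decode_component("caf%E9"))  # a lone E9 byte is not valid UTF-8, so Latin-1 is used
```

Trying UTF-8 first is the safer order: arbitrary Latin-1 text rarely happens to form valid UTF-8 sequences, whereas any byte sequence at all decodes under Latin-1.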

Regarding "unicode - What is the proper way to URL-encode Unicode characters?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/912811/
