python - 哪种编码?由波浪号 ~ 和花括号 {} 括起来的字符串

标签 python encoding character-encoding cjk

APNIC Whois 数据库包含许多中国实体的条目,这些条目使用 ~{~} 括起来的某种编码。例如:

$ whois 211.68.92.0 | grep ^descr:
descr:          ~{146{J5QiJR;y4!?FQ'QP>?T:~}(~{VP9z~})
descr:          Bell Labs Research China
descr:          Beijing 100080, China

有人知道这是什么吗?某种编码?我的第一个猜测是 Punycode,但很快意识到它不会包含其中的一些特殊字符。

我也在一些网页上发现了这种编码,比如that .

出于好奇,对此进行解码会很有趣。

编辑:在 RFC 1842 中找到。

For an arbitrary mixed text with both Chinese coded text strings and ASCII text strings, we designate to two distinguishable text modes, ASCII mode and HZ mode, as the only two states allowed in the text. At any given time, the text is in either one of these two modes or in the transition from one to the other. In the HZ mode, only printable ASCII characters (0x21-0x7E) are meanful with the size of basic text unit being two bytes long.

In the ASCII mode, the size of basic text unit is one (1) byte with the exception '~~', which is the special sequence representing the ASCII character '~'. In both ASCII mode and HZ mode, '~' leads an escape sequence. However, as HZ mode has basic size of text unit being 2 bytes long, only the '~' character which appears at the first byte of the the two-byte character frame are considered as the start of an escape sequence.

The default mode is ASCII mode. Each line of text starts with the default ASCII mode. Therefore, all Chinese character strings are to be enclosed with '~{' and '~}' pair in the same text line.

The escape sequences defined are as the following:

~{       ---- escape from ASCII mode to GB2312 HZ mode
~}       ---- escape from HZ mode to ASCII mode
~~       ---- ASCII character '~' in ASCII mode
~\n      ---- line continuation in ASCII mode
~[!-z|]  ---- reserved for future HZ mode character sets

A few examples of the 7 bit representation of Chinese GB coded test taken directly from [Lee89] are listed as the following:

Example 1: (Suppose there is no line size limit.) This sentence is in ASCII. The next sentence is in GB.~{<:Ky2;S{#,NpJ)l6HK!#~}Bye.

Example 2: (Suppose the maximum line size is 42.) This sentence is in ASCII. The next sentence is in GB.~{<:Ky2;S{#,~}~ ~{NpJ)l6HK!#~}Bye.

Example 3: (Suppose a new line is started for every mode switch.) This sentence is in ASCII. The next sentence is in GB.~ ~{<:Ky2;S{#,NpJ)l6HK!#~}~ Bye.

我将如何在 python3 中对此进行解码?

最佳答案

正如 OP 发现的那样,编码是 RFC1842 中定义的混合 ASCII 和中文文本的 HZ 编码。 .

codecs标准库中的模块提供此编码为“hz”,别名为“hzgb”、“hz-gb”和“hz-gb-2312”。

>>> s = "~{146{J5QiJR;y4!?FQ'QP>?T:~}(~{VP9z~})"
>>> bs = s.encode('ascii')
>>> bs.decode('hz')  
'贝尔实验室基础科学研究院(中国)'

关于python - 哪种编码?由波浪号 ~ 和花括号 {} 括起来的字符串,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/57815990/

相关文章:

python - 如何创建一个等于 15 分钟前的 DateTime?

php - 如何在 mysql 中存储 html_entites 作为它们的表示而不是它们的值

php - = 登录 Gmail 邮件正文

mysql - 法语/英语和日语的 SQL 条目

python - 使用 Python 在 Linux 中将文件写入 U 盘?

python - 使用 SVD 进行降维,指定要保持的能量

python - 如何使用装饰器类来装饰实例方法?

encoding - BCD格式在编程中如何使用?

java - InputStreamReader 缓冲问题

jquery - 如何在 .getJSON jQuery 中设置编码