The APNIC Whois database contains many entries for Chinese entities that use some kind of encoding enclosed in ~{ and ~}. For example:
$ whois 211.68.92.0 | grep ^descr:
descr: ~{146{J5QiJR;y4!?FQ'QP>?T:~}(~{VP9z~})
descr: Bell Labs Research China
descr: Beijing 100080, China
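Does anyone know what this is? Some kind of encoding? My first guess was Punycode, but I quickly realized that Punycode would not contain some of these special characters.
I have also come across this encoding on some web pages, such as that one.
Out of curiosity, it would be interesting to decode it.
Edit: found it in RFC 1842.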
For an arbitrary mixed text with both Chinese coded text strings and ASCII text strings, we designate to two distinguishable text modes, ASCII mode and HZ mode, as the only two states allowed in the text. At any given time, the text is in either one of these two modes or in the transition from one to the other. In the HZ mode, only printable ASCII characters (0x21-0x7E) are meanful with the size of basic text unit being two bytes long.
In the ASCII mode, the size of basic text unit is one (1) byte with the exception '~~', which is the special sequence representing the ASCII character '~'. In both ASCII mode and HZ mode, '~' leads an escape sequence. However, as HZ mode has basic size of text unit being 2 bytes long, only the '~' character which appears at the first byte of the the two-byte character frame are considered as the start of an escape sequence.
The default mode is ASCII mode. Each line of text starts with the default ASCII mode. Therefore, all Chinese character strings are to be enclosed with '~{' and '~}' pair in the same text line.
The escape sequences defined are as the following:
~{ ---- escape from ASCII mode to GB2312 HZ mode
~} ---- escape from HZ mode to ASCII mode
~~ ---- ASCII character '~' in ASCII mode
~\n ---- line continuation in ASCII mode
~[!-z|] ---- reserved for future HZ mode character sets
A few examples of the 7 bit representation of Chinese GB coded test taken directly from [Lee89] are listed as the following:
Example 1: (Suppose there is no line size limit.)
This sentence is in ASCII.
The next sentence is in GB.~{<:Ky2;S{#,NpJ)l6HK!#~}Bye.

Example 2: (Suppose the maximum line size is 42.)
This sentence is in ASCII.
The next sentence is in GB.~{<:Ky2;S{#,~}~
~{NpJ)l6HK!#~}Bye.

Example 3: (Suppose a new line is started for every mode switch.)
This sentence is in ASCII.
The next sentence is in GB.~
~{<:Ky2;S{#,NpJ)l6HK!#~}~
Bye.
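How would I decode this in Python 3?
Best answer
As the OP discovered, this is the HZ encoding for mixed ASCII and Chinese text, defined in RFC 1842.
The codecs module in the standard library provides this encoding as "hz", with the aliases "hzgb", "hz-gb" and "hz-gb-2312".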
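As a quick sanity check, the names listed above can be passed to codecs.lookup(), which should resolve all of them to the same codec. This is only a small sketch; the loop and variable names are illustrative.

```python
import codecs

# Codec names as listed above; lookup() normalizes case and hyphens.
names = ("hz", "hzgb", "hz-gb", "hz-gb-2312")

for name in names:
    info = codecs.lookup(name)  # raises LookupError for an unknown name
    print(f"{name!r} resolves to codec {info.name!r}")
```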
>>> s = "~{146{J5QiJR;y4!?FQ'QP>?T:~}(~{VP9z~})"
>>> bs = s.encode('ascii')
>>> bs.decode('hz')
'贝尔实验室基础科学研究院(中国)'
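To see what the codec is doing, the two-mode scheme from the RFC excerpt quoted in the question can be sketched by hand. This is a minimal illustration, not a replacement for the built-in codec: hz_decode_sketch is a hypothetical helper name, it assumes pure ASCII input, and it does not reset to ASCII mode at line breaks or handle the reserved escape sequences. In HZ mode each two-byte unit becomes a GB2312 character once the high bit is set on both bytes.

```python
def hz_decode_sketch(text: str) -> str:
    """Hand-rolled HZ decoder (hypothetical helper, for illustration only)."""
    data = text.encode("ascii")  # HZ is 7-bit; non-ASCII input is rejected here
    out = []
    hz_mode = False              # text starts in ASCII mode
    i = 0
    while i < len(data):
        pair = data[i:i + 2]
        if hz_mode:
            if pair == b"~}":
                hz_mode = False  # escape back to ASCII mode
            elif len(pair) == 2:
                # Set the high bit on both bytes to get the GB2312 (EUC-CN) form.
                gb = bytes(b | 0x80 for b in pair)
                out.append(gb.decode("gb2312"))
            else:
                raise ValueError("truncated two-byte unit in HZ mode")
            i += 2
        elif pair == b"~{":
            hz_mode = True       # escape into HZ mode
            i += 2
        elif pair == b"~~":
            out.append("~")      # literal tilde
            i += 2
        elif pair == b"~\n":
            i += 2               # line continuation: drop both characters
        elif pair[:1] == b"~":
            raise ValueError(f"unsupported escape sequence at index {i}")
        else:
            out.append(chr(data[i]))
            i += 1
    return "".join(out)


sample = "~{146{J5QiJR;y4!?FQ'QP>?T:~}(~{VP9z~})"
decoded = hz_decode_sketch(sample)
print(decoded)

# The sketch should agree with the standard library codec on this input.
assert decoded == sample.encode("ascii").decode("hz")
```

In practice bytes.decode('hz') is the better choice; the sketch only makes the mode switching visible.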
About "python - Which encoding? Strings enclosed by tilde ~ and curly braces {}": we found a similar question on Stack Overflow: https://stackoverflow.com/questions/57815990/