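I am trying to read the feed at the following URL:
http://www.chinanews.com/rss/scroll-news.xml
using the request module, but what I get back contains garbage such as ���� ʷ����)������(�й�)����.
Inspecting the XML, I see that the encoding is declared as <?xml version="1.0" encoding="gb2312"?>.
But when I try to set the encoding option to gb2312, I get an unknown encoding error.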
request({
url: "http://www.chinanews.com/rss/scroll-news.xml",
method: "GET",
headers: {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
"Accept-Encoding": "gzip, deflate",
"Host": "www.chinanews.com",
"Accept-Language": "en-GB,en-US;q=0.8,en;q=0.6"
},
"gzip": true,
"encoding": "utf8"
}, (err, resp, data) => {
console.log(data);
});
Is there a way to get the data regardless of its encoding? How should I approach this?
Best answer
You are missing the concept of character encoding.
const iconv = require('iconv-lite');
const request = require('request');

request({
  url: "http://www.chinanews.com/rss/scroll-news.xml",
  method: "GET",
  headers: {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Host": "www.chinanews.com",
    "Accept-Language": "zh-CN,zh;q=0.9" // languages the client accepts
  },
  gzip: true,
  encoding: null // return the body as a raw Buffer instead of a decoded string
}, (err, resp, body) => {
  if (err) throw err;
  // body holds the raw GB2312 bytes; let iconv-lite turn them into a string
  console.log(iconv.decode(body, 'gb2312'));
});
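To see why the body has to stay as raw bytes until it is decoded, here is a small sketch that needs no network access. It assumes the four bytes D6 D0 B9 FA, which are the GB2312 encoding of 中国, and it uses Node's built-in TextDecoder instead of iconv-lite. That substitution is an assumption: TextDecoder only knows the gb2312 label when Node.js is built with full ICU, as recent official binaries are.

```javascript
// GB2312 bytes for the two characters 中国 (D6 D0 = 中, B9 FA = 国).
const raw = Buffer.from([0xd6, 0xd0, 0xb9, 0xfa]);

// Decoding as UTF-8 is lossy: bytes that are not valid UTF-8 become U+FFFD.
const wrong = raw.toString('utf8');

// Re-encoding the damaged string does not bring the original bytes back.
const roundTrip = Buffer.from(wrong, 'utf8');

// Decoding with the charset the feed declares gives the intended text.
const right = new TextDecoder('gb2312').decode(raw);

console.log(JSON.stringify(wrong));  // the same kind of garbage as in the question
console.log(roundTrip.equals(raw));  // false: the damage is permanent
console.log(right);
```

This is why the request has to be made with encoding: null. Once the bytes have been turned into a string with the wrong charset, no later conversion can repair them.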
With encoding: null, body is a Buffer instance in Node.js. According to the official documentation, only
'ascii' - For 7-bit ASCII data only. This encoding is fast and will strip the high bit if set.
'utf8' - Multibyte encoded Unicode characters. Many web pages and other document formats use UTF-8.
'utf16le' - 2 or 4 bytes, little-endian encoded Unicode characters. Surrogate pairs (U+10000 to U+10FFFF) are supported.
'ucs2' - Alias of 'utf16le'.
'base64' - Base64 encoding. When creating a Buffer from a string, this encoding will also correctly accept "URL and Filename Safe Alphabet" as specified in RFC4648, Section 5.
'latin1' - A way of encoding the Buffer into a one-byte encoded string (as defined by the IANA in RFC1345, page 63, to be the Latin-1 supplement block and C0/C1 control codes).
'binary' - Alias for 'latin1'.
'hex' - Encode each byte as two hexadecimal characters.
are currently supported natively by Node.js. To use an encoding that Node.js does not support natively, use iconv, iconv-lite, or another library that provides the character mapping table. This is very similar to this answer.
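For the "regardless of its encoding" part of the question, one possible approach is to read the label from the XML declaration before decoding. The sketch below is only an illustration. The helper names sniffXmlEncoding and decodeXml are hypothetical, and it assumes the declaration is ASCII-compatible (so not UTF-16), sits within the first 200 bytes, and that TextDecoder from a full-ICU Node.js build recognises the label. It ignores byte order marks and the charset in the Content-Type header.

```javascript
// Hypothetical helper: read the encoding label from the XML declaration.
function sniffXmlEncoding(buf, fallback = 'utf-8') {
  // latin1 maps every byte to exactly one character, so the ASCII prolog survives intact.
  const head = buf.subarray(0, 200).toString('latin1');
  const m = /^<\?xml[^>]*\bencoding\s*=\s*["']([A-Za-z0-9._-]+)["']/.exec(head);
  return m ? m[1].toLowerCase() : fallback;
}

// Hypothetical helper: decode the whole buffer using the sniffed label.
// TextDecoder throws a RangeError if it does not know the label.
function decodeXml(buf) {
  return new TextDecoder(sniffXmlEncoding(buf)).decode(buf);
}

// A tiny in-memory document: an ASCII prolog followed by GB2312 bytes for 中国.
const sample = Buffer.concat([
  Buffer.from('<?xml version="1.0" encoding="gb2312"?><title>', 'latin1'),
  Buffer.from([0xd6, 0xd0, 0xb9, 0xfa]),
  Buffer.from('</title>', 'latin1'),
]);

console.log(sniffXmlEncoding(sample));
console.log(decodeXml(sample));
```

The sniffed label could equally be passed to iconv.decode(body, label) from iconv-lite, which covers more legacy encodings than TextDecoder does.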
Accept-Language indicates the languages the client accepts. en-gb stands for English (United Kingdom), not Chinese. According to RFC 7231, Chinese is zh-cn, zh.
Regarding javascript - handling multiple encoding schemes when downloading an XML feed, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/47457143/