java - How do I convert a string of Russian Cyrillic characters?

Tags: java encoding

I am parsing mp3 tags.

String artist - I do not know what its encoding is.

Ïåñíÿ ïðî íàäåæäó - example string; the Russian text should read "Песня про надежду"

I use http://code.google.com/p/juniversalchardet/

Code:

String GetEncoding(String text) throws IOException {
        byte[] buf = new byte[4096];


        InputStream fis = new ByteArrayInputStream(text.getBytes());


        UniversalDetector detector = new UniversalDetector(null);

        int nread;
        while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
            detector.handleData(buf, 0, nread);
        }
        detector.dataEnd();
        String encoding = detector.getDetectedCharset();
        detector.reset();
        return encoding;
    }

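Conversion:

new String(text.getBytes(encoding), "cp1251"); - but this does not work.

If I use UTF-16:

new String(text.getBytes("UTF-16"), "cp1251") returns "юя П е с н я п р о н а д е ж д у", where the gaps between letters are not actual space characters.

Edit:

This is where the bytes are first read: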

byte[] abyFrameData = new byte[iTagSize];
oID3DIS.readFully(abyFrameData);
ByteArrayInputStream oFrameBAIS = new ByteArrayInputStream(abyFrameData);

String s = new String(abyFrameData, "????");

Best answer

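Java strings are UTF-16. Data in any other encoding is represented as a byte sequence. To decode character data, you must supply the encoding when you first create the string. If you already have a corrupted string, it is too late.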
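To make the point about decoding at creation time concrete, here is a minimal sketch. It assumes the tag bytes were written as windows-1251 and that the JVM provides that charset; the class and method names are hypothetical. The five bytes are the first word of the example title, "Песня". Decoding them as ISO-8859-1 gives the garbled "Ïåñíÿ" from the question. In that one special case the damage can be undone, because ISO-8859-1 maps every byte to exactly one char; with most other wrong charsets the original bytes are lost.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeAtCreation {
    // Assumption: this JVM ships the windows-1251 charset.
    static final Charset CP1251 = Charset.forName("windows-1251");

    // Decode raw tag bytes with the charset they were actually written in.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    // Undo a wrong ISO-8859-1 decode. This only works because ISO-8859-1
    // maps every byte value to exactly one char, so nothing was lost.
    static String repairLatin1(String garbled, Charset actual) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), actual);
    }

    public static void main(String[] args) {
        // First word of the example title, encoded as windows-1251.
        byte[] raw = { (byte) 0xCF, (byte) 0xE5, (byte) 0xF1, (byte) 0xED, (byte) 0xFF };
        String wrong = decode(raw, StandardCharsets.ISO_8859_1); // garbled text
        String right = decode(raw, CP1251);                      // Cyrillic text
        System.out.println(right.equals("\u041F\u0435\u0441\u043D\u044F"));
        System.out.println(repairLatin1(wrong, CP1251).equals(right));
    }
}
```

The reliable fix is still to pass the right charset where the bytes are first turned into a String, as in the `new String(abyFrameData, ...)` line from the question.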

Assuming ID3, the specification defines the encoding rules. For example, ID3v2.4.0 may restrict the encodings in use via the extended header:

q - Text encoding restrictions

   0    No restrictions
   1    Strings are only encoded with ISO-8859-1 [ISO-8859-1] or
        UTF-8 [UTF-8].

Encoding handling is defined further on in the document:

If nothing else is said, strings, including numeric strings and URLs, are represented as ISO-8859-1 characters in the range $20 - $FF. Such strings are represented in frame descriptions as <text string>, or <full text string> if newlines are allowed. If nothing else is said newline character is forbidden. In ISO-8859-1 a newline is represented, when allowed, with $0A only.

Frames that allow different types of text encoding contains a text encoding description byte. Possible encodings:

 $00   ISO-8859-1 [ISO-8859-1]. Terminated with $00.
 $01   UTF-16 [UTF-16] encoded Unicode [UNICODE] with BOM. All
       strings in the same frame SHALL have the same byteorder.
       Terminated with $00 00.
 $02   UTF-16BE [UTF-16] encoded Unicode [UNICODE] without BOM.
       Terminated with $00 00.
 $03   UTF-8 [UTF-8] encoded Unicode [UNICODE]. Terminated with
       $00.

Use a transcoding class such as InputStreamReader or (more likely in this case) the String(byte[],Charset) constructor to decode the data. See also Java: a rough guide to character encoding.
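The two options just mentioned can be sketched side by side. The class and method names below are hypothetical; the calls themselves are standard JDK APIs. Both routes should yield the same String for the same bytes and charset. The constructor suits data that is already fully in memory, while the reader suits streaming.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

class TranscodeDemo {
    // Decode a complete byte array in one step.
    static String viaConstructor(byte[] data, Charset charset) {
        return new String(data, charset);
    }

    // Decode the same bytes incrementally through a Reader.
    static String viaReader(byte[] data, Charset charset) {
        StringBuilder out = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(data), charset)) {
            char[] chunk = new char[256];
            int n;
            while ((n = reader.read(chunk)) != -1) {
                out.append(chunk, 0, n);
            }
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked.
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```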


Parsing the string component of an ID3v2.4.0 data structure would look something like this:

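// untested code
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.util.Arrays;

public String parseID3String(DataInputStream in) throws IOException {
  String[] encodings = { "ISO-8859-1", "UTF-16", "UTF-16BE", "UTF-8" };
  // the first byte selects the text encoding ($00 to $03)
  String encoding = encodings[in.readUnsignedByte()];
  // UTF-16 strings end with $00 00, the others with a single $00
  byte[] terminator =
      encoding.startsWith("UTF-16") ? new byte[2] : new byte[1];
  byte[] buf = terminator.clone();
  ByteArrayOutputStream buffer = new ByteArrayOutputStream();
  while (true) {
    in.readFully(buf);
    if (Arrays.equals(terminator, buf)) break; // keep the terminator out of the text
    buffer.write(buf);
  }
  return new String(buffer.toByteArray(), encoding);
}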
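When the whole frame body is already in a byte array, as with abyFrameData in the question, the same idea can be applied without a stream. The following is a rough sketch under stated assumptions, not a complete ID3 parser: it assumes the body is laid out as one encoding byte followed by the text, with a terminator that may or may not be present. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class FrameBodyDecoder {
    // Index = ID3v2.4.0 text encoding byte ($00 to $03).
    private static final Charset[] ENCODINGS = {
        StandardCharsets.ISO_8859_1, StandardCharsets.UTF_16,
        StandardCharsets.UTF_16BE, StandardCharsets.UTF_8
    };

    // Assumed layout: [encoding byte][text bytes][optional terminator].
    static String decodeTextFrame(byte[] body) {
        if (body.length == 0) {
            throw new IllegalArgumentException("empty frame body");
        }
        int code = body[0] & 0xFF;
        if (code >= ENCODINGS.length) {
            throw new IllegalArgumentException("unknown text encoding byte: " + code);
        }
        // Decode everything after the encoding byte in one step.
        String text = new String(body, 1, body.length - 1, ENCODINGS[code]);
        // A terminator decodes to U+0000; keep only the text before the first one.
        int end = text.indexOf('\0');
        return end < 0 ? text : text.substring(0, end);
    }
}
```

Unlike indexing the array directly, this rejects an unknown encoding byte with a descriptive exception. If a frame holds several null-separated strings, only the first is returned.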

Regarding "java - How do I convert a string of Russian Cyrillic characters?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/6017004/
