java - Why can't I store Japanese UTF-8 characters in a Java char array?

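I have the string "1234567(Asics (アシックスワーキング) )". It contains Unicode characters; some of them are ASCII and some are not. As I understand it, Java uses one byte for each ASCII character and two bytes for each of the other Unicode characters.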

Some parts of my program cannot handle strings in this format, so I want to encode these values as escape sequences.

So the string

"1234567(Asics (アシックスワーキング) )"

would map to

"\u0031\u0032\u0033\u0034\u0035\u0036\u0037\u0028\u0041\u0073\u0069\u0063\u0073\u0020\u0028\u30a2\u30b7\u30c3\u30af\u30b9\u30ef\u30fc\u30ad\u30f3\u30b0\u0029\u0020\u0029"

I wrote this function to do that:

    public static String convertToEscaped(String utf8) throws java.lang.Exception
    {
        char[] str = utf8.toCharArray();
        StringBuilder unicodeStringBuilder = new StringBuilder();
        for (int i = 0; i < str.length; i++) {
            char charValue = str[i];
            int intValue = (int) charValue;
            String hexValue = Integer.toHexString(intValue);
            unicodeStringBuilder.append("\\u");
            for (int length = hexValue.length(); length < 4; length++) {
                unicodeStringBuilder.append("0");
            }
            unicodeStringBuilder.append(hexValue);
        }
        return unicodeStringBuilder.toString();
    }

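This works fine outside my program but causes problems inside it. The problem occurs at the line char[] str = utf8.toCharArray(); Somehow I lose my Japanese Unicode characters, because it splits each of them into 2 in the char array.

So I decided to use byte[] instead.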

    public static String convertToEscaped(String utf8) throws java.lang.Exception
    {
        byte str[] = utf8.getBytes();
        StringBuilder unicodeStringBuilder = new StringBuilder();
        for (int i = 0; i < str.length - 1; i += 2) {
            int intValue = (int) str[i] * 256 + (int) str[i + 1];
            String hexValue = Integer.toHexString(intValue);
            unicodeStringBuilder.append("\\u");
            for (int length = hexValue.length(); length < 4; length++) {
                unicodeStringBuilder.append("0");
            }
            unicodeStringBuilder.append(hexValue);
        }
        return unicodeStringBuilder.toString();
    }

Output : \u3132\u3334\u3536\u3738\u2841\u7369\u6373\u2028\uffffe282\uffffa1e3\uffff81b7\uffffe283\uffff82e3\uffff81af\uffffe282\uffffb8e3\uffff82af\uffffe283\uffffbbe3\uffff81ad\uffffe283\uffffb2e3\uffff81b0\u2920

But this is also wrong, because I am merging two single-byte characters into one. What can I do to overcome this?

Best answer

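I don't know the exact requirements of the rest of your code, but my advice is not to reinvent the wheel and to use the API's built-in encoding facilities.

For example, call getBytes with StandardCharsets.UTF_16BE or StandardCharsets.UTF_16LE, depending on the byte order you need: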

import java.nio.charset.StandardCharsets;

String s = "1234567(Asics (アシックスワーキング) )";

byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
byte[] utf16 = s.getBytes(StandardCharsets.UTF_16BE); // high order byte first

System.out.println(s.length()); // 28
System.out.println(utf8.length); // 48
System.out.println(utf16.length); // 56 (2 bytes for each char)
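
For the escaping itself, one possible approach is to walk the string's char values directly instead of its bytes. This is only a minimal sketch, not the asker's code: the class name EscapeSketch and the method name toEscaped are made up for illustration. A Java String already holds UTF-16 code units, so every character in the Basic Multilingual Plane, including the katakana in the example, occupies exactly one char, and String.format can take care of the zero padding:

```java
final class EscapeSketch {

    // Hypothetical helper (not from the original post): writes each UTF-16
    // code unit of the input as a backslash, the letter u and four hex digits.
    public static String toEscaped(String text) {
        StringBuilder sb = new StringBuilder(text.length() * 6);
        for (int i = 0; i < text.length(); i++) {
            // A char is an unsigned 16-bit value, so this cast is never negative.
            sb.append(String.format("\\u%04x", (int) text.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The letter A followed by KATAKANA LETTER A, written as an escape
        // so that the source file stays ASCII-only.
        String sample = "A\u30a2";
        System.out.println(toEscaped(sample));
    }
}
```

For the sample (the letter A followed by ア) this prints \u0041\u30a2. Working on char values rather than bytes also avoids the ffff prefixes seen in the question's output: Java's byte type is signed, so casting a byte of 0x80 or above to int yields a negative number, which Integer.toHexString renders with eight hex digits. Characters outside the Basic Multilingual Plane would come out as two escapes (a surrogate pair).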

Regarding "java - Why can't I store Japanese UTF-8 characters in a Java char array?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/42326953/
