java - Getting true UTF-8 characters in Java JNI

Tags: java encoding utf-8 java-native-interface

Is there an easy way to convert a Java string to a true UTF-8 byte array in JNI code?

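Unfortunately, GetStringUTFChars() almost does what's required, but not quite: it returns a "modified" UTF-8 byte sequence. The main difference is that modified UTF-8 never contains a null byte (so you can treat it as an ANSI C null-terminated string), but another difference seems to be how Unicode supplementary characters such as emoji are handled.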

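A character such as U+1F604 "SMILING FACE WITH OPEN MOUTH AND SMILING EYES" is stored as a surrogate pair (two UTF-16 code units, U+D83D U+DE04) and has a 4-byte UTF-8 equivalent of F0 9F 98 84. That is the byte sequence I get if I convert the string to UTF-8 in Java: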

    char[] c = Character.toChars(0x1F604);
    String s = new String(c);
    System.out.println(s);
    for (int i=0; i<c.length; ++i)
        System.out.println("c["+i+"] = 0x"+Integer.toHexString(c[i]));
    byte[] b = s.getBytes("UTF-8");
    for (int i=0; i<b.length; ++i)
        System.out.println("b["+i+"] = 0x"+Integer.toHexString(b[i] & 0xFF));

The code above prints the following:

    😄
    c[0] = 0xd83d
    c[1] = 0xde04
    b[0] = 0xf0
    b[1] = 0x9f
    b[2] = 0x98
    b[3] = 0x84

However, if I pass 's' into a native JNI method and call GetStringUTFChars(), I get 6 bytes. Each half of the surrogate pair is converted to a 3-byte sequence independently:

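    #include <jni.h>
    #include <cstdio>

    JNIEXPORT void JNICALL Java_EmojiTest_nativeTest(JNIEnv *env, jclass cls, jstring _s)
    {
        const char* sBytes = env->GetStringUTFChars(_s, NULL);
        if (sBytes == NULL)
            return; // an OutOfMemoryError is already pending
        // Cast to unsigned char so bytes >= 0x80 are not sign-extended by %02x
        for (int i=0; sBytes[i]!=0; ++i)
            fprintf(stderr, "%d: %02x\n", i, (unsigned char)sBytes[i]);
        env->ReleaseStringUTFChars(_s, sBytes);
    }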

    0: ed
    1: a0
    2: bd
    3: ed
    4: b8
    5: 84

The Wikipedia UTF-8 article suggests that GetStringUTFChars() actually returns CESU-8 rather than UTF-8. That in turn causes my native Mac code to crash, because the result is not a valid UTF-8 sequence:

    CFStringRef str = CFStringCreateWithCString(NULL, path, kCFStringEncodingUTF8);
    CFURLRef url = CFURLCreateWithFileSystemPath(NULL, str, kCFURLPOSIXPathStyle, false);

I suppose I could change all my JNI methods to take a byte[] rather than a String and do the UTF-8 conversion in Java, but that seems a bit ugly. Is there a better solution?

Best answer

This is clearly explained in the Java documentation:

JNI Functions

GetStringUTFChars

const char * GetStringUTFChars(JNIEnv *env, jstring string, jboolean *isCopy);

Returns a pointer to an array of bytes representing the string in modified UTF-8 encoding. This array is valid until it is released by ReleaseStringUTFChars().

Modified UTF-8

The JNI uses modified UTF-8 strings to represent various string types. Modified UTF-8 strings are the same as those used by the Java VM. Modified UTF-8 strings are encoded so that character sequences that contain only non-null ASCII characters can be represented using only one byte per character, but all Unicode characters can be represented.

All characters in the range \u0001 to \u007F are represented by a single byte, as follows:

    byte: 0 | bits 6-0

The seven bits of data in the byte give the value of the character represented.

The null character ('\u0000') and characters in the range '\u0080' to '\u07FF' are represented by a pair of bytes x and y:

    x: 1 1 0 | bits 10-6
    y: 1 0 | bits 5-0

The bytes represent the character with the value ((x & 0x1f) << 6) + (y & 0x3f).

Characters in the range '\u0800' to '\uFFFF' are represented by 3 bytes x, y, and z:

    x: 1 1 1 0 | bits 15-12
    y: 1 0 | bits 11-6
    z: 1 0 | bits 5-0

The character with the value ((x & 0xf) << 12) + ((y & 0x3f) << 6) + (z & 0x3f) is represented by the bytes.

Characters with code points above U+FFFF (so-called supplementary characters) are represented by separately encoding the two surrogate code units of their UTF-16 representation. Each of the surrogate code units is represented by three bytes. This means, supplementary characters are represented by six bytes, u, v, w, x, y, and z:

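    u: 1 1 1 0 1 1 0 1
    v: 1 0 1 0 | (bits 20-16) - 1
    w: 1 0 | bits 15-10
    x: 1 1 1 0 1 1 0 1
    y: 1 0 1 1 | bits 9-6
    z: 1 0 | bits 5-0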

The character with the value 0x10000+((v&0x0f)<<16)+((w&0x3f)<<10)+((y&0x0f)<<6)+(z&0x3f) is represented by the six bytes.

The bytes of multibyte characters are stored in the class file in big-endian (high byte first) order.

There are two differences between this format and the standard UTF-8 format. First, the null character (char)0 is encoded using the two-byte format rather than the one-byte format. This means that modified UTF-8 strings never have embedded nulls. Second, only the one-byte, two-byte, and three-byte formats of standard UTF-8 are used. The Java VM does not recognize the four-byte format of standard UTF-8; it uses its own two-times-three-byte format instead.

For more information regarding the standard UTF-8 format, see section 3.9 Unicode Encoding Forms of The Unicode Standard, Version 4.0.

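U+1F604 is a supplementary character, and Java's modified UTF-8 does not use the 4-byte encoding format of standard UTF-8. U+1F604 is therefore represented in modified UTF-8 by encoding its UTF-16 surrogate pair, U+D83D U+DE04, with 3 bytes per surrogate, for 6 bytes in total.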
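
To make the arithmetic concrete, here is a minimal C++ sketch of that six-byte encoding and of the decoding formula quoted above. The function names are made up for illustration and are not part of JNI. It assumes the input really is a supplementary code point (U+10000 to U+10FFFF) and does no validation.

```cpp
#include <array>
#include <cstdint>

// Encode one 16-bit value in the three-byte form (meant for 0x0800..0xFFFF).
static void encodeThreeBytes(std::uint16_t unit, std::uint8_t* out) {
    out[0] = static_cast<std::uint8_t>(0xE0 | (unit >> 12));
    out[1] = static_cast<std::uint8_t>(0x80 | ((unit >> 6) & 0x3F));
    out[2] = static_cast<std::uint8_t>(0x80 | (unit & 0x3F));
}

// Hypothetical helper: modified UTF-8 bytes for a supplementary code point.
// Split the code point into its UTF-16 surrogate pair, then encode each
// surrogate separately with three bytes.
std::array<std::uint8_t, 6> encodeSupplementaryModified(std::uint32_t codePoint) {
    const std::uint32_t offset = codePoint - 0x10000;
    const std::uint16_t high = static_cast<std::uint16_t>(0xD800 + (offset >> 10));
    const std::uint16_t low = static_cast<std::uint16_t>(0xDC00 + (offset & 0x3FF));
    std::array<std::uint8_t, 6> bytes{};
    encodeThreeBytes(high, bytes.data());
    encodeThreeBytes(low, bytes.data() + 3);
    return bytes;
}

// Hypothetical helper: the inverse, using the documentation's formula with
// bytes named u, v, w, x, y, z (u and x are always 0xED and carry no data).
std::uint32_t decodeSupplementaryModified(const std::array<std::uint8_t, 6>& b) {
    const std::uint32_t v = b[1], w = b[2], y = b[4], z = b[5];
    return 0x10000 + ((v & 0x0F) << 16) + ((w & 0x3F) << 10)
                   + ((y & 0x0F) << 6) + (z & 0x3F);
}
```

For U+1F604 the encoder yields ED A0 BD ED B8 84, the same six bytes the question observed from GetStringUTFChars(), and the decoder maps them back to 0x1F604.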

So, to answer your question...

Is there an easy way to convert a Java string to a true UTF-8 byte array in JNI code?

You can either:

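  1. Use GetStringChars() to get the original UTF-16 encoded characters, and then create your own UTF-8 byte array from them. The conversion from UTF-16 to UTF-8 is a very simple algorithm to implement by hand, or you can use any pre-existing implementation provided by your platform or a 3rd party library.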

  2. Have your JNI code call back into Java to invoke the String.getBytes(String charsetName) method, which encodes the jstring object to a UTF-8 byte array, e.g.:

    JNIEXPORT void JNICALL Java_EmojiTest_nativeTest(JNIEnv *env, jclass cls, jstring _s)
    {
        // Look up String.getBytes(String charsetName)
        const jclass stringClass = env->GetObjectClass(_s);
        const jmethodID getBytes = env->GetMethodID(stringClass, "getBytes", "(Ljava/lang/String;)[B");

        const jstring charsetName = env->NewStringUTF("UTF-8");
        const jbyteArray stringJbytes = (jbyteArray) env->CallObjectMethod(_s, getBytes, charsetName);
        env->DeleteLocalRef(charsetName);
        if (stringJbytes == NULL)
            return; // a Java exception is pending

        const jsize length = env->GetArrayLength(stringJbytes);
        // Not const: ReleaseByteArrayElements() takes a non-const jbyte*
        jbyte* pBytes = env->GetByteArrayElements(stringJbytes, NULL);

        // jbyte is signed; cast so bytes >= 0x80 print as two hex digits
        for (int i = 0; i < length; ++i)
            fprintf(stderr, "%d: %02x\n", i, (unsigned char) pBytes[i]);

        // JNI_ABORT: release without copying back, since nothing was modified
        env->ReleaseByteArrayElements(stringJbytes, pBytes, JNI_ABORT);
        env->DeleteLocalRef(stringJbytes);
    }
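
For option 1, a hand-rolled conversion could look roughly like the sketch below. It is plain C++ with no JNI calls, so it can be tested on its own. The names utf16ToUtf8 and appendUtf8 are made up for illustration. Replacing an unpaired surrogate with U+FFFD is a choice made here, not something the JNI documentation prescribes.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Append one Unicode code point to `out` as standard UTF-8 (1 to 4 bytes).
static void appendUtf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: convert `length` UTF-16 code units to standard UTF-8.
std::string utf16ToUtf8(const std::uint16_t* units, std::size_t length) {
    std::string out;
    out.reserve(length * 3);
    for (std::size_t i = 0; i < length; ++i) {
        std::uint32_t cp = units[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: combine with the following low surrogate.
            if (i + 1 < length && units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (units[i + 1] - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD; // unpaired high surrogate (assumed policy)
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD; // unpaired low surrogate (assumed policy)
        }
        appendUtf8(out, cp);
    }
    return out;
}
```

In a native method you would pass it the pointer from GetStringChars() and the count from GetStringLength(), then call ReleaseStringChars(). This assumes jchar is a 16-bit unsigned type compatible with std::uint16_t, which is worth checking against your platform's jni.h. Unlike the modified UTF-8 output, the result may contain embedded 0x00 bytes, so carry the length along rather than relying on null termination.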
    

The Wikipedia UTF-8 article suggests that GetStringUTFChars() actually returns CESU-8 rather than UTF-8

Java's Modified UTF-8 is not exactly the same as CESU-8:

CESU-8 is similar to Java's Modified UTF-8 but does not have the special encoding of the NUL character (U+0000).
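
In practice, the output of GetStringUTFChars() is byte-for-byte identical to standard UTF-8 unless the string contains U+0000 or a supplementary character. The sketch below checks for the two byte patterns that give those cases away. The function name is made up for illustration, and the check assumes the input is well-formed modified UTF-8; it is not a general UTF-8 validator.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `bytes` (assumed to be well-formed
// modified UTF-8) contains neither of the two non-standard patterns, so it
// can also be read as standard UTF-8.
bool isAlsoStandardUtf8(const std::string& bytes) {
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        if (b == 0xC0) {
            return false; // C0 80: the two-byte encoding of U+0000
        }
        if (b == 0xED && i + 1 < bytes.size()) {
            const unsigned char next = static_cast<unsigned char>(bytes[i + 1]);
            if (next >= 0xA0 && next <= 0xBF) {
                return false; // ED A0..BF: a separately encoded surrogate
            }
        }
    }
    return true;
}
```

A check like this only tells you when the cheap path is safe. When it returns false, the string still needs one of the two conversions described in the answer.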

Regarding java - Getting true UTF-8 characters in Java JNI, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/32205446/
