java - How to find characters in Java that cannot be stored in a MySQL "utf8" column

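I am using MySQL 5.7 and have a table with a column that uses the "utf8" character set. Unfortunately it is not utf8mb4, so I get an error whenever my application tries to insert characters outside the "utf8" range (emoji, for example).

I cannot change the character set to "utf8mb4" any time soon, so I would like to know whether it is possible to detect the characters that cause the error before inserting into the table, and let our customers know that they cannot use them.

I read somewhere that anything outside the range U+0000 to U+FFFF causes the error. My application is implemented in Java 8. So my question is: how can I write code that finds such problematic characters in a String instance? Does the following code do what I want?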

import java.util.Set;
import java.util.stream.Collectors;

class Utf8Mb3Validator {

    /**
     * Finds characters in a given String which can't be stored in a MySQL "utf8" column.
     *
     * @param input a String which you want to check
     * @return a Set which contains strings that can't be inserted into MySQL "utf8" columns
     */
    Set<String> findProblematicStrings(String input) {
        // References:
        // https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-utf8mb3.html
        // https://www.oracle.com/technetwork/java/javase/downloads/supplementary-142654.html?printOnly=1
        // https://stackoverflow.com/q/56800767/3591946
        return input
                .codePoints() // get Unicode code points
                .filter(codePoint -> Character.charCount(codePoint) > 1) // search for non-BMP characters
                .mapToObj(codePoint -> new String(Character.toChars(codePoint))) // convert code points into Strings
                .collect(Collectors.toSet());
    }
}

I have also posted this question to the MySQL forum: https://forums.mysql.com/read.php?39,675862,675862#msg-675862

Best answer

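MySQL's utf8 was in fact reasonable for its time: it stores UTF-8 multibyte sequences of at most 3 bytes, which covers U+0000 to U+FFFF. But Unicode grew to include symbols beyond that range, and those need 4 bytes in UTF-8. Only utf8mb4 can store them.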
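To make the byte counts concrete, here is a small sketch that prints the UTF-8 length of a few sample code points. The class name `Utf8ByteLengthDemo` and the helper `utf8Length` are made up for illustration; only standard JDK calls are used.

```java
import java.nio.charset.StandardCharsets;

class Utf8ByteLengthDemo {

    // Hypothetical helper: number of bytes a code point occupies in UTF-8.
    static int utf8Length(int codePoint) {
        return new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Latin A, e with acute, euro sign, grinning face
        int[] samples = {0x41, 0xE9, 0x20AC, 0x1F600};
        for (int cp : samples) {
            System.out.printf("U+%04X -> %d byte(s)%n", cp, utf8Length(cp));
        }
    }
}
```

The first three samples need 1, 2 and 3 bytes and fit into a "utf8" column. U+1F600 needs 4 bytes and does not.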

Anything that encodes to at most 3 bytes is fine, so you can filter on the encoded length:

// needs: import java.nio.charset.StandardCharsets;
return input
      .codePoints()
      .filter(codePoint -> codePoint >= 256) // Optional heuristic optimisation: skip the cheap 1- and 2-byte cases
      .mapToObj(codePoint -> new String(Character.toChars(codePoint)))
      .filter(cpString -> cpString.getBytes(StandardCharsets.UTF_8).length > 3) // keep only 4-byte sequences
      .collect(Collectors.toSet());

Or simply take all code points above U+FFFF:

return input
      .codePoints()
      .filter(codePoint -> codePoint >= 0x1_0000)
      .mapToObj(codePoint -> new String(Character.toChars(codePoint)))
      .collect(Collectors.toSet());
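For anyone who wants to run the two fragments above side by side, here is a minimal self-contained sketch. The class name `Utf8Mb3Check`, the method names and the sample input are assumptions made for this example, not part of the question's code.

```java
import java.nio.charset.StandardCharsets;
import java.util.Set;
import java.util.stream.Collectors;

class Utf8Mb3Check {

    // Variant 1: flag code points whose UTF-8 encoding is longer than 3 bytes.
    static Set<String> byUtf8Length(String input) {
        return input.codePoints()
                .mapToObj(cp -> new String(Character.toChars(cp)))
                .filter(s -> s.getBytes(StandardCharsets.UTF_8).length > 3)
                .collect(Collectors.toSet());
    }

    // Variant 2: flag code points above U+FFFF.
    static Set<String> byCodePoint(String input) {
        return input.codePoints()
                .filter(cp -> cp > 0xFFFF)
                .mapToObj(cp -> new String(Character.toChars(cp)))
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        // Contains e-acute (2 bytes), euro sign (3 bytes), U+1F600 twice and U+1F389 once.
        String input = "caf\u00E9 \u20AC5 \uD83D\uDE00 \uD83D\uDE00 \uD83C\uDF89";
        Set<String> problematic = byUtf8Length(input);
        System.out.println(problematic.size());                      // 2 distinct characters
        System.out.println(problematic.equals(byCodePoint(input)));  // true
    }
}
```

Because the result is a Set, a character that occurs several times in the input is reported once.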

To be honest, I would have to check whether Character.charCount(codePoint) can be used as well, because it tests for surrogate pairs in UTF-16 rather than the number of bytes in UTF-8.
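One way to settle that doubt is to compare the tests over every code point. Character.charCount returns 2 exactly for code points from U+10000 upward, and those are also the code points that take 4 bytes in UTF-8, so the question's filter should select the same characters. The sketch below checks this by brute force; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class CharCountEquivalence {

    // True when the three tests give the same verdict for this code point.
    static boolean testsAgree(int codePoint) {
        boolean surrogatePair = Character.charCount(codePoint) > 1;
        boolean aboveBmp = codePoint > 0xFFFF;
        boolean fourBytes = new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8).length > 3;
        return surrogatePair == aboveBmp && aboveBmp == fourBytes;
    }

    // Checks every code point from U+0000 to U+10FFFF.
    static boolean agreeEverywhere() {
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (!testsAgree(cp)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(agreeEverywhere()); // true
    }
}
```

Unpaired surrogates (U+D800 to U+DFFF) are not flagged by any of the three tests. If such input can occur, it would need a separate check.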

Character.getName(codePoint) might help if you want to replace the offending code points with something readable (provided the column is long enough).
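A rough sketch of that replacement idea follows. The class name `SupplementaryReplacer`, the method name and the square-bracket format are assumptions chosen for illustration. Character.getName returns null for unassigned code points, so the sketch falls back to U+XXXX notation in that case.

```java
class SupplementaryReplacer {

    // Replaces each code point above U+FFFF with its Unicode name in square brackets.
    static String replaceWithNames(String input) {
        StringBuilder out = new StringBuilder();
        input.codePoints().forEach(cp -> {
            if (Character.isSupplementaryCodePoint(cp)) {
                String name = Character.getName(cp); // null if the code point is unassigned
                out.append('[')
                   .append(name != null ? name : String.format("U+%04X", cp))
                   .append(']');
            } else {
                out.appendCodePoint(cp); // BMP characters are copied unchanged
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceWithNames("hi \uD83D\uDE00")); // hi [GRINNING FACE]
    }
}
```

The replacement text is much longer than the original character, which is why the column length matters here.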

Source: the original question on Stack Overflow, https://stackoverflow.com/questions/56800767/
