regex - Remove all Unicode space separators in PostgreSQL?

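I would like to trim() a column and replace any run of multiple white spaces and Unicode space separators with a single space. The idea is to sanitize usernames, preventing two users from having deceptive names: foo bar (SPACE, U+0020) vs. foo bar (NO-BREAK SPACE, U+00A0).
Until now I have used SELECT regexp_replace(TRIM('some string'), '[\s\v]+', ' ', 'g'); it removes spaces, tabs and carriage returns, but it lacks support for Unicode space separators.
I would have added \h to the regular expression, but PostgreSQL does not support it (nor \p{Zs}):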

SELECT regexp_replace(TRIM('some string'), '[\s\v\h]+', ' ', 'g');

Error in query (7): ERROR: invalid regular expression: invalid escape \ sequence

We are running PostgreSQL 12 (12.2-2.pgdg100+1) in a Debian 10 docker container, using UTF-8 encoding, and we support emojis in usernames.
Is there a way to achieve something similar?

Best answer

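Based on the Posix "space" character class (class shorthand \s in Postgres regular expressions), the UNICODE "Spaces", some space-like "Format characters", and some additional non-printing characters (the last two added from Wiktor's post), I condensed this custom character class: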

'[\s\u00a0\u180e\u2007\u200b-\u200f\u202f\u2060\ufeff]'

So use:

SELECT trim(regexp_replace('some string', '[\s\u00a0\u180e\u2007\u200b-\u200f\u202f\u2060\ufeff]+', ' ', 'g'));

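Note: trim() comes after regexp_replace(), so it also covers the spaces produced by the replacement.
It is important to include the basic space class \s (short for [[:space:]]) to cover all current (and future) basic space characters.
We might include more characters. Or start by stripping all characters encoded with 4 bytes. Because UNICODE is dark and full of terrors.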
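As a quick check against the deceptive names from the question, the expression can be tried on a couple of sample values. This is only a sketch: the sample names are made up for illustration, and the U&'...' Unicode escapes assume a UTF-8 server encoding and the default standard_conforming_strings = on.

```sql
-- Hypothetical sample names; U&'...' escapes embed the invisible characters.
SELECT uname = 'foo bar' AS equal_raw
     , trim(regexp_replace(uname, '[\s\u00a0\u180e\u2007\u200b-\u200f\u202f\u2060\ufeff]+', ' ', 'g')) = 'foo bar' AS equal_cleaned
FROM  (
   VALUES
     (U&'foo\00A0bar')            -- NO-BREAK SPACE between the words
   , (U&' foo \2007\200B bar ')   -- padding spaces, FIGURE SPACE, ZERO WIDTH SPACE
   ) t(uname);
```

Both rows should come back with equal_raw = false and equal_cleaned = true: the raw strings differ from the plain 'foo bar', while the cleaned versions collapse to it.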
Consider this demo:

SELECT d AS decimal, to_hex(d) AS hex, chr(d) AS glyph
     , '\u' || lpad(to_hex(d), 4, '0') AS unicode
     , chr(d) ~ '\s' AS in_posix_space_class
     , chr(d) ~ '[\s\u00a0\u180e\u2007\u200b-\u200f\u202f\u2060\ufeff]' AS in_custom_class
FROM  (
   -- TAB, SPACE, NO-BREAK SPACE, OGHAM SPACE MARK, MONGOLIAN VOWEL SEPARATOR, NARROW NO-BREAK SPACE
   -- MEDIUM MATHEMATICAL SPACE, WORD JOINER, IDEOGRAPHIC SPACE, ZERO WIDTH NO-BREAK SPACE
   SELECT unnest('{9,32,160,5760,6158,8239,8287,8288,12288,65279}'::int[])
   UNION ALL
   SELECT generate_series (8192, 8202) AS dec  -- UNICODE "Spaces"
   UNION ALL
   SELECT generate_series (8203, 8207) AS dec  -- First 5 space-like UNICODE "Format characters"
   ) t(d)
ORDER  BY d;

 decimal | hex  |  glyph   | unicode | in_posix_space_class | in_custom_class 
---------+------+----------+---------+----------------------+-----------------
       9 | 9    |          | \u0009  | t                    | t
      32 | 20   |          | \u0020  | t                    | t
     160 | a0   |          | \u00a0  | f                    | t
    5760 | 1680 |          | \u1680  | t                    | t
    6158 | 180e | ᠎        | \u180e  | f                    | t
    8192 | 2000 |          | \u2000  | t                    | t
    8193 | 2001 |          | \u2001  | t                    | t
    8194 | 2002 |          | \u2002  | t                    | t
    8195 | 2003 |          | \u2003  | t                    | t
    8196 | 2004 |          | \u2004  | t                    | t
    8197 | 2005 |          | \u2005  | t                    | t
    8198 | 2006 |          | \u2006  | t                    | t
    8199 | 2007 |          | \u2007  | f                    | t
    8200 | 2008 |          | \u2008  | t                    | t
    8201 | 2009 |          | \u2009  | t                    | t
    8202 | 200a |          | \u200a  | t                    | t
    8203 | 200b |         | \u200b  | f                    | t
    8204 | 200c | ‌        | \u200c  | f                    | t
    8205 | 200d | ‍        | \u200d  | f                    | t
    8206 | 200e | ‎        | \u200e  | f                    | t
    8207 | 200f | ‏        | \u200f  | f                    | t
    8239 | 202f |          | \u202f  | f                    | t
    8287 | 205f |          | \u205f  | t                    | t
    8288 | 2060 | ⁠        | \u2060  | f                    | t
   12288 | 3000 | 　       | \u3000  | t                    | t
   65279 | feff |         | \ufeff  | f                    | t
(26 rows)

Tool to generate the character class:

SELECT '[\s' || string_agg('\u' || lpad(to_hex(d), 4, '0'), '' ORDER BY d) || ']'
FROM  (
   SELECT unnest('{9,32,160,5760,6158,8239,8287,8288,12288,65279}'::int[])
   UNION ALL
   SELECT generate_series (8192, 8202)
   UNION ALL
   SELECT generate_series (8203, 8207)
   ) t(d)
WHERE  chr(d) !~ '\s'; -- not covered by \s

[\s\u00a0\u180e\u2007\u200b\u200c\u200d\u200e\u200f\u202f\u2060\ufeff]
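Once the character class is settled, one possible way to apply it consistently is to wrap the expression in a function and enforce uniqueness on the cleaned value. The following is a sketch under assumptions, not part of the original answer: the function name normalize_spaces, the table users and its column username are all hypothetical.

```sql
-- Hypothetical helper; the name normalize_spaces is an assumption for illustration.
CREATE FUNCTION normalize_spaces(text)
  RETURNS text
  LANGUAGE sql IMMUTABLE PARALLEL SAFE AS
$func$
SELECT trim(regexp_replace($1, '[\s\u00a0\u180e\u2007\u200b-\u200f\u202f\u2060\ufeff]+', ' ', 'g'));
$func$;

-- Assumed table layout: users(username text).
-- The unique expression index rejects names that differ only in space-like characters.
CREATE UNIQUE INDEX users_username_normalized_uni ON users (normalize_spaces(username));
```

With such an index in place, inserting 'foo bar' with a no-break space would fail if 'foo bar' with a plain space already exists. Keep in mind that what \s matches beyond ASCII depends on the database locale, so the index should be rebuilt if the character class or the locale ever changes.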

db<>fiddle here
Related, with more explanation:

Trim trailing spaces with PostgreSQL

Source: Stack Overflow, https://stackoverflow.com/questions/63302656/
