python - Regex - Extract substrings that start with a capital letter into a list, with French special characters

I have French strings like this one:

text = "Français Langues bantoues Presse écrite Gabon Particularité linguistique"

I want to extract the substrings that start with a capital letter into a list, like this:

list = ["Français", "Langues bantoues", "Presse écrite", "Gabon", "Particularité linguistique"]

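I tried something like the following, but it does not pick up the lowercase words that follow each capitalized word, and it stops at the French accented characters.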

import re
pattern = "([A-Z][a-z]+)"

text = "Français Langues bantoues Presse écrite Gabon Particularité linguistique"

list = re.findall(pattern, text)
list

The output is ['Fran', 'Langues', 'Presse', 'Gabon', 'Particularit']. Unfortunately, I could not find a solution on the forum.

Best answer

Since this comes down to handling specific Unicode characters, I suggest using the PyPI regex module (install it with pip install regex). Then you can use:

import regex
text = "Français Langues bantoues Presse écrite Gabon Particularité linguistique"
matches = regex.split(r'(?!\A)\b(?=\p{Lu})', text)
print( list(map(lambda x: x.strip(), matches)) )
# => ['Français', 'Langues bantoues', 'Presse écrite', 'Gabon', 'Particularité linguistique']

Details of the pattern:

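(?!\A) - a negative lookahead that matches at any position other than the start of the string

\b - a word boundary

(?=\p{Lu}) - a positive lookahead that requires the next character to be a Unicode uppercase letter.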

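Note that map(lambda x: x.strip(), matches) is used to strip the extra whitespace from the resulting chunks: the split happens right before each uppercase letter, so every chunk except the last keeps its trailing space.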
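
To make the split positions and the need for strip() concrete, here is a minimal sketch that uses only the standard library. It assumes a hand-written character class, [A-ZÀ-ÖØ-Þ] (ASCII plus the Latin-1 uppercase letters), as a stand-in for \p{Lu}. That class is an illustration for this sample text only and is not part of the original answer. Splitting on a zero-width match needs Python 3.7 or later.

```python
import re

text = "Français Langues bantoues Presse écrite Gabon Particularité linguistique"

# Assumption: this narrow class (A-Z plus Latin-1 uppercase letters) stands in
# for \p{Lu}, which the built-in re module does not support.
UPPER = "[A-ZÀ-ÖØ-Þ]"
boundary = re.compile(rf"(?!\A)\b(?={UPPER})")

# Zero-width matches: each one marks a position just before an uppercase letter.
split_points = [m.start() for m in boundary.finditer(text)]

# Splitting at those positions leaves the separating space at the end of each piece.
raw_pieces = boundary.split(text)
print(raw_pieces)
# => ['Français ', 'Langues bantoues ', 'Presse écrite ', 'Gabon ', 'Particularité linguistique']

# Stripping each piece gives the desired list.
clean_pieces = [piece.strip() for piece in raw_pieces]
print(clean_pieces)
# => ['Français', 'Langues bantoues', 'Presse écrite', 'Gabon', 'Particularité linguistique']
```

The hard-coded class is enough for this sample, but it misses uppercase letters outside Latin-1 (for example Œ), which is why a full Unicode-aware class is preferable in general.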

You can also do this with the built-in re module by building the uppercase character class yourself:

import re, sys
text = "Français Langues bantoues Presse écrite Gabon Particularité linguistique"
pLu = '[{}]'.format("".join([chr(i) for i in range(sys.maxunicode) if chr(i).isupper()]))
matches = re.split(fr'(?!\A)\b(?={pLu})', text)
print( list(map(lambda x: x.strip(), matches)) )
# => ['Français', 'Langues bantoues', 'Presse écrite', 'Gabon', 'Particularité linguistique']

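Keep in mind, however, that the set of Unicode uppercase letters Python recognizes varies from one Python version to another; using the PyPI regex module makes the behavior more consistent.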
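
As a rough way to see this version dependence, the sketch below reports which Unicode database the running interpreter ships with and how many code points str.isupper() accepts. The variable name upper_chars is only illustrative, and no expected count is shown because the number differs between Python releases.

```python
import sys
import unicodedata

# Every code point that str.isupper() accepts in this interpreter; this is the
# same test the re-based pattern above uses to build its character class.
upper_chars = [chr(i) for i in range(sys.maxunicode + 1) if chr(i).isupper()]

# The Unicode database version is fixed per Python release, so the count
# printed below can change when the interpreter is upgraded.
print("Unicode database version:", unicodedata.unidata_version)
print("Uppercase code points:", len(upper_chars))
```

Running this under two different Python versions and comparing the counts shows whether the generated character class would differ between them.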

A similar question can be found on Stack Overflow: https://stackoverflow.com/questions/67197407/
