python - How to tokenize, splitting adjacent digits and letters?

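I am trying to tokenize something like hello world123 into hello, world, and 123. I think I have the two pieces of code I need, but I can't combine them so that the string is tokenized correctly.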

(?u)\b\w+\b
(?<=\D)(?=\d)|(?<=\d)(?=\D)

Best answer

You can use

import re
s = "hello world123"
print(re.findall(r'[^\W\d_]+|\d+', s))
# => ['hello', 'world', '123']

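Pattern details

[^\W\d_]+ - one or more letters (any word character that is neither a digit nor an underscore)
| - or
\d+ - one or more digits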
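As a quick check of how the pattern treats underscores and non-ASCII letters, here is a small sketch. The sample string is made up for illustration and is not from the original answer.

```python
import re

# Letters-or-digits pattern from the answer above.
TOKEN = re.compile(r'[^\W\d_]+|\d+')

# Hypothetical sample input: mixed letters/digits, an underscore,
# and a non-ASCII letter (e with acute accent).
sample = "abc123def_45 caf\u00e999"

tokens = TOKEN.findall(sample)
print(tokens)
```

Underscores and whitespace fall outside both alternatives, so they act as separators and are dropped from the result.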

BONUS: To match any substring of letters and various kinds of numbers, use

[^\W\d_]+|[-+]?\d*\.?\d+(?:[eE][+-]?\d+)?
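A minimal sketch of the extended pattern in use. The input string is an invented example; note that a leading sign is captured as part of the number.

```python
import re

# Letters, or a number with optional sign, fraction, and exponent.
NUM_OR_WORD = re.compile(r'[^\W\d_]+|[-+]?\d*\.?\d+(?:[eE][+-]?\d+)?')

# Hypothetical sample input, not from the original answer.
sample = "temp -3.5e10 and pi 3.14 x42"

tokens = NUM_OR_WORD.findall(sample)
print(tokens)
# => ['temp', '-3.5e10', 'and', 'pi', '3.14', 'x', '42']
```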

See Parsing scientific notation sensibly? for details on the regex.
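As an aside, the two patterns from the question can also be chained rather than merged: find the words first, then split each word at digit/non-digit boundaries. This is only a sketch, and it assumes Python 3.7 or later, where re.split accepts patterns that match the empty string. The tokenize helper is a hypothetical name, not part of the original answer.

```python
import re

# The two patterns from the question, applied one after the other.
WORD = re.compile(r'(?u)\b\w+\b')
BOUNDARY = re.compile(r'(?<=\D)(?=\d)|(?<=\d)(?=\D)')

def tokenize(text):
    """Hypothetical helper: extract words, then split each one
    wherever a digit meets a non-digit."""
    tokens = []
    for word in WORD.findall(text):
        # Splitting on zero-width matches requires Python 3.7+.
        tokens.extend(BOUNDARY.split(word))
    return tokens

print(tokenize("hello world123"))
# => ['hello', 'world', '123']
```

Unlike the single-pattern answer, this variant keeps underscores, because \w matches them: foo_1 becomes foo_ and 1.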

For more on "python - How to tokenize, splitting adjacent digits and letters?", see the similar question on Stack Overflow: https://stackoverflow.com/questions/54568471/
