python - Split a string using a list of strings as the pattern

Consider the input string:

mystr = "just some stupid string to illustrate my question"

and a list of strings indicating where to split the input string:

splitters = ["some", "illustrate"]

The output should be

result = ["just ", "some stupid string to ", "illustrate my question"]

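I wrote some code that implements the following approach. For each string in splitters, I find its occurrences in the input string and insert something that I know for sure will not be part of the input string (for example, '!!'). Then I split the string on the substring I just inserted.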

for s in splitters:
    mystr = re.sub(r'(%s)'%s,r'!!\1', mystr)

result = re.split('!!', mystr)

This solution looks ugly. Is there a nicer way to do it?

Best answer

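Splitting with re.split always removes the matched string from the output (NB: this is not quite true, see the edit below). Therefore, you must use a positive lookahead expression ((?=...)) to match without removing the match. However, re.split ignores empty matches (this holds for Python versions before 3.7; from 3.7 on, re.split also splits on empty matches), so using only a lookahead expression does not work there. Instead, you will lose at least one character at each split (even trying to trick re with "boundary" matches (\b) does not work). If you do not mind losing one whitespace / non-word character at the end of each item (assuming you only split at non-word characters), you can use something like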

re.split(r"\W(?=some|illustrate)", mystr)

which gives

["just", "some stupid string to", "illustrate my question"]

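(note that the spaces after just and to are missing). You can then generate these regexes programmatically using str.join. Note that each splitter is escaped with re.escape, so that special characters in the items of splitters do not affect the meaning of the regex in any undesired way (imagine, for example, a ) in one of the strings, which would otherwise lead to a regex syntax error).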

the_regex = r"\W(?={})".format("|".join(re.escape(s) for s in splitters))
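
As a minimal sketch of the two steps above (build the pattern, then split), assuming Python 3: the helper names lookahead_alternation, split_dropping_separator and split_on_lookahead_only are made up for illustration and are not part of the original answer. The last helper relies on re.split also splitting on empty matches, which is the case from Python 3.7 on, so on those versions the lookahead alone is enough and no character is lost.

```python
import re

def lookahead_alternation(splitters):
    """Return '(?=a|b|...)' with every splitter escaped (hypothetical helper)."""
    return "(?={})".format("|".join(re.escape(s) for s in splitters))

def split_dropping_separator(text, splitters):
    # Consumes the one non-word character in front of each splitter.
    return re.split(r"\W" + lookahead_alternation(splitters), text)

def split_on_lookahead_only(text, splitters):
    # Requires Python 3.7+, where re.split also splits on empty matches.
    return re.split(lookahead_alternation(splitters), text)

mystr = "just some stupid string to illustrate my question"
splitters = ["some", "illustrate"]
print(split_dropping_separator(mystr, splitters))
print(split_on_lookahead_only(mystr, splitters))
# re.escape keeps the parentheses in "(b)" literal instead of a regex group
print(split_dropping_separator("a (b) c", ["(b)"]))
```

The first call loses the spaces after just and to, while the second keeps them.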

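Edit (HT to @Arkadiy): Grouping the actual match, i.e. using (\W) instead of \W, returns the non-word characters as separate items inserted into the list. Joining every two subsequent items then produces the desired list as well. You can then also drop the requirement of having a non-word character by using (.) instead of \W: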

the_new_regex = r"(.)(?={})".format("|".join(re.escape(s) for s in splitters))
the_split = re.split(the_new_regex, mystr)
# zip_longest is the Python 3 name; on Python 2 use itertools.izip_longest
the_actual_split = ["".join(x) for x in itertools.zip_longest(the_split[::2], the_split[1::2], fillvalue='')]

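Since normal text and helper characters alternate, the_split[::2] contains the normal split text and the_split[1::2] the helper characters. Then zip_longest from itertools (named izip_longest in Python 2 and zip_longest in Python 3) is used to combine each text item with the corresponding removed character, and the last item (which has no matching removed character) with the fillvalue, i.e. ''. Each tuple is then joined using "".join(x). Note that this requires itertools to be imported (you could of course do this in a simple loop, but itertools provides very clean solutions to these things).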
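
To make the alternation described above visible, the following sketch returns the intermediate list alongside the re-joined result. It assumes Python 3 (hence zip_longest), and the function name split_with_helper_char is made up for illustration.

```python
import itertools
import re

def split_with_helper_char(text, splitters):
    """Split in front of each splitter, re-attaching the captured character."""
    pattern = "(.)(?={})".format("|".join(re.escape(s) for s in splitters))
    parts = re.split(pattern, text)
    # parts alternates: text, captured char, text, captured char, ..., text
    texts, chars = parts[::2], parts[1::2]
    pairs = itertools.zip_longest(texts, chars, fillvalue="")
    return parts, ["".join(pair) for pair in pairs]

parts, result = split_with_helper_char(
    "just some stupid string to illustrate my question",
    ["some", "illustrate"],
)
print(parts)   # the raw, alternating list
print(result)  # the re-joined pieces
```

Keep in mind that . does not match a newline unless re.DOTALL is passed, so a splitter directly after a line break would not be split off by this pattern.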
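This leads to a further simplification of the regex: instead of capturing a helper character, the lookahead can be replaced with a simple matching group ((some|illustrate) instead of (.)(?=some|illustrate)):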

the_newest_regex = "({})".format("|".join(re.escape(s) for s in splitters))
the_raw_split = re.split(the_newest_regex, mystr)
# zip_longest is the Python 3 name; on Python 2 use itertools.izip_longest
the_actual_split = ["".join(x) for x in itertools.zip_longest([""] + the_raw_split[1::2], the_raw_split[::2], fillvalue='')]

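Here, the slice indices on the_raw_split have been swapped, because now the odd-indexed items (the captured splitters) must be prepended to the item that follows them instead of appended to the item before. Also note the [""] + part, which is necessary to pair the first item with "" so that the order comes out right.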
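
The same pairing can also be written without the [""] + trick. The following is a sketch of an alternative formulation, not the answer's own code (assuming Python 3; split_before is a hypothetical name). It walks over the raw split and glues each captured splitter onto the text that follows it.

```python
import re

def split_before(text, splitters):
    """Split text in front of every splitter, keeping the splitter."""
    pattern = "({})".format("|".join(re.escape(s) for s in splitters))
    raw = re.split(pattern, text)
    # raw alternates: text, splitter, text, splitter, ..., text
    pieces = [raw[0]]
    for splitter, following in zip(raw[1::2], raw[2::2]):
        pieces.append(splitter + following)
    return pieces

print(split_before("just some stupid string to illustrate my question",
                   ["some", "illustrate"]))
```

Like the zip-based version, this yields an empty first item when the text starts with one of the splitters.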
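(end of edit)
Alternatively, you can (if you want) use str.replace instead of re.sub for each splitter (I think this is a matter of preference in your case, but in general it is probably more efficient):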

for s in splitters:
    mystr = mystr.replace(s, "!!" + s)

Also, if you use a fixed token to indicate where to split, you do not need re.split, but can use str.split instead:

result = mystr.split("!!")
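
Put together, the marker-based variant might look like the following sketch. The function name split_with_marker and the marker parameter are illustrative assumptions; the last call shows why the marker must not already occur in the input.

```python
def split_with_marker(text, splitters, marker="!!"):
    """Insert marker before every splitter, then split on the marker."""
    for s in splitters:
        text = text.replace(s, marker + s)
    return text.split(marker)

splitters = ["some", "illustrate"]
print(split_with_marker("just some stupid string to illustrate my question",
                        splitters))
# A marker that is already present in the input produces a spurious split
print(split_with_marker("wow!! just some text", splitters))
```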

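What you could also do (instead of relying on the replacement token not being anywhere else in the string, or relying on every split position being preceded by a non-word character) is find the split strings in the input using str.find and use string slicing to extract the pieces: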

def split(string, splitters):
    while True:
        # Get the positions to split at for all splitters still in the string
        # that are not at the very front of the string. Searching from index 1
        # skips a splitter at the front but still finds a later repeat of it.
        split_positions = [i for i in (string.find(s, 1) for s in splitters) if i > 0]
        if len(split_positions) > 0:
            # There is still somewhere to split
            next_split = min(split_positions)
            yield string[:next_split] # Yield everything before that position
            string = string[next_split:] # Retain the rest of the string
        else:
            yield string # Yield the rest of the string
            break # Done.

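Here, the list comprehension that builds split_positions collects, for every splitter, the position at which it is found, keeping only splitters that still occur in the string (str.find returns -1 otherwise, so negative values are filtered out) and that are not at the very start (we have (possibly) just split there, so position 0 is ignored as well). If there is still somewhere to split, we yield (this is a generator function) everything up to, but excluding, the first splitter (at min(split_positions)) and replace the string with the remainder. If there is nothing left to split, we yield the last part of the string and exit the function. Because it uses yield, it is a generator function, so you need to use list to turn the result into an actual list.
Note that you could also replace each yield whatever with a call to some_list.append(whatever) (provided you defined some_list earlier) and return some_list at the very end. I do not consider that very good code style, though.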
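
For comparison, the list-building variant mentioned above could be sketched as follows. The name split_to_list is hypothetical, and the search deliberately starts at index 1 (find(s, 1)). That is an assumption made in this sketch so that a splitter sitting at the very front is skipped while a later repeat of the same splitter is still found.

```python
def split_to_list(string, splitters):
    """List-returning variant of the find-and-slice approach."""
    some_list = []
    while True:
        # Search from index 1: a match at index 0 is where we just split
        positions = [i for i in (string.find(s, 1) for s in splitters) if i > 0]
        if not positions:
            some_list.append(string)  # nothing left to split off
            return some_list
        next_split = min(positions)
        some_list.append(string[:next_split])
        string = string[next_split:]

print(split_to_list("just some stupid string to illustrate my question",
                    ["some", "illustrate"]))
print(split_to_list("a some b some c", ["some"]))
```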
TL;DR
If you are fine with using regular expressions, use

the_newest_regex = "({})".format("|".join(re.escape(s) for s in splitters))
the_raw_split = re.split(the_newest_regex, mystr)
# zip_longest is the Python 3 name; on Python 2 use itertools.izip_longest
the_actual_split = ["".join(x) for x in itertools.zip_longest([""] + the_raw_split[1::2], the_raw_split[::2], fillvalue='')]

Otherwise, the same can be achieved using str.find with the following split function:

def split(string, splitters):
    while True:
        # Get the positions to split at for all splitters still in the string
        # that are not at the very front of the string. Searching from index 1
        # skips a splitter at the front but still finds a later repeat of it.
        split_positions = [i for i in (string.find(s, 1) for s in splitters) if i > 0]
        if len(split_positions) > 0:
            # There is still somewhere to split
            next_split = min(split_positions)
            yield string[:next_split] # Yield everything before that position
            string = string[next_split:] # Retain the rest of the string
        else:
            yield string # Yield the rest of the string
            break # Done.

Regarding python - Split a string using a list of strings as the pattern, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/25412996/
