python - How to find the union of two strings and maintain the order

Tags: python string list union

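I have two strings and I want to find their union while maintaining the order of the words. I need this because I run OCR on an image with several methods and get different results. I want to combine all of the different results into a single result that has the most content.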

This is roughly what I'm after:

#example1
string1 = "This is a test trees are green roses are red"
string2 = "This iS a TEST trees 12.48.1952 anthony gonzalez"
finalstring = "this is a test trees are green roses are red 12.48.1952 anthony gonzalez" 

#example2
string2 = "This is a test trees are green roses are red"
string1 = "This iS a TEST trees 12.48.1952 anthony gonzalez"
finalstring = "this is a test trees are green roses are red 12.48.1952 anthony gonzalez"

#example3
string1 = "telephone conversation in some place big image on screen"
string2 = "roses are red telephone conversation in some place big image on screen"
finalstring = "roses are red telephone conversation in some place big image on screen"
#or the following - both are fine in this scenario.
finalstring = "telephone conversation in some place big image on screen roses are red "

Here is what I tried:

>>> string1 = "This is a test trees are green roses are red"
>>> string2 = "This iS a TEST trees 12.48.1952 anthony gonzalez"
>>> list1 = string1.split(" ")
>>> list2 = string2.split(" ")
>>> " ".join(list(set(list1) | set(list2))).lower()
'a gonzalez this is trees anthony roses green are test 12.48.1952 test is red'

Best answer

You can use difflib.SequenceMatcher for this:

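import difflib

def merge(l, r):
    # Compare the two sequences and walk the edit operations between them.
    m = difflib.SequenceMatcher(None, l, r)
    for o, i1, i2, j1, j2 in m.get_opcodes():
        if o == 'equal':
            # Shared part: emit it once.
            yield l[i1:i2]
        elif o == 'delete':
            # Present only in the left sequence.
            yield l[i1:i2]
        elif o == 'insert':
            # Present only in the right sequence.
            yield r[j1:j2]
        elif o == 'replace':
            # The sequences differ here: keep both parts, left first.
            yield l[i1:i2]
            yield r[j1:j2]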

Use it like this:

>>> string1 = 'This is a test trees are green roses are red'
>>> string2 = 'This iS a TEST trees 12.48.1952 anthony gonzalez'

>>> merged = merge(string1.lower().split(), string2.lower().split())
>>> ' '.join(' '.join(x) for x in merged)
'this is a test trees are green roses are red 12.48.1952 anthony gonzalez'
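To see why the word-level merge comes out in this order, it can help to look at the opcodes that SequenceMatcher reports for the two word lists. The following is only an illustrative sketch; the variable names `left`, `right`, `matcher` and `opcodes` are chosen here and are not part of the original answer.

```python
import difflib

# The two example strings, already lowercased and split into words.
left = "this is a test trees are green roses are red".split()
right = "this is a test trees 12.48.1952 anthony gonzalez".split()

matcher = difflib.SequenceMatcher(None, left, right)
opcodes = matcher.get_opcodes()

# Each opcode is (tag, i1, i2, j1, j2): left[i1:i2] relates to right[j1:j2].
for tag, i1, i2, j1, j2 in opcodes:
    print(tag, left[i1:i2], right[j1:j2])
```

For these inputs the matcher reports one 'equal' block covering the five shared leading words, followed by one 'replace' block covering the remaining words of each list. The merge emits the shared block once and then both sides of the 'replace' block, left side first.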

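If you want to perform the merge at the character level, just change the call so that it operates directly on the strings instead of on lists of words: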

>>> merged = merge(string1.lower(), string2.lower())
>>> ''.join(merged)
'this is a test trees 12.48.1952 arenthony gronzaleen roses are redz'

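This solution correctly maintains the order of the individual parts of each string. If both strings end with a common part but have different segments before that ending, both of those segments still appear before the common ending in the result. For example, merging A B D and A C D gives A B C D.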

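As a result, you can recover each original string, in the correct order, simply by removing parts of the result string. If you remove C from that example result, you get back the first string; if you remove B instead, you get back the second string. This also works for more complex merges.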
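The recovery property can be checked mechanically: each input should be a subsequence of the merged result. The sketch below assumes a word-level merge; `merge_flat` and `is_subsequence` are hypothetical helper names introduced here for illustration, with `merge_flat` following the same opcode handling as the answer's merge but returning one flat list.

```python
import difflib

def merge_flat(left, right):
    """Opcode-based merge returning a single flat list (hypothetical helper)."""
    merged = []
    matcher = difflib.SequenceMatcher(None, left, right)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # 'equal', 'delete' and 'replace' all contribute the left-hand part.
        if tag in ("equal", "delete", "replace"):
            merged.extend(left[i1:i2])
        # 'insert' and 'replace' contribute the right-hand part.
        if tag in ("insert", "replace"):
            merged.extend(right[j1:j2])
    return merged

def is_subsequence(part, whole):
    """True if the items of `part` appear in `whole` in the same order."""
    remaining = iter(whole)
    # `item in remaining` consumes the iterator up to the first match.
    return all(item in remaining for item in part)

first = "A B D".split()
second = "A C D".split()
merged = merge_flat(first, second)

print(merged)  # ['A', 'B', 'C', 'D']
print(is_subsequence(first, merged), is_subsequence(second, merged))  # True True
```

Because every opcode copies its left-hand slice, its right-hand slice, or both in their original order, both inputs remain subsequences of the merged result.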

Regarding python - How to find the union of two strings and maintain the order, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/37263682/
