python - How to efficiently remove consecutive duplicate words or phrases in a string

Tags: python python-3.x string


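I have a string that contains phrases repeated back to back; it may even be a single word that occurs several times in a row.

I have tried various approaches but could not find one that is efficient in both time and space.

Here is what I have tried: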

  1. groupby()
  2. re

import re
from itertools import groupby

String = "what type of people were most likely to be able to be able to be able to be able to be 1.35 ?"
s1 = " ".join([k for k, v in groupby(String.replace("</Sent>", "").split())])
s2 = re.sub(r'\b(.+)(\s+\1\b)+', r'\1', String)

Neither of them seems to work for my case.
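To see where the two attempts fall short, the sketch below runs them on the sample sentence. The wrapper names `dedupe_adjacent_words` and `dedupe_with_regex` are hypothetical and introduced only for illustration; the `groupby` and `re.sub` calls themselves are the ones from the question.

```python
import re
from itertools import groupby

def dedupe_adjacent_words(text):
    # groupby only merges neighbouring items that are equal,
    # so it can collapse "the the" but not a repeated multi-word phrase.
    return " ".join(word for word, _ in groupby(text.split()))

def dedupe_with_regex(text):
    # The greedy group (.+) grabs the LONGEST block that is followed by a copy of itself.
    return re.sub(r'\b(.+)(\s+\1\b)+', r'\1', text)

sentence = ("what type of people were most likely to be able to be able "
            "to be able to be able to be 1.35 ?")

print(dedupe_adjacent_words("the the cat sat sat down"))  # the cat sat down
print(dedupe_adjacent_words(sentence) == sentence)        # True: nothing was removed
print(dedupe_with_regex(sentence))
# what type of people were most likely to be able to be able to be 1.35 ?
```

On this input the greedy group captures "to be able to be able", so one pass of the regular expression only halves the run of four repetitions, while `groupby` leaves the sentence untouched because no single word is immediately repeated.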

My expected result:

What type of people were most likely to reach 1.35?

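These are some of the posts I consulted:

  1. Is there a way to remove duplicate and continuous words/phrases in a string? - did not work
  2. How can I remove duplicate words in a string with Python? - works partially, but I also need an approach that is optimal for large strings

Please do not mark my question as a duplicate of the posts above; I tried most of the implementations there and did not find an efficient solution.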

Best answer

I would go with this creative approach of looking for duplicates of increasing length:

text = "what type of people were most likely to be able to be able to be able to be able to be 1.35 ?"

def last_word_of(sequence, length):
    # return the last word of a space-joined sequence of `length` words
    return sequence.split(" ")[length - 1]

def combine_words(sequences, length):
    # extend every sequence by one word: append the last word of the right-hand
    # (overlapping) neighbour, which is the next word in the original sentence
    combined_inputs = []
    for i in range(len(sequences) - 1):
        combined_inputs.append(sequences[i] + " " + last_word_of(sequences[i + 1], length))
    return combined_inputs, length + 1

def remove_duplicates(sequences, length):
    # keep scanning as long as a duplicate of this length was found in the last pass
    found = True
    while found:
        found = False
        for i in range(len(sequences) - length):
            if sequences[i] == sequences[i + length]:  # found a duplicate piece of sentence!
                del sequences[i + 1:i + length + 1]  # remove the `length` overlapping sequences
                found = True
                break  # the list got shorter, so restart the scan
    return sequences

# start from the single words, i.e. sequences of word_length 1, so that
# directly repeated single words are checked as well
splitted_input = text.split(" ")
word_length = 1

intermediate_output = False

while len(splitted_input) > 1:
    # look whether two sequences of length n (a distance n apart) are equal;
    # if so, remove the n overlapping sequences
    splitted_input = remove_duplicates(splitted_input, word_length)
    if len(splitted_input) == 1:
        break  # a single sequence is left, nothing more to combine
    splitted_input, word_length = combine_words(splitted_input, word_length)  # make even bigger sequences
    if intermediate_output:
        print(splitted_input)
        print(word_length)

# in the end the list has length 1, with repetitions of every possible length removed
output = splitted_input[0]
print(output)

Output:

what type of people were most likely to be able to be 1.35 ?

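Even though this is not the desired output, I do not see how it could recognize that it should remove the "to be" (of length 2) that occurred 3 positions earlier.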
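The same increasing-length idea can also be sketched directly on the list of words, without building the overlapping word sequences as strings. This is not the code from the answer but a compact variant written under the same assumption, namely that only immediately adjacent repetitions count as duplicates; the function name `remove_consecutive_repeats` is hypothetical.

```python
def remove_consecutive_repeats(text):
    words = text.split()
    length = 1  # length (in words) of the repeated block currently searched for
    while 2 * length <= len(words):
        i = 0
        while i + 2 * length <= len(words):
            # compare the block starting at i with the block right after it
            if words[i:i + length] == words[i + length:i + 2 * length]:
                # drop the second copy and re-check the same position,
                # so runs of three or more repetitions collapse as well
                del words[i + length:i + 2 * length]
            else:
                i += 1
        length += 1
    return " ".join(words)

sentence = ("what type of people were most likely to be able to be able "
            "to be able to be able to be 1.35 ?")
print(remove_consecutive_repeats(sentence))
# what type of people were most likely to be able to be 1.35 ?
```

Because each block comparison costs up to `length` word comparisons and every length up to half the sentence is tried, the worst case is roughly cubic in the number of words, so for very large strings it may be worth capping the maximum block length.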

Regarding python - How to efficiently remove consecutive duplicate words or phrases in a string, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/57424661/
