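I have a string that contains repeated phrases, or possibly even a single word repeated several times in a row.
I have tried various approaches but could not find one that is efficient in both time and space.
Here is what I have tried:
- groupby()
- re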
    import re
    from itertools import groupby

    String = "what type of people were most likely to be able to be able to be able to be able to be 1.35 ?"
    s1 = " ".join([k for k, v in groupby(String.replace("</Sent>", "").split())])
    s2 = re.sub(r'\b(.+)(\s+\1\b)+', r'\1', String)
Neither of them seems to work for my case.
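To make the failure concrete, here is a small sketch of what the two attempts return for the sample string. The names `text`, `by_groupby` and `by_regex` are chosen here for illustration only.

```python
import re
from itertools import groupby

text = ("what type of people were most likely to be able to be able "
        "to be able to be able to be 1.35 ?")

# groupby only merges identical *adjacent* words. No word in this sentence is
# directly followed by itself, so the sentence comes back unchanged.
by_groupby = " ".join(k for k, _ in groupby(text.split()))

# The greedy (.+) captures the longest run that is immediately followed by a
# copy of itself and keeps one copy, so shorter repeats inside it survive.
by_regex = re.sub(r'\b(.+)(\s+\1\b)+', r'\1', text)

print(by_groupby == text)
print(by_regex)
```

So `groupby` is too narrow for repeated phrases, and the regex stops after collapsing a single level of repetition.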
My expected result:
what type of people were most likely to be 1.35 ?
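These are some of the posts I referred to:
- Is there a way to remove duplicate and continuous words/phrases in a string? - did not work
- How can I remove duplicate words in a string with Python? - works partially, but I also need an approach that is efficient for large strings
Please don't mark my question as a duplicate of the posts above; I tried most of the implementations there and did not find a solution that works.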
I would take this creative approach of looking for duplicates of increasing length:
    text = "what type of people were most likely to be able to be able to be able to be able to be 1.35 ?"

    def combine_words(sequences, length):
        """Grow every sequence of `length` words into one of `length + 1` words."""
        if len(sequences) <= 1:
            return sequences, length  # a single sequence cannot be extended
        combined = []
        for i in range(len(sequences) - 1):
            # Append the last word of the right-hand (overlapping) neighbour,
            # which is the next word in the original sentence.
            combined.append(sequences[i] + " " + last_word_of(sequences[i + 1], length))
        return combined, length + 1

    def remove_duplicates(sequences, length):
        """Drop the sequences that make up a direct repeat of `length` words."""
        found = True
        while found:  # after a removal, look for another duplicate of the same length
            found = False
            for i in range(len(sequences) - length):
                if sequences[i] == sequences[i + length]:  # found a repeated piece of sentence
                    del sequences[i + 1:i + length + 1]  # remove the overlapping sequences
                    found = True
                    break  # the list got shorter, so restart the scan
        return sequences

    def last_word_of(sequence, length):
        """Return the last word of a sequence made up of `length` words."""
        return sequence.split(" ")[length - 1]

    # Start from single words; sequences[i] always holds the word_length adjacent
    # words that begin at position i of the (partly cleaned) sentence.
    sequences = text.split(" ")
    word_length = 1
    intermediate_output = False
    while len(sequences) > 1:
        # If two sequences of word_length words that lie word_length apart are
        # equal, remove the word_length overlapping sequences.
        sequences = remove_duplicates(sequences, word_length)
        sequences, word_length = combine_words(sequences, word_length)  # make even bigger sequences
        if intermediate_output:
            print(sequences)
            print(word_length)
    output = sequences[0]  # a single string with direct repeats of every length removed
    print(output)
This outputs the fluent sentence:
what type of people were most likely to be able to be 1.35 ?
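The same idea can also be written more compactly by comparing slices of the word list directly instead of building overlapping strings. This is only a sketch; `collapse_repeats` is a name made up for this illustration, and it splits on any whitespace rather than on single spaces.

```python
def collapse_repeats(text):
    """Collapse directly repeated word sequences, shortest sequences first."""
    words = text.split()
    n = 1  # length, in words, of the repeated sequence being searched for
    while 2 * n <= len(words):
        i = 0
        while i + 2 * n <= len(words):
            if words[i:i + n] == words[i + n:i + 2 * n]:
                # Drop the second copy and check the same position again,
                # in case the sequence was repeated more than twice.
                del words[i + n:i + 2 * n]
            else:
                i += 1
        n += 1
    return " ".join(words)


sample = ("what type of people were most likely to be able to be able "
          "to be able to be able to be 1.35 ?")
print(collapse_repeats(sample))
```

Each pass compares slices of `n` words, so the work grows roughly quadratically with the number of words in the worst case; for very large strings it may be worth capping `n` at the longest phrase you expect to see repeated.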
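Even though this is not the desired output, I don't see how an algorithm could recognize that it should remove the "to be" (of length 2) that occurred 3 positions earlier.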