I have two strings and I want to find all the common words. For example,
s1 = 'Today is a good day, it is a good idea to have a walk.'
s2 = 'Yesterday was not a good day, but today is good, shall we have a walk?'
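Consider matching s1 against s2:
'Today is' matches 'today is', but 'Today is a' does not match anything in s2. So 'today is' is one of the common runs of consecutive words. Similarly, we have 'a good day', 'is', 'a good', 'have a walk'. So the common words are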
common = ['today is', 'a good day', 'is', 'a good', 'have a walk']
Can we do this with a regular expression?
Many thanks.
import string

s1 = 'Today is a good day, it is a good idea to have a walk.'
s2 = 'Yesterday was not a good day, but today is good, shall we have a walk?'

# Remove punctuation, lowercase, and split into words.
table = str.maketrans('', '', string.punctuation)
sw1 = s1.translate(table).lower().split()
sw2 = s2.translate(table).lower().split()
print(sw1, sw2)

z = []  # every common run found, duplicates included
for d in range(len(sw1)):  # d: where the candidate run starts in sw1
    i = d  # i: next word of sw1 to compare
    r = []  # words matched so far in the current run
    for w in sw2:
        if i < len(sw1) and sw1[i] == w:
            r.append(w)  # same word: extend the run
            i += 1
            continue
        if r:  # run broken: store it and restart from d
            z.append(' '.join(r))
            r = []
            i = d
        if sw1[d] == w:  # the word that broke the run may start a new one
            r.append(w)
            i = d + 1
    if r:  # run still open when sw2 ends
        z.append(' '.join(r))

print(list(set(z)))
# Two nested loops: about len(sw1) * len(sw2) word comparisons.
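
If only the runs that cannot be extended by a word on either side are wanted, a dynamic-programming table is one alternative. The following is a minimal sketch, not the approach used above: `common_runs` is a hypothetical helper name, and tokenizing with the regex `\w+` is an assumption about what counts as a word. The regex only splits the text; the matching itself is still a plain loop.

```python
import re

def common_runs(a, b):
    """Return the maximal runs of consecutive words shared by a and b."""
    # Assumption: a "word" is any run of word characters, case-insensitive.
    wa = re.findall(r"\w+", a.lower())
    wb = re.findall(r"\w+", b.lower())
    # prev[j + 1] holds the length of the common run ending at
    # wa[i - 1] and wb[j] (the previous row of the table).
    prev = [0] * (len(wb) + 1)
    runs = []
    for i, word in enumerate(wa):
        cur = [0] * (len(wb) + 1)
        for j, other in enumerate(wb):
            if word != other:
                continue
            cur[j + 1] = prev[j] + 1
            # The run is maximal if it cannot grow to the right:
            # either sentence ends here, or the next words differ.
            at_end = i + 1 == len(wa) or j + 1 == len(wb)
            if at_end or wa[i + 1] != wb[j + 1]:
                n = cur[j + 1]
                run = " ".join(wa[i - n + 1 : i + 1])
                if run not in runs:
                    runs.append(run)
        prev = cur
    return runs

s1 = 'Today is a good day, it is a good idea to have a walk.'
s2 = 'Yesterday was not a good day, but today is good, shall we have a walk?'
print(common_runs(s1, s2))
```

For the two example strings this returns the five phrases listed in the question plus the single-word runs 'a' and 'good', so filter the result if those are unwanted.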