Goal: implement an algorithm that, given strings `a` and `b`, returns the shortest substring of `a` containing all characters of `b`. The string `b` can contain duplicates.
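To pin down the expected behaviour, including duplicates in `b`, here is a minimal brute-force sketch. The name `minsub_bruteforce` is hypothetical and not part of the original post. It returns a string rather than a list of characters, and it is roughly quadratic, so it is only meant as a reference to compare faster versions against.

```python
import collections

def minsub_bruteforce(a, b):
    """Hypothetical reference: shortest substring of a covering b, '' if none."""
    need = collections.Counter(b)
    if not need:
        return ''
    best = ''
    for start in range(len(a)):
        have = collections.Counter()
        for end in range(start, len(a)):
            have[a[end]] += 1
            # Counter subtraction keeps only positive counts, so an empty
            # result means every character of b is covered, duplicates included.
            if not need - have:
                candidate = a[start:end + 1]
                if not best or len(candidate) < len(best):
                    best = candidate
                break  # extending further from this start only gets longer
    return best

print(minsub_bruteforce('this is a test string', 'tist'))
```

With the input used later in this post, the call prints `t stri`, which is the only window of length 6 that contains two `t`, one `i` and one `s`.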
In the linked article, the algorithm only finds the length of the shortest substring, but that is a minor change.
Here is my implementation:
    import collections

    def issubset(c1, c2):
        '''Return True if c1 is a subset of c2, False otherwise.'''
        return not c1 - (c1 & c2)

    def min_idx(seq, target):
        '''Least index of seq such that seq[idx] is contained in target.'''
        for idx, elem in enumerate(seq):
            if elem in target:
                return idx

    def minsub(a, b):
        target_hist = collections.Counter(b)
        current_hist = collections.Counter()

        # Skip all the useless characters
        idx = min_idx(a, target_hist)
        if idx is None:
            return []

        a = a[idx:]

        # Build a base substring
        i = iter(a)
        current = []
        while not issubset(target_hist, current_hist):
            t = next(i)
            current.append(t)
            current_hist[t] += 1

        minlen = len(current)
        shortest = current

        for t in i:
            current.append(t)

            # Shorten the substring from the front as much as possible
            if t == current[0]:
                idx = min_idx(current[1:], target_hist) + 1
                current = current[idx:]

            if len(current) < minlen:
                minlen = len(current)
                shortest = current

        return current
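Unfortunately, it doesn't work. For example,

    >>> minsub('this is a test string', 'tist')
    ['s', ' ', 'i', 's', ' ', 'a', ' ', 't', 'e', 's', 't', ' ', 's', 't', 'r', 'i', 'n', 'g']

What am I missing?
Side note: I'm not sure that my implementation is O(n), but that is a separate question. For now, I'm only looking to fix my implementation.
EDIT: a solution that seems to work: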
    import collections

    def issubset(c1, c2):
        '''Return True if c1 is a subset of c2, False otherwise.'''
        return not c1 - (c1 & c2)

    def min_idx(seq, target):
        '''Least index of seq such that seq[idx] is contained in target.'''
        for idx, elem in enumerate(seq):
            if elem in target:
                return idx

    def minsub(a, b):
        target_hist = collections.Counter(b)
        current_hist = collections.Counter()

        # Skip all the useless characters
        idx = min_idx(a, target_hist)
        if idx is None:
            return []

        a = a[idx:]

        # Build a base substring
        i = iter(a)
        current = []
        while not issubset(target_hist, current_hist):
            t = next(i)
            current.append(t)
            current_hist[t] += 1

        minlen = len(current)
        shortest = current[:]

        for t in i:
            current.append(t)

            # Shorten the substring from the front as much as possible
            if t == current[0]:
                current_hist = collections.Counter(current)
                for idx, elem in enumerate(current[1:], 1):
                    if not current_hist[elem] - target_hist[elem]:
                        break
                    current_hist[elem] -= 1
                current = current[idx:]

            if len(current) < minlen:
                minlen = len(current)
                shortest = current[:]

        return shortest
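Best answer
The problem is in this step, which runs when we append a character to `current` and it matches the first character:
"remove the leftmost character and all other extra characters after leftmost character."
This value of `idx`,

    idx = min_idx(current[1:], target_hist) + 1

is sometimes lower than expected: `idx` should keep increasing as long as the remaining window still covers the target, that is, as long as `target_hist` is a subset of `current_hist`. So we need to keep `current_hist` up to date in order to compute the correct value of `idx`. Also, `minsub` should return `shortest` rather than `current`.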
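The shrinking rule described above can be tried in isolation. The following is only a sketch under assumptions: `covers` and `shrink_front` are hypothetical helper names, not functions from the question or from this answer, and the histogram is rebuilt from the window for simplicity.

```python
import collections

def covers(window_hist, target_hist):
    """True if window_hist holds at least target_hist[c] of every character c."""
    # Counter subtraction drops zero and negative counts, so the difference
    # is empty exactly when nothing required is missing.
    return not target_hist - window_hist

def shrink_front(window, target_hist):
    """Drop leading characters while the rest still covers target_hist."""
    window_hist = collections.Counter(window)
    idx = 0
    while idx < len(window):
        # Histogram of the window without its current first character.
        without_first = window_hist - collections.Counter(window[idx])
        if not covers(without_first, target_hist):
            break  # removing window[idx] would leave the target uncovered
        window_hist[window[idx]] -= 1
        idx += 1
    return window[idx:]

print(shrink_front('this is a test', collections.Counter('tist')))
```

For the window `this is a test`, the formula from the question yields `idx = 2` and keeps `is is a test`, while the loop above advances to index 5 and keeps `is a test`.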
    # Relies on `import collections`, issubset() and min_idx() from the question.
    def minsub(a, b):
        target_hist = collections.Counter(b)
        current_hist = collections.Counter()

        # Skip all the useless characters
        idx = min_idx(a, target_hist)
        if idx is None:
            return []

        a = a[idx:]

        # Build a base substring
        i = iter(a)
        current = []
        while not issubset(target_hist, current_hist):
            t = next(i)
            current.append(t)
            if t in target_hist:
                current_hist[t] += 1

        minlen = len(current)
        # Copy the list: `current` is appended to below, and an alias would
        # silently grow along with it.
        shortest = current[:]

        for t in i:
            current.append(t)
            current_hist[t] += 1

            # Shorten the substring from the front as much as possible
            if t == current[0]:
                #idx = min_idx(current[1:], target_hist) + 1
                idx = 0
                # Drop characters while the window still covers the target
                while issubset(target_hist, current_hist):
                    u = current[idx]
                    current_hist[u] -= 1
                    idx += 1
                # The last removal broke the cover, so undo it
                idx -= 1
                u = current[idx]
                current_hist[u] += 1
                current = current[idx:]

            if len(current) < minlen:
                minlen = len(current)
                shortest = current[:]

        return shortest

    In [9]: minsub('this is a test string', 'tist')
    Out[9]: ['t', ' ', 's', 't', 'r', 'i']

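On the side note about O(n): for comparison, here is a sketch of the common two-index sliding window, which is a different technique from the code above and not the asker's or the answerer's implementation. The names `minsub_window`, `need` and `missing` are illustrative assumptions. Each index of `a` enters and leaves the window at most once, so the running time is linear in `len(a) + len(b)`.

```python
import collections

def minsub_window(a, b):
    """Sketch: shortest substring of a covering b (with duplicates), '' if none."""
    if not b:
        return ''
    need = collections.Counter(b)  # characters still owed; negative = surplus
    missing = len(b)               # total number of characters still owed
    best = None                    # (start, end) of the best window so far
    left = 0
    for right, ch in enumerate(a):
        if need[ch] > 0:
            missing -= 1
        need[ch] -= 1
        if missing == 0:
            # Drop surplus characters from the front of the window.
            while need[a[left]] < 0:
                need[a[left]] += 1
                left += 1
            if best is None or right - left < best[1] - best[0]:
                best = (left, right)
            # Give up the leftmost required character and keep scanning.
            need[a[left]] += 1
            missing += 1
            left += 1
    return '' if best is None else a[best[0]:best[1] + 1]

print(minsub_window('this is a test string', 'tist'))
```

The call prints `t stri`, the same window as the list returned in `Out[9]`. Unlike `minsub`, this sketch returns a string and returns an empty string when `a` does not contain all characters of `b`.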
Regarding "python - finding the shortest substring containing specific characters in linear time", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/31608133/