python - Finding the shortest substring containing specific characters in linear time

Goal: implement an algorithm that, given strings a and b, returns the shortest substring of a containing all characters of b. The string b can contain duplicates, and each duplicate must be matched.

The algorithm is basically this one:
http://www.geeksforgeeks.org/find-the-smallest-window-in-a-string-containing-all-characters-of-another-string/

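In the linked article the algorithm only finds the length of the shortest substring, but returning the substring itself is a small change.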

Here is my implementation:

import collections

def issubset(c1, c2):
    '''Return True if c1 is a subset of c2, False otherwise.'''
    return not c1 - (c1 & c2)


def min_idx(seq, target):
    '''Least index of seq such that seq[idx] is contained in target.'''
    for idx, elem in enumerate(seq):
        if elem in target:
            return idx


def minsub(a, b):
    target_hist = collections.Counter(b)
    current_hist = collections.Counter()
    # Skip all the useless characters
    idx = min_idx(a, target_hist)
    if idx is None:
        return []
    a = a[idx:]
    # Build a base substring
    i = iter(a)
    current = []
    while not issubset(target_hist, current_hist):
        t = next(i)
        current.append(t)
        current_hist[t] += 1
    minlen = len(current)
    shortest = current
    for t in i:
        current.append(t)
        # Shorten the substring from the front as much as possible
        if t == current[0]:
            idx = min_idx(current[1:], target_hist) + 1
            current = current[idx:]
            if len(current) < minlen:
                minlen = len(current)
                shortest = current
    return current

Unfortunately, it doesn't work. For example,

>>> minsub('this is a test string', 'tist')
['s', ' ', 'i', 's', ' ', 'a', ' ', 't', 'e', 's', 't', ' ', 's', 't', 'r', 'i', 'n', 'g']

What am I missing?
Side note: I'm not sure my implementation is O(n), but that is a separate question. For now, I'm only looking to fix my implementation.

Edit: a solution that seems to work:

import collections


def issubset(c1, c2):
    '''Return True if c1 is a subset of c2, False otherwise.'''
    return not c1 - (c1 & c2)


def min_idx(seq, target):
    '''Least index of seq such that seq[idx] is contained in target.'''
    for idx, elem in enumerate(seq):
        if elem in target:
            return idx


def minsub(a, b):
    target_hist = collections.Counter(b)
    current_hist = collections.Counter()
    # Skip all the useless characters
    idx = min_idx(a, target_hist)
    if idx is None:
        return []
    a = a[idx:]
    # Build a base substring
    i = iter(a)
    current = []
    while not issubset(target_hist, current_hist):
        t = next(i)
        current.append(t)
        current_hist[t] += 1
    minlen = len(current)
    shortest = current[:]
    for t in i:
        current.append(t)
        # Shorten the substring from the front as much as possible
        if t == current[0]:
            current_hist = collections.Counter(current)
            for idx, elem in enumerate(current[1:], 1):
                if not current_hist[elem] - target_hist[elem]:
                    break
                current_hist[elem] -= 1
            current = current[idx:]
            if len(current) < minlen:
                minlen = len(current)
                shortest = current[:]
    return shortest

Best answer

The problem is in this step, when we add a character to current and it matches the first character:

remove the leftmost character and all other extra characters after left most character.

This value of idx

            idx = min_idx(current[1:], target_hist) + 1

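is sometimes lower than expected: idx should keep increasing as long as target_hist is still a subset of current_hist, that is, as long as the remaining window still covers every character of b. We therefore need to keep current_hist up to date in order to compute the correct value of idx. In addition, minsub should return shortest rather than current.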
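To make the difference concrete, here is a minimal sketch of the trimming rule in isolation. The helper name `droppable_prefix` is hypothetical and not part of the question's code; it assumes the window already covers the target. It counts how many leading characters are surplus, where a character is surplus if the window holds more copies of it than the target requires.

```python
import collections


def droppable_prefix(window, target_hist):
    """Count leading items of window that can be dropped while the
    remainder still covers target_hist (hypothetical helper).

    Assumes window already contains every character of target_hist
    with at least the required multiplicity.
    """
    # surplus[c] = copies of c in the window beyond what the target needs
    surplus = collections.Counter(window)
    surplus.subtract(target_hist)
    idx = 0
    while idx < len(window) and surplus[window[idx]] > 0:
        surplus[window[idx]] -= 1
        idx += 1
    return idx


window = 'this is a test'
print(droppable_prefix(window, collections.Counter('tist')))  # 5, leaving 'is a test'
```

On this window, `min_idx(current[1:], target_hist) + 1` gives 2, because it stops at the first character that occurs in b at all, even though that 'i' and the following 's' are still surplus.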

def minsub(a, b):
    target_hist = collections.Counter(b)
    current_hist = collections.Counter()
    # Skip all the useless characters
    idx = min_idx(a, target_hist)
    if idx is None:
        return []
    a = a[idx:]
    # Build a base substring
    i = iter(a)
    current = []
    while not issubset(target_hist, current_hist):
        t = next(i)
        current.append(t)
        if t in target_hist:
            current_hist[t] += 1
    minlen = len(current)
    # Copy the list: current is appended to below, and an alias would grow with it
    shortest = current[:]
    #current = []
    for t in i:
        current.append(t)
        current_hist[t] += 1
        # Shorten the substring from the front as much as possible
        if t == current[0]:
            #idx = min_idx(current[1:], target_hist) + 1
            idx = 0
            while issubset(target_hist, current_hist):
                u = current[idx]
                current_hist[u] -= 1
                idx += 1
            idx -= 1
            u = current[idx]
            current_hist[u] += 1
            current = current[idx:]
        if len(current) < minlen:
            minlen = len(current)
            shortest = current[:]
    return shortest

In [9]: minsub('this is a test string', 'tist')
Out[9]: ['t', ' ', 's', 't', 'r', 'i']
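On the side note about complexity: the code above re-evaluates issubset while shrinking and copies the list with current[idx:], so it may not be strictly linear. The following is a sketch of the usual two-pointer formulation, not the code from the question. The name `min_window` and the choice to return a string (empty when no window exists) are assumptions made for illustration.

```python
import collections


def min_window(a, b):
    """Shortest substring of a containing every character of b,
    counting duplicates. Returns '' if no such substring exists."""
    if not b:
        return ''
    # need[c] > 0: still missing copies of c; need[c] < 0: surplus copies
    need = collections.Counter(b)
    missing = len(b)          # characters of b not yet covered by the window
    best = None               # (start, end) of the best window, inclusive
    left = 0
    for right, ch in enumerate(a):
        if need[ch] > 0:
            missing -= 1
        need[ch] -= 1
        if missing == 0:
            # Drop surplus characters from the front of the window
            while need[a[left]] < 0:
                need[a[left]] += 1
                left += 1
            if best is None or right - left < best[1] - best[0]:
                best = (left, right)
            # Give up the leftmost required character and keep scanning
            need[a[left]] += 1
            missing += 1
            left += 1
    return '' if best is None else a[best[0]:best[1] + 1]


print(min_window('this is a test string', 'tist'))  # t stri
```

Each index of a is visited at most once by right and at most once by left, and every step does constant work on the counter, so the running time is O(len(a) + len(b)). Unlike the list-based version, this sketch also returns an empty result instead of raising StopIteration when a does not contain all characters of b.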

Regarding "python - Finding the shortest substring containing specific characters in linear time", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/31608133/
