python - Finding the shortest substring containing given characters in linear time

Tags: python string algorithm

Goal: implement an algorithm that, given strings a and b, returns the shortest substring of a containing all characters of b. The string b can contain duplicates, and each duplicate must be matched by a separate occurrence in the substring.
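
To pin down the expected behaviour, here is a brute-force reference sketch. It is deliberately not linear time and is only meant as an oracle to compare other versions against. The name minsub_bruteforce and the choice to return an empty string when no substring exists are assumptions made for this illustration, not part of the original code.

```python
import collections


def minsub_bruteforce(a, b):
    """Shortest substring of a that contains every character of b,
    counting duplicates. Returns '' if there is none (assumed convention)."""
    if not b:
        return ''
    need = collections.Counter(b)
    best = None
    for start in range(len(a)):
        have = collections.Counter()
        for end in range(start, len(a)):
            have[a[end]] += 1
            # Counter subtraction keeps only positive counts, so the result
            # is empty exactly when a[start:end + 1] covers all of b.
            if not need - have:
                if best is None or end - start + 1 < len(best):
                    best = a[start:end + 1]
                break  # a longer window from the same start cannot be shorter
    return best if best is not None else ''


print(minsub_bruteforce('this is a test string', 'tist'))  # t stri
```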

The algorithm is basically this one:
http://www.geeksforgeeks.org/find-the-smallest-window-in-a-string-containing-all-characters-of-another-string/

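In the linked article the algorithm only finds the length of the shortest substring, but returning the substring itself is a minor change.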

Here is my implementation:

import collections

def issubset(c1, c2):
    '''Return True if c1 is a subset of c2, False otherwise.'''
    return not c1 - (c1 & c2)


def min_idx(seq, target):
    '''Least index of seq such that seq[idx] is contained in target.'''
    for idx, elem in enumerate(seq):
        if elem in target:
            return idx


def minsub(a, b):
    target_hist = collections.Counter(b)
    current_hist = collections.Counter()
    # Skip all the useless characters
    idx = min_idx(a, target_hist)
    if idx is None:
        return []
    a = a[idx:]
    # Build a base substring
    i = iter(a)
    current = []
    while not issubset(target_hist, current_hist):
        t = next(i)
        current.append(t)
        current_hist[t] += 1
    minlen = len(current)
    shortest = current
    for t in i:
        current.append(t)
        # Shorten the substring from the front as much as possible
        if t == current[0]:
            idx = min_idx(current[1:], target_hist) + 1
            current = current[idx:]
            if len(current) < minlen:
                minlen = len(current)
                shortest = current
    return current

Unfortunately, it doesn't work. For example,

>>> minsub('this is a test string', 'tist')
['s', ' ', 'i', 's', ' ', 'a', ' ', 't', 'e', 's', 't', ' ', 's', 't', 'r', 'i', 'n', 'g']

What am I missing?
Side note: I'm not sure my implementation is O(n), but that is a separate issue. For now, I just want to fix my implementation.

Edit: a solution that seems to work:

import collections


def issubset(c1, c2):
    '''Return True if c1 is a subset of c2, False otherwise.'''
    return not c1 - (c1 & c2)


def min_idx(seq, target):
    '''Least index of seq such that seq[idx] is contained in target.'''
    for idx, elem in enumerate(seq):
        if elem in target:
            return idx


def minsub(a, b):
    target_hist = collections.Counter(b)
    current_hist = collections.Counter()
    # Skip all the useless characters
    idx = min_idx(a, target_hist)
    if idx is None:
        return []
    a = a[idx:]
    # Build a base substring
    i = iter(a)
    current = []
    while not issubset(target_hist, current_hist):
        t = next(i)
        current.append(t)
        current_hist[t] += 1
    minlen = len(current)
    shortest = current[:]
    for t in i:
        current.append(t)
        # Shorten the substring from the front as much as possible
        if t == current[0]:
            current_hist = collections.Counter(current)
            for idx, elem in enumerate(current[1:], 1):
                if not current_hist[elem] - target_hist[elem]:
                    break
                current_hist[elem] -= 1
            current = current[idx:]
            if len(current) < minlen:
                minlen = len(current)
                shortest = current[:]
    return shortest

Best answer

The problem is in this step, which runs when we append a character to current and it matches the first character:

remove the leftmost character and all other extra characters after left most character.

This value of idx

            idx = min_idx(current[1:], target_hist) + 1

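is sometimes lower than expected: idx should keep increasing as long as target_hist is still a subset of current_hist, that is, as long as the window still covers every character of b. So we need to keep current_hist up to date in order to compute the correct value of idx. Also, minsub should return shortest instead of current.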
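
To see the difference on a concrete window, here is a small sketch comparing the two ways of choosing the cut point. The helper names first_target_idx and surplus_idx are made up for this illustration. The second helper compares per-character counts instead of calling issubset, which should give the same cut as long as the window already covers a non-empty target.

```python
import collections


def first_target_idx(window, target_hist):
    """Cut used in the question: index of the first target character after
    position 0, regardless of how many copies are still needed."""
    for idx, elem in enumerate(window[1:], 1):
        if elem in target_hist:
            return idx


def surplus_idx(window, target_hist):
    """Histogram-based cut: skip each leading character that the window
    holds in excess of what target_hist requires."""
    hist = collections.Counter(window)
    idx = 0
    while idx < len(window) and hist[window[idx]] > target_hist[window[idx]]:
        hist[window[idx]] -= 1
        idx += 1
    return idx


target = collections.Counter('tist')
window = 'this is a test'
print(window[first_target_idx(window, target):])  # is is a test
print(window[surplus_idx(window, target):])       # is a test
```

The first cut stops at the first 'i' even though a later 'i' makes it redundant, so the window stays longer than necessary.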

def minsub(a, b):
    target_hist = collections.Counter(b)
    current_hist = collections.Counter()
    # Skip all the useless characters
    idx = min_idx(a, target_hist)
    if idx is None:
        return []
    a = a[idx:]
    # Build a base substring
    i = iter(a)
    current = []
    while not issubset(target_hist, current_hist):
        t = next(i)
        current.append(t)
        if t in target_hist:
            current_hist[t] += 1
    minlen = len(current)
    shortest = current[:]  # copy, because current is extended in place below
    #current = []
    for t in i:
        current.append(t)
        current_hist[t] += 1
        # Shorten the substring from the front as much as possible
        if t == current[0]:
            #idx = min_idx(current[1:], target_hist) + 1
            idx = 0
            while issubset(target_hist, current_hist):
                u = current[idx]
                current_hist[u] -= 1
                idx += 1
            idx -= 1
            u = current[idx]
            current_hist[u] += 1
            current = current[idx:]
        if len(current) < minlen:
            minlen = len(current)
            shortest = current[:]
    return shortest
In [9]: minsub('this is a test string', 'tist')
Out[9]: ['t', ' ', 's', 't', 'r', 'i']
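
On the side note about O(n): the version above calls issubset inside the shrinking loop and re-slices current, so it is probably not strictly linear. Below is a minimal sketch of the usual two-index sliding window, which touches each position of a at most twice. This is a different formulation from the code above, not a drop-in patch: the name minsub_linear, the missing counter, and returning a string (with '' when nothing matches) are all assumptions of this sketch.

```python
import collections


def minsub_linear(a, b):
    """Shortest substring of a covering all characters of b, or ''."""
    if not b:
        return ''
    need = collections.Counter(b)  # outstanding demand; negative means surplus
    missing = len(b)               # characters of b not yet covered
    best_start, best_end = 0, None
    left = 0
    for right, ch in enumerate(a):
        if need[ch] > 0:
            missing -= 1
        need[ch] -= 1
        if missing == 0:
            # Drop surplus characters from the left edge of the window.
            while need[a[left]] < 0:
                need[a[left]] += 1
                left += 1
            if best_end is None or right - left < best_end - best_start:
                best_start, best_end = left, right
            # Give up the leftmost required character and keep scanning.
            need[a[left]] += 1
            missing += 1
            left += 1
    return '' if best_end is None else a[best_start:best_end + 1]


print(minsub_linear('this is a test string', 'tist'))  # t stri
```

Because left and right only ever move forward, the total work is proportional to len(a) + len(b), assuming constant-time Counter lookups.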

Regarding python - Finding the shortest substring containing given characters in linear time, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/31608133/
