string - 给定每个字符计数的字符串的子串数量

标签 string algorithm substring counting sliding-window

给定一个字符串 (s) 和一个整数 k,我们需要找到所有不同字符恰好出现 k 次的子字符串的数量。

示例:s =“aabbcc”,k = 2 输出:6

子字符串 [aa、bb、cc、aabb、bbcc 和 aabbcc] 包含频率为 2 的不同字符。

我能想到的方法是遍历所有子串并存储当前子串的频率,当频率等于k时将结果递增。这将导致最坏情况的复杂度为 O(n*n),其中 n 是字符串 s 的长度。

有没有更好的方法来解决这个问题?

最佳答案

我们可以在 O(n * log(size_of_alphabet)) 中解决这个问题。让f(i) 表示以第i 字符结尾的最有效子字符串。然后:

f(i) ->
  1 + f(j - 1)
  
where j is the rightmost index smaller
than or equal to i where s[j..i] is a
valid substring and (j - 1) is inside
the current window. Call s[j..i] the
"minimal" valid substring ending at
index i.

我们的窗口的一个不变量是,如果某个字符被看到 k + 1 次,我们会将左边界移动到窗口中该字符最左边的实例之上。这保证了当前窗口中有效子字符串的连接字符串中的任何两个子字符串不能具有共享字符,因此仍然是有效的连接。

每次我们到达字符c的第k个实例时,最右边的索引小于或等于i,其中s [j..i] 是一个有效的子字符串,必须从窗口中计数小于 k 的所有字符的右侧开始。为了找到最右边的索引,我们可能还需要向前移动到窗口中已经看到的有效相邻子字符串。

为了找到该索引,我们可以维护一个 max indexed-heap 来存储当前窗口中计数小于 k 的每个不同字符的最右边实例,并按其索引确定优先级,这样我们的 j 始终位于堆根的右侧(或者堆为空)。堆已建立索引,这允许我们删除 O(log(size_of_alphabet)) 中的特定元素。

我们还保留窗口中已经看到的有效最小子字符串的左右边界索引。我们可以使用双端队列来进行 O(1) 更新,因为有效的子字符串可能出现在另一个子字符串的右侧或包含现有子字符串。我们保留左边界的 HashMap 以进行 O(1) 查找。

此外,我们必须保留窗口中每个不同字符的计数,以维持我们的不变性,即 k 之上没有这样的计数,以及它们在窗口中的最左侧索引作为有效子字符串前提条件。

程序:

for each index i in s:
  let c be the character s[i]
  
  if s[i] is the (k+1)th instance of c in the window:
    move the left bound of the window
    just past the leftmost instance of
    c in the window, removing all
    elements in the heap who's rightmost
    instance we passed while updating
    our window; and adding to the heap
    the rightmost instance of characters
    who's count has fallen below k
    as we move the left bound of
    the window. If the boundary moves
    past the left bound of valid minimal
    substrings, remove their boundaries
    from the queue, and their left bound
    from the hashmap.
    
  if s[i] is the kth instance of c:
    remove the previous instance of c
    from the heap.
    if the leftmost instance of c in the
    window is to the right of the heap
    root:
      if (root_index + 1) is the
      left bound of a valid minimal
      substring in our queue:
        we must be adding to the right
        of all of them, so add a new
        valid minimal substring, starting
        at the next index after the
        rightmost of those that ends
        at i
      otherwise:
        add a new valid minimal substring,
        starting at (root_index + 1)
        and ending at i
    
  otherwise:
    remove the previous instance of c
    in the heap and insert this one.
  

例如:

01234567
acbbaacc  k = 2

0 a  heap: (0 a)

1 c  heap: (1 c) <- (0 a)

2 b  heap: (2 b) <- (1 c) <- (0 a)

3 b  kth instance, remove (2 b)
     heap: (1 c) <- (0 a)
     leftmost instance of b is to the
     right of the heap root.
     check root + 1 = 2, which points
     to a new valid substring, add the
     substring to the queue
     queue: (2, 3)
     result: 1 + 0 = 1
     
4 a  kth instance, remove (0 a)
     heap: (1 c)
     queue: (2, 3)
     result: 1
     leftmost instance of a is left
     of the heap root so continue
     
5 a  (k+1)th instance, move left border
     of the window to index 1
     heap: (1 c)
     queue: (2, 3)
     result: 1
     
     (5 a) is now the kth instance of
     a and its leftmost instance is to
     the right of the heap root.
     check root + 1 = 2, which points
     to a valid substring in the queue,
     add new substring to queue
     heap: (1 c)
     queue: (2, 3) -> (4, 5)
     result: 1 + 1 + 1 = 3
     
6 c  kth instance, remove (1 c)
     heap: empty
     add new substring to queue
     queue: (1) -> (2, 3) -> (4, 5) -> (6)
     (for simplicity, the queue here
     is not labeled; labels may be needed
     for the split intervals)
     result: 3 + 1 + 0 = 4
     
7 c  (k+1)th instance, move left border
     of the window to index 2, update queue
     heap: empty
     queue: (2, 3) -> (4, 5)
     result: 4
     
     (7 c) is now the kth instance of c
     heap: empty
     add new substring to queue
     queue: (2, 3) -> (4, 5) -> (6, 7)
     result: 4 + 1 + 2 = 7

关于string - 给定每个字符计数的字符串的子串数量,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/63413388/

相关文章:

java - 是否有一种数据类型在 2 个字母中使用的存储空间比 String 少?

algorithm - WinRar 使用了哪种数据压缩算法?

algorithm - 功能增长中的大O

c# - 将字符串的每个部分在 c# 中的另一个字符串中的第一次出现作为子字符串

C# 子串 : Removing first three characters

linux - Bash: ${string:$i:1} 这是什么意思?

python - 在某些步骤后加入字符串

java - 如何将循环中的单独数字连接成一个整数

javascript - 使用正则表达式的 JavaScript split() 中/(a|b)/和/[ab]/之间的区别

algorithm - Union Find解决旅行商