scala 文本中单词之间的绝对最小距离

标签 scala functional-programming match distance min

我正在尝试查找给定文本中多个单词之间的最小距离。

假设我有一个字符串,例如:“a b cat dogs x y z n m pfox xdog b b cat

求所有子串匹配的最小距离:(fox,dog,cat)

此文本中每个子字符串多次出现:

  • 开头一个:

猫 - 4 狗 - 8 狐狸 - 24

距离 = 24 - 4 = 20

  • 字符串末尾有一个:

狐狸 - 24 狗 - 30 猫 - 38

距离 = 38 - 24 = 14

最小距离 = 14

这是我想出的算法:

 object MinKWindowSum {
  def main(args: Array[String]): Unit = {   

    val document =
      """This Hello World is a huge text with thousands
Java of Hello words and Scala other lines and World and many other Hello docs
Words of World in many langs Hello and features
Java Scala AXVX TXZX ASDQWE OWEQ World asb eere qwerer
asdasd Scala Java Hello docs World KLKM NWQEW ZXCASD OPOOIK Scala ASDSA
"""
    println(getMinWindowSize(document, "Hello World Scala"))
  }   

  def getMinWindowSize(str:String, s:String): Int = {

    /* creates a list of tuples List[(String, Int)] which contains each keyword and its
    respective index found in the text sorted in order by index.
    */
    val keywords = s.split(" ").toSet
    val idxs = keywords.map(k => (k -> ("(?i)\\Q" + k + "\\E").r.findAllMatchIn(str).map(_.start)))
      .map{ case (keyword,itr) => itr.map((keyword, _))}
      .flatMap(identity).toSeq
      .sortBy(_._2)

    // Calculates the min window on the next step.
    var min = Int.MaxValue
    var minI, minJ = -1

    // current window indexes and words
    var currIdxs = ListBuffer[Int]()
    var currWords = ListBuffer[String]()

    for(idx <- idxs ) {

      // check if word exists in window already
      val idxOfWord = currWords.indexOf(idx._1)

      if (!currWords.isEmpty && idxOfWord != -1) {
        currWords = currWords.drop(idxOfWord + 1)
        currIdxs = currIdxs.drop(idxOfWord + 1)
      }
      currWords += idx._1
      currIdxs += idx._2

      // if all keys are present check if it is new min window
      if (keywords.size == currWords.length) {
        val currMin = Math.abs(currIdxs.last - currIdxs.head)
        if (min > currMin) {
          min = currMin
          minI = currIdxs.head
          minJ = currIdxs.last
        }
      }
    }

    println("min = " + min + " ,i = " + minI + " j = " + minJ)
    min
  }

}

在上面的示例中,我们尝试找到“Hello World Scala”的所有匹配项之间的最小距离

在索引之间找到索引之间的最短窗口: i = 235,j = 257 --> 分钟 = 22

想知道是否有更好的方法以惯用的方式或在效率、可扩展性、可读性和简单性方面更好的方式来做到这一点?

最佳答案

这是一个稍微“更实用”的替代方案:

val document =
  """This Hello World is a huge text with thousands Java of Hello words and Scala other lines and World and many other Hello docs
     Words of World in many langs Hello and features Java Scala AXVX TXZX ASDQWE OWEQ World
  """
val WORDS = Set("Hello", "World", "Scala")

var minDistance = document.trim
  .split(" ")
  .foldLeft(List[(String, Int)](), None: Option[Int], 0) {
    case ((words, min, idx), word) if WORDS.contains(word) =>
      val newWords = (word, idx) :: words.filter(_._1 != word)
      if (newWords.map(_._1).toSet == WORDS) { // toSet on only 3 elmts
        var idxes = newWords.map(_._2)
        var dist = idxes.max - idxes.min
        var newMin = min match {
          case None                    => dist
          case Some(min) if min < dist => min
          case _                       => dist
        }
        (newWords, Some(newMin), idx + word.length + 1)
      }
      else {
        (newWords, min, idx + word.length + 1)
      }
    case ((words, min, idx), word) =>
      (words, min, idx + word.length + 1)
  }
  ._2

println(minDistance)

产生:

Some(38)

关于scala 文本中单词之间的绝对最小距离,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/50069285/

相关文章:

scala - 如何将数据框拆分为具有相同列值的数据框?

java - Scala actor 作为参数传递给其他 actor

r - R中两个数据表之间按行计算匹配元素

java - 如何在 Scala 中期望 String* 的函数中使用 Array[String]

java - 如何将实现 java.lang.Comparable 的类转换为实现 Scala.Ordered?

clojure - 什么时候使用 Clojure 的闭包?

python - python中Haskell scanl的等价物

c++ - C++ 中的通用函数包装器是否可行?

python - 使用换行符查找字符串的其余部分 Python

scala - 使用预期类型的​​子参数实现 trait 方法