我正在尝试查找给定文本中多个单词之间的最小距离。
假设我有一个字符串,例如:“a b cat dogs x y z n m pfox xdog b b cat”
求所有子串匹配的最小距离:(fox,dog,cat)
此文本中每个子字符串多次出现:
- 开头一个:
猫 - 4 狗 - 8 狐狸 - 24
距离 = 24 - 4 = 20
- 字符串末尾有一个:
狐狸 - 24 狗 - 30 猫 - 38
距离 = 38 - 24 = 14
最小距离 = 14
这是我想出的算法:
object MinKWindowSum {
def main(args: Array[String]): Unit = {
val document =
"""This Hello World is a huge text with thousands
Java of Hello words and Scala other lines and World and many other Hello docs
Words of World in many langs Hello and features
Java Scala AXVX TXZX ASDQWE OWEQ World asb eere qwerer
asdasd Scala Java Hello docs World KLKM NWQEW ZXCASD OPOOIK Scala ASDSA
"""
println(getMinWindowSize(document, "Hello World Scala"))
}
def getMinWindowSize(str:String, s:String): Int = {
/* creates a list of tuples List[(String, Int)] which contains each keyword and its
respective index found in the text sorted in order by index.
*/
val keywords = s.split(" ").toSet
val idxs = keywords.map(k => (k -> ("(?i)\\Q" + k + "\\E").r.findAllMatchIn(str).map(_.start)))
.map{ case (keyword,itr) => itr.map((keyword, _))}
.flatMap(identity).toSeq
.sortBy(_._2)
// Calculates the min window on the next step.
var min = Int.MaxValue
var minI, minJ = -1
// current window indexes and words
var currIdxs = ListBuffer[Int]()
var currWords = ListBuffer[String]()
for(idx <- idxs ) {
// check if word exists in window already
val idxOfWord = currWords.indexOf(idx._1)
if (!currWords.isEmpty && idxOfWord != -1) {
currWords = currWords.drop(idxOfWord + 1)
currIdxs = currIdxs.drop(idxOfWord + 1)
}
currWords += idx._1
currIdxs += idx._2
// if all keys are present check if it is new min window
if (keywords.size == currWords.length) {
val currMin = Math.abs(currIdxs.last - currIdxs.head)
if (min > currMin) {
min = currMin
minI = currIdxs.head
minJ = currIdxs.last
}
}
}
println("min = " + min + " ,i = " + minI + " j = " + minJ)
min
}
}
在上面的示例中,我们尝试找到“Hello World Scala”的所有匹配项之间的最小距离
在索引之间找到索引之间的最短窗口: i = 235,j = 257 --> 分钟 = 22
想知道是否有更好的方法以惯用的方式或在效率、可扩展性、可读性和简单性方面更好的方式来做到这一点?
最佳答案
这是一个稍微“更实用”的替代方案:
val document =
"""This Hello World is a huge text with thousands Java of Hello words and Scala other lines and World and many other Hello docs
Words of World in many langs Hello and features Java Scala AXVX TXZX ASDQWE OWEQ World
"""
val WORDS = Set("Hello", "World", "Scala")
var minDistance = document.trim
.split(" ")
.foldLeft(List[(String, Int)](), None: Option[Int], 0) {
case ((words, min, idx), word) if WORDS.contains(word) =>
val newWords = (word, idx) :: words.filter(_._1 != word)
if (newWords.map(_._1).toSet == WORDS) { // toSet on only 3 elmts
var idxes = newWords.map(_._2)
var dist = idxes.max - idxes.min
var newMin = min match {
case None => dist
case Some(min) if min < dist => min
case _ => dist
}
(newWords, Some(newMin), idx + word.length + 1)
}
else {
(newWords, min, idx + word.length + 1)
}
case ((words, min, idx), word) =>
(words, min, idx + word.length + 1)
}
._2
println(minDistance)
产生:
Some(38)
关于scala 文本中单词之间的绝对最小距离,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/50069285/