algorithm - Ranking text strings by the position of terms

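I basically need some math to rank short input sentences based on the following metrics:

1) Distance of the terms from the beginning of the sentence (note: relative term distance, not edit distance!). For example, when searching for "a", the sentence "a b" should rank higher than "b a", because a is closer to the beginning of the sentence.

2) Distance between the terms. For example, when searching for "a" and "b", "ccc a b" should rank higher than "a ccc b", because a and b are closer to each other.

3) Ranking based on term order. For example, when searching for a AND b, "a b" should rank higher than "b a", because that is the correct order. Nevertheless, "b a" should still be in the result set, so it has to be ranked as well, just with a lower weight.

4) The words themselves are unweighted. This is the main difference from the commonly used approaches that I can easily find information about. In my case all terms have the same weight, regardless of how often they occur in the document or anything else.

I have done my research but found nothing that matches. Do you know of a ranking algorithm that fits, or at least comes close?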

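Best answer

Calculate the position of each search term in the subject string.
Calculate the average of those positions, and the average position of the terms in the search term list.
Calculate the absolute difference between the average position in the subject string and the average position in the search term list.
Calculate, for each term, the absolute difference between its position relative to the average in the subject string and its position relative to the average in the search term list, and add these to the result.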
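
Condensed into a single expression, the four steps look roughly like this. The symbols are shorthand introduced here, not taken from the answer: $n$ is the number of search terms, $p_i$ is the zero-based word index of term $i$ in the subject string, and $\bar{p}$ and $\bar{q}$ are the average positions in the subject string and in the search term list.

```latex
\bar{p} = \frac{1}{n}\sum_{i=0}^{n-1} p_i, \qquad
\bar{q} = \frac{n-1}{2}, \qquad
\mathrm{rank} = \lvert \bar{p} - \bar{q} \rvert
  + \sum_{i=0}^{n-1} \bigl\lvert (i - \bar{q}) - (p_i - \bar{p}) \bigr\rvert
```

For the subject "a ccc b" and the terms "a", "b", this gives $p = (0, 2)$, $\bar{p} = 1$ and $\bar{q} = 0.5$, so the rank is $0.5 + 0.5 + 0.5 = 1.5$.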

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Lower return values mean a better match; 0 is a perfect match.
// Assumes terms contains at least one entry.
decimal Rank(string subject, IList<string> terms)
{
    // Isolate all the words in the subject.
    var words = Regex.Matches(subject, @"\w+")
        .Cast<Match>()
        .Select(m => m.Value.ToLowerInvariant())
        .ToList();

    // Find the word index of each term; a missing term gives the worst rank.
    var positions = new List<int>();
    var sumPositions = 0;
    foreach (var term in terms)
    {
        int pos = words.IndexOf(term.ToLowerInvariant());
        if (pos < 0) return decimal.MaxValue;
        positions.Add(pos);
        sumPositions += pos;
    }

    // Calculate the difference in average positions.
    decimal averageSubject = (decimal) sumPositions / terms.Count;
    decimal averageTerms = (terms.Count - 1) / 2m; // average(0..n-1)
    decimal rank = Math.Abs(averageSubject - averageTerms);

    // Add how far each term's offset from the average deviates from the expected offset.
    for (int i = 0; i < terms.Count; i++)
    {
        decimal relativePos1 = positions[i] - averageSubject;
        decimal relativePos2 = i - averageTerms;
        rank += Math.Abs(relativePos2 - relativePos1);
    }

    return rank;
}

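I use lower values to indicate a better match, because measuring the distance from a perfect match is easier than measuring a score for each match.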
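
Since lower values are better, a result list can be built by sorting in ascending order of rank. The following is only a minimal sketch of that idea: TermRanker and OrderByRank are hypothetical names introduced here, Rank restates the method above inside a static class so that the snippet compiles on its own, and dropping subjects that lack a term (rank decimal.MaxValue) is an assumption rather than something the answer prescribes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TermRanker
{
    // Restates the Rank method from the answer; lower values mean a better match.
    // Assumes terms contains at least one entry.
    public static decimal Rank(string subject, IList<string> terms)
    {
        var words = Regex.Matches(subject, @"\w+")
            .Cast<Match>()
            .Select(m => m.Value.ToLowerInvariant())
            .ToList();

        var positions = new List<int>();
        foreach (var term in terms)
        {
            int pos = words.IndexOf(term.ToLowerInvariant());
            if (pos < 0) return decimal.MaxValue; // term missing: worst possible rank
            positions.Add(pos);
        }

        decimal averageSubject = (decimal)positions.Sum() / terms.Count;
        decimal averageTerms = (terms.Count - 1) / 2m; // average(0..n-1)
        decimal rank = Math.Abs(averageSubject - averageTerms);

        for (int i = 0; i < terms.Count; i++)
        {
            rank += Math.Abs((i - averageTerms) - (positions[i] - averageSubject));
        }
        return rank;
    }

    // Hypothetical helper: keeps only subjects that contain every term,
    // ordered from best match (lowest rank) to worst.
    public static List<string> OrderByRank(IEnumerable<string> subjects, IList<string> terms)
    {
        return subjects
            .Select(s => new { Subject = s, Score = Rank(s, terms) })
            .Where(x => x.Score != decimal.MaxValue)
            .OrderBy(x => x.Score)
            .Select(x => x.Subject)
            .ToList();
    }
}
```

Calling TermRanker.OrderByRank(new[] { "b a", "a ccc b", "ccc a b", "a b", "c d" }, new[] { "a", "b" }) returns "a b", "ccc a b", "a ccc b", "b a" in that order; "c d" is dropped because it contains neither term.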

Examples

Subject     Terms       Rank
"a b"       "a"         0.0
"b a"       "a"         1.0
"ccc a b"   "a", "b"    1.0
"a ccc b"   "a", "b"    1.5
"a b"       "a", "b"    0.0
"b a"       "a", "b"    2.0

Regarding "algorithm - Ranking text strings by the position of terms", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/17163829/
