algorithm - 字符差异/注释算法

标签 algorithm diff

我有一组代表文档历史的字符串。每个字符串都是整个文档 - 尚未进行任何差异分析。

我需要一个相对高效的算法来允许我用它们来自的版本注释文档的子字符串。

例如,如果文档历史是这样的:

Rev1: The quiet fox
Rev2: The quiet brown fox
Rev3: The quick brown fox

算法会给出:

The quick brown fox
1111111331222222111

即修订一加“qui”,修订三加“ck”,修订一加“”,修订二加“棕”,最后修订一加“fox”。

最佳答案

我有一个类库可以很容易地做到这一点,但我不知道它在进行大量或多次此类修订时在性能方面的表现如何。

图书馆在这里:DiffLib on CodePlex (您也可以通过 NuGet 安装它。)

问题中示例的脚本在这里(如果添加对 DiffLib 程序集的引用,则可以在 LINQPad 中运行它):

void Main()
{
    var revs = new string[]
    {
        "The quiet fox",
        "The quiet brown fox",
        "The quick brown fox",
        "The quick brown fox.",
        "The quick brown fox jumped over the lazy dog.",
        "The quick brown fox jumped over the lazy cat.",
        "The Quick Brown Fox jumped over the Lazy Cat.",
    };

    string current = revs[0];
    List<int> owner = new List<int>();
    foreach (char c in current)
        owner.Add(1); // owner 1 owns entire string

    Action<int> dumpRev = delegate(int rev)
    {
        Debug.WriteLine("rev " + rev);
        Debug.WriteLine(current);
        Debug.WriteLine(new string(owner.Select(i => (char)(48 + i)).ToArray()));
        Debug.WriteLine("");
    };
    dumpRev(0);

    for (int index = 1; index < revs.Length; index++)
    {
        int ownerId = index + 1;
        var diff = new DiffLib.Diff<char>(current, revs[index]).ToArray();
        int position = 0;
        foreach (var part in diff)
        {
            if (part.Equal)
                position += part.Length1;
            else
            {
                // get rid of old owner for the part that was
                // removed or replaced
                for (int index2 = 0; index2 < part.Length1; index2++)
                    owner.RemoveAt(position);

                // insert new owner for the part that was
                // added or did replace the old text
                for (int index2 = 0; index2 < part.Length2; index2++)
                    owner.Insert(position, ownerId);
                position += part.Length2;
            }
        }
        current = revs[index];
        dumpRev(index);
    }
}

输出:

rev 0
The quiet fox
1111111111111

rev 1
The quiet brown fox
1111111111222222111

rev 2
The quick brown fox
1111111331222222111

rev 3
The quick brown fox.
11111113312222221114

rev 4
The quick brown fox jumped over the lazy dog.
111111133122222211155555555555555555555555554

rev 5
The quick brown fox jumped over the lazy cat.
111111133122222211155555555555555555555556664

rev 6
The Quick Brown Fox jumped over the Lazy Cat.
111171133172222271155555555555555555755557664

关于algorithm - 字符差异/注释算法,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/5079679/

相关文章:

svn - Tortoise SVN 不同结构版本之间的差异

diff - 使用差异工具在一行中进行多次更改?

列表中第二大值的算法考试示例

algorithm - T9类型字典背后的数据结构

swift - 从 Reduce 获取任意类型

algorithm - 使用逻辑运算符进行 TRIE 搜索

linux - 如何制作和应用SVN补丁?

java - Java 中的 Git Diff 命令

visual-studio - 如何从 Visual Studio 中为 codeplex 项目生成差异?

c# - 两把 key 比一把好