c# - Find out which phrases are used more than once in a string

Tags: c# algorithm text

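Counting word occurrences in a file is easy: use a dictionary to track which words appear most often. But given a text file, how can I find commonly used phrases, where a "phrase" is a group of two or more consecutive words?

For example, here is some sample text: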

Except oral wills, every will shall be in writing, but may be handwritten or typewritten. The will shall contain the testator's signature or by some other person in the testator's conscious presence and at the testator's express direction . The will shall be attested and subscribed in the conscious presence of the testator, by two or more competent witnesses, who saw the testator subscribe, or heard the testator acknowledge the testator's signature.

For purposes of this section, conscious presence means within the range of any of the testator's senses, excluding the sense of sight or sound that is sensed by telephonic, electronic, or other distant communication.

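How can I determine that the phrases "conscious presence" (3 times) and "testator's signature" (2 times) appear more than once, other than by brute-force searching every group of two or three words?

I will be writing this in C#, so C# code would be great, but I can't even settle on a good algorithm, so I will accept any code or even pseudocode that solves the problem.

Best answer

I thought I'd take a quick stab at this. I'm not sure this isn't the brute-force approach you were trying to avoid, but: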

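// Requires: using System; using System.Collections.Generic; using System.Linq;
// Place these three static methods inside a class (for example, Program).
static void Main(string[] args)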
{
    string txt = @"Except oral wills, every will shall be in writing, 
but may be handwritten or typewritten. The will shall contain the testator's 
signature or by some other person in the testator's conscious presence and at the
testator's express direction . The will shall be attested and subscribed in the
conscious presence of the testator, by two or more competent witnesses, who saw the
testator subscribe, or heard the testator acknowledge the testator's signature.

For purposes of this section, conscious presence means within the range of any of the
testator's senses, excluding the sense of sight or sound that is sensed by telephonic,
electronic, or other distant communication.";

    // Split the text on common separators; more could be added, or a regex used.
    string[] words = txt.Split(',', '.', ';', ' ', '\n', '\r');

    // Trim each string and discard any empty ones.
    words = words.Select(t => t.Trim()).Where(t => t.Length > 0).ToArray();

    const int MaxPhraseLength = 20;

    Dictionary<string, int> Counts = new Dictionary<string,int>();

    for (int phraseLen = MaxPhraseLength; phraseLen >= 2; phraseLen--)
    {
        // Stop early enough that every phrase has exactly phraseLen words.
        for (int i = 0; i + phraseLen <= words.Length; i++)
        {
            //get the phrase to match based on phraselen
            string[] phrase = GetPhrase(words, i, phraseLen);
            string sphrase = string.Join(" ", phrase);

            Console.WriteLine("Phrase : {0}", sphrase);

            int index = FindPhraseIndex(words, i+phrase.Length, phrase);

            if (index > -1)
            {
                Console.WriteLine("Phrase : {0} found at {1}", sphrase, index);

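                // The first hit means the phrase occurs here and at least once later,
                // so seed the count with 1 and let the increment below make it 2.
                if (!Counts.ContainsKey(sphrase))
                    Counts.Add(sphrase, 1);

                Counts[sphrase]++;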
            }
        }
    }

    foreach (var foo in Counts)
    {
        Console.WriteLine("[{0}] - {1}", foo.Key, foo.Value);
    }

    Console.ReadKey();
}

static string[] GetPhrase(string[] words, int startpos, int len)
{
    return words.Skip(startpos).Take(len).ToArray();
}

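static int FindPhraseIndex(string[] words, int startIndex, string[] matchWords)
{
    // Scan forward from startIndex for the next occurrence of matchWords.
    for (int i = startIndex; i < words.Length; i++)
    {
        int j;

        for (j = 0; j < matchWords.Length && (i + j) < words.Length; j++)
            if (matchWords[j] != words[i + j])
                break;

        // All words matched: return where the match starts, not where the search began.
        if (j == matchWords.Length)
            return i;
    }

    return -1;
}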
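
The question notes that a dictionary makes word counting easy, and the same idea can be stretched to phrases. The following is a minimal sketch of that alternative, not part of the original answer: it slides a window of each length across the word list and tallies every window in a dictionary, so the text is scanned once per phrase length instead of being re-searched for every starting position. The names `PhraseCounter` and `CountRepeatedPhrases`, and the default limits of 2 to 5 words, are assumptions chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PhraseCounter
{
    // Hypothetical helper: counts every run of minWords..maxWords consecutive
    // words and returns only the runs that occur more than once.
    public static Dictionary<string, int> CountRepeatedPhrases(
        string text, int minWords = 2, int maxWords = 5)
    {
        // Same separator set as the answer above; empty entries are dropped.
        char[] separators = { ',', '.', ';', ' ', '\n', '\r' };
        string[] words = text.Split(separators, StringSplitOptions.RemoveEmptyEntries);

        var counts = new Dictionary<string, int>();

        for (int len = minWords; len <= maxWords; len++)
        {
            // Slide a window of 'len' words across the text.
            for (int i = 0; i + len <= words.Length; i++)
            {
                string phrase = string.Join(" ", words, i, len);

                // 'seen' is 0 when the phrase is not in the dictionary yet.
                counts.TryGetValue(phrase, out int seen);
                counts[phrase] = seen + 1;
            }
        }

        // Keep only phrases that appear at least twice.
        return counts.Where(kv => kv.Value > 1)
                     .ToDictionary(kv => kv.Key, kv => kv.Value);
    }

    public static void Main()
    {
        string sample = "the cat sat on the mat. the cat sat down; the mat stayed";

        foreach (var kv in CountRepeatedPhrases(sample).OrderByDescending(kv => kv.Value))
        {
            Console.WriteLine("[{0}] - {1}", kv.Key, kv.Value);
        }
    }
}
```

Two differences from the answer's code are worth keeping in mind. This sketch counts overlapping occurrences, whereas the answer only searches for matches that start after the current phrase ends. And because punctuation is discarded during splitting, both versions can report phrases that span a sentence boundary. Sub-phrases of a longer repeated phrase are also reported separately, so some post-filtering may be wanted.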

This Q&A, "c# - Find out which phrases are used more than once in a string", is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/18641986/
