c# - Highlight words in a regex match

Tags: c# regex

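I am trying to use Regex to search for some text in a paragraph. I want the results to include X words before and after each match, with highlighting added around every occurrence of the search text.

For example: consider the following paragraph. Each result should have at least 10 characters before and after the match, without cutting off any words. The search term is "dog".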

The Dog is a pet animal. It is one of the most obedient animals. There are many kinds of dogs in the world. Some of the are very friendly while some of them a dangerous. Dogs are of different color like black, red, white and brown. Some old them have slippery shiny skin and some have rough skin. Dogs are carnivorous animals. They like eating meat. They have four legs, two ears and a tail. Dogs are trained to perform different tasks. They protect us from thieves b) guarding our house. They are loving animals. A dog is called man's best friend. They are used by the police to find hidden things. They are one of the most useful animals in the world. Doggonit!

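The result I want is an array like the following:

  • The <strong>Dog</strong> is a pet
  • many kinds of <strong>dog</strong>s in the world
  • dangerous. <strong>Dog</strong>s are of different
  • rough skin. <strong>Dog</strong>s are carnivorous
  • and a tail. <strong>Dog</strong>s are trained
  • animals. A <strong>dog</strong>
  • world. <strong>Dog</strong>gonit!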

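What I have:

I searched around and found the following regex, which returns the desired results perfectly but adds no extra formatting. I created a couple of methods to handle each piece of functionality: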

private List<List<string>> Search(string text, string searchTerm, int sizeOfResult, bool searchEntireWord) {
    var result = new List<List<string>>();
    // Each space-separated word is searched for separately.
    var searchTerms = searchTerm.Split(' ');
    foreach (var word in searchTerms) {
        var searchResults = ExtractParagraph(text, word, sizeOfResult, searchEntireWord);
        result.Add(searchResults);
        if (searchResults.Count > 0) {
            foreach (var searchResult in searchResults) {
                Response.Write("<strong>Result:</strong> " + searchResult + "<br>");
            }
        }
    }
    return result;
}

private List<string> ExtractParagraph(string text, string searchTerm, int sizeOfResult, bool searchEntireWord) {
    var result = new List<string>();
    // Whole-word mode wraps the term in word boundaries.
    searchTerm = searchEntireWord ? @"\b" + searchTerm + @"\b" : searchTerm;
    //var expression = @"((^.{0,30}|\w*.{30})\b" + searchTerm + @"\b(.{30}\w*|.{0,30}$))";
    var expression = @"((^.{0," + sizeOfResult + @"}|\w*.{" + sizeOfResult + @"})" + searchTerm + @"(.{" + sizeOfResult + @"}\w*|.{0," + sizeOfResult + @"}$))";
    var wordMatch = new Regex(expression, RegexOptions.IgnoreCase | RegexOptions.Singleline);

    foreach (Match m in wordMatch.Matches(text)) {
        result.Add(m.Value);
    }
    return result;
}

I can call it like this:

var text = "The Dog is a pet animal. It is one of...";
var searchResults = Search(text, "dog", 10, false);
if (searchResults.Count > 0) {
    foreach (var searchResult in searchResults) {
        foreach (var result in searchResult) {
            Response.Write("<strong>Result:</strong> " + result + "<br>");
        }
    }
}

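I don't yet know what the results look like, or how to handle them, when the term occurs more than once within 10 characters, e.g. if a sentence contains "A dog is a dog of course!". I figure I can deal with that later.

Tests: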

var searchResults = Search(text, "dog", 0, false); // should include only the matched word
var searchResults = Search(text, "dog", 1, false); // should include the matched word and only one word preceding and following the matched word (if any)
var searchResults = Search(text, "dog", 10, false); // should include the matched word and up to 10 characters (but not cutting off words in the middle) preceding and following it (if any)
var searchResults = Search(text, "dog", 50, false); // should include the matched word and up to 50 characters (but not cutting off words in the middle) preceding and following it (if any)

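The problem:

The functions I created allow the search to find searchTerm either as a whole word or as part of a word.

What I do is a simple Replace(word, "<strong>" + word + "</strong>") when displaying the results. This works great if I am searching for part of a word. But when searching for a whole word, if a result also contains searchTerm as part of another word, that part of the word gets highlighted too.

For example: if I search for "dog" and a result is "All dogs go to dog heaven", it is highlighted as "All <strong>dog</strong>s go to <strong>dog</strong> heaven". But I want "All dogs go to <strong>dog</strong> heaven".

The question:

How can I get the matched word wrapped in some HTML, such as <strong> or anything else I want?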

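Best answer

Your solution needs to do two main things: 1) extract the matches, i.e. the keywords/phrases plus the extra left and right context around them, and 2) wrap the search term with tags.

The extraction regex (e.g. with 10 characters on the left and on the right) is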

(?si)(?<!\S).{0,10}(?<!\S)\S*dog\S*(?!\S).{0,10}(?!\S)

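See the regex demo.

Details

  • (?si) - enables the Singleline and IgnoreCase modifiers (. matches any character, including newlines, and the pattern is case-insensitive)
  • (?<!\S) - a left-hand whitespace boundary
  • .{0,10} - 0 to 10 characters
  • (?<!\S) - a left-hand whitespace boundary
  • \S*dog\S* - dog with any 0+ non-whitespace characters around it (note: if searchEntireWord is true, you need to remove \S* from this part of the pattern)
  • (?!\S) - a right-hand whitespace boundary
  • .{0,10} - 0 to 10 characters
  • (?!\S) - a right-hand whitespace boundary.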
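
As a quick check of the pattern described above, here is a minimal sketch that runs the literal 10-character version against a short sentence. The class name ExtractionDemo and the sample sentence are illustrative assumptions, not part of the original answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class ExtractionDemo
{
    // The literal pattern from above: up to 10 characters of context per side,
    // with whitespace boundaries so that no word is cut in the middle.
    private const string Pattern =
        @"(?si)(?<!\S).{0,10}(?<!\S)\S*dog\S*(?!\S).{0,10}(?!\S)";

    // Returns the raw snippets (no highlighting yet).
    public static string[] Extract(string text)
    {
        return Regex.Matches(text, Pattern)
            .Cast<Match>()
            .Select(m => m.Value)
            .ToArray();
    }
}
```

For "All dogs go to dog heaven" this yields the single snippet "All dogs go to dog". The match is anchored on "dogs", and the right-hand context " go to dog" is exactly 10 characters and stops at a whitespace boundary, so the second "dog" lands inside the same snippet rather than producing a second match.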

In C#, it can be defined as

var expression = string.Format(@"(?si)(?<!\S).{{0,{0}}}(?<!\S)\S*{1}\S*(?!\S).{{0,{0}}}(?!\S)", sizeOfResult, Regex.Escape(searchTerm)); 
if (searchEntireWord) { 
    expression = string.Format(@"(?si)(?<!\S).{{0,{0}}}(?<!\S){1}(?!\S).{{0,{0}}}(?!\S)", sizeOfResult, Regex.Escape(searchTerm)); 
} 

Note that {{ is a literal { and }} is a literal } in a format string.
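
To make the brace escaping concrete, here is a small sketch that builds the pattern through string.Format and returns the resulting string. The PatternBuilder class and its Build method are hypothetical names introduced only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternBuilder
{
    // Hypothetical helper (not in the original answer): builds the extraction
    // pattern for a given search term and context size.
    public static string Build(string searchTerm, int sizeOfResult, bool searchEntireWord)
    {
        // {{ and }} produce literal braces; {0} and {1} are format placeholders,
        // so ".{{0,{0}}}" becomes ".{0,10}" when sizeOfResult is 10.
        var template = searchEntireWord
            ? @"(?si)(?<!\S).{{0,{0}}}(?<!\S){1}(?!\S).{{0,{0}}}(?!\S)"
            : @"(?si)(?<!\S).{{0,{0}}}(?<!\S)\S*{1}\S*(?!\S).{{0,{0}}}(?!\S)";

        // Regex.Escape keeps any regex metacharacters in the term literal.
        return string.Format(template, sizeOfResult, Regex.Escape(searchTerm));
    }
}
```

Build("dog", 10, false) returns the literal pattern shown earlier. A term such as "a.b" is escaped to a\.b, so the dot only matches a real dot.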

The second regex, which wraps the key term in strong tags, is much simpler:

Regex.Replace(x.Value, 
            searchEntireWord ? 
                string.Format(@"(?i)(?<!\S){0}(?!\S)", Regex.Escape(searchTerm)) : 
                string.Format(@"(?i){0}", Regex.Escape(searchTerm)), 
            "<strong>$&</strong>")

Note that $& in the replacement pattern refers to the whole match value.
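
The wrapping step can be tried on its own with the sentence from the question. This is a minimal sketch; the Highlighter class and Wrap method are illustrative names, not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Highlighter
{
    // Wraps each occurrence of searchTerm in <strong> tags.
    // In whole-word mode the term must sit between whitespace boundaries.
    public static string Wrap(string snippet, string searchTerm, bool searchEntireWord)
    {
        var pattern = searchEntireWord
            ? string.Format(@"(?i)(?<!\S){0}(?!\S)", Regex.Escape(searchTerm))
            : string.Format(@"(?i){0}", Regex.Escape(searchTerm));

        // $& re-inserts the matched text, so the original casing is preserved.
        return Regex.Replace(snippet, pattern, "<strong>$&</strong>");
    }
}
```

With searchEntireWord set to true, "All dogs go to dog heaven" becomes "All dogs go to <strong>dog</strong> heaven". With it set to false, both occurrences are wrapped. Because the boundaries are whitespace-based, a term directly followed by punctuation, such as "dog.", is not treated as a whole word.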

C# code:

using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static List<string> ExtractTexts(string text, string searchTerm, int sizeOfResult, bool searchEntireWord)
{
    // Partial-word mode: \S* on both sides expands the match to the whole token containing the term.
    var expression = string.Format(@"(?si)(?<!\S).{{0,{0}}}(?<!\S)\S*{1}\S*(?!\S).{{0,{0}}}(?!\S)", sizeOfResult, Regex.Escape(searchTerm));
    if (searchEntireWord) {
        // Whole-word mode: the term itself must sit between whitespace boundaries.
        expression = string.Format(@"(?si)(?<!\S).{{0,{0}}}(?<!\S){1}(?!\S).{{0,{0}}}(?!\S)", sizeOfResult, Regex.Escape(searchTerm));
    }
    return Regex.Matches(text, expression)
        .Cast<Match>()
        // Wrap every occurrence of the term inside each extracted snippet.
        .Select(x => Regex.Replace(x.Value,
            searchEntireWord ?
                string.Format(@"(?i)(?<!\S){0}(?!\S)", Regex.Escape(searchTerm)) :
                string.Format(@"(?i){0}", Regex.Escape(searchTerm)),
            "<strong>$&</strong>"))
        .ToList();
}

Sample usage (see demo):

var text = "The Dog is a real-pet animal. There's an undogging dog that only undogs non-dogs. It is one of the most obedient animals. There are many kinds of dogs in the world. Some of the are very friendly while some of them a dangerous. Dogs are of different color like black, red, white and brown. Some old them have slippery shiny skin and some have rough skin. Dogs are carnivorous animals. They like eating meat. They have four legs, two ears and a tail. Dogs are trained to perform different tasks. They protect us from thieves b) guarding our house. They are loving animals. A dog is called man's best friend. They are used by the police to find hidden things. They are one of the most useful animals in the world. Doggonit!";
var searchTerm = "dog";
var searchEntireWord = false;
Console.WriteLine("======= 10 ========");
var results = ExtractTexts(text, searchTerm, 10, searchEntireWord);
foreach (var result in results)
    Console.WriteLine(result);

Output (the demo also prints the generated pattern first):

======= 10 ========
(?si)(?<!\S).{0,10}(?<!\S)\S*dog\S*(?!\S).{0,10}(?!\S)
The <strong>Dog</strong> is a
an un<strong>dog</strong>ging <strong>dog</strong> that
only un<strong>dog</strong>s non-<strong>dog</strong>s.
kinds of <strong>dog</strong>s in the
<strong>Dog</strong>s are of
skin. <strong>Dog</strong>s are
a tail. <strong>Dog</strong>s are
A <strong>dog</strong> is called
world. <strong>Dog</strong>gonit!

Another example:

Console.WriteLine("======= 15 ========");
results = ExtractTexts(text, searchTerm, 15, searchEntireWord);
foreach (var result in results)
    Console.WriteLine(result);

Output:

======= 15 ========
(?si)(?<!\S).{0,15}(?<!\S)\S*dog\S*(?!\S).{0,15}(?!\S)
The <strong>Dog</strong> is a real-pet
There's an un<strong>dog</strong>ging <strong>dog</strong> that only
un<strong>dog</strong>s non-<strong>dog</strong>s. It is one of
many kinds of <strong>dog</strong>s in the world.
a dangerous. <strong>Dog</strong>s are of
rough skin. <strong>Dog</strong>s are
and a tail. <strong>Dog</strong>s are trained to
animals. A <strong>dog</strong> is called
in the world. <strong>Dog</strong>gonit!
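
The examples above only exercise partial-word mode on the long paragraph. As a further sketch, the same method can be run on two short sentences: the question's "A dog is a dog of course!" case, where the term occurs twice within the context window, and a whole-word search. The wrapping class SnippetSearch is an assumed name; the method body is copied from the answer's code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class SnippetSearch
{
    // Same logic as ExtractTexts above, placed in a class so it compiles on its own.
    public static List<string> ExtractTexts(string text, string searchTerm, int sizeOfResult, bool searchEntireWord)
    {
        var term = Regex.Escape(searchTerm);

        // Extraction pattern: context, the term (optionally expanded to its token), context.
        var expression = searchEntireWord
            ? string.Format(@"(?si)(?<!\S).{{0,{0}}}(?<!\S){1}(?!\S).{{0,{0}}}(?!\S)", sizeOfResult, term)
            : string.Format(@"(?si)(?<!\S).{{0,{0}}}(?<!\S)\S*{1}\S*(?!\S).{{0,{0}}}(?!\S)", sizeOfResult, term);

        // Highlight pattern: applied to each extracted snippet.
        var highlight = searchEntireWord
            ? string.Format(@"(?i)(?<!\S){0}(?!\S)", term)
            : string.Format(@"(?i){0}", term);

        return Regex.Matches(text, expression)
            .Cast<Match>()
            .Select(m => Regex.Replace(m.Value, highlight, "<strong>$&</strong>"))
            .ToList();
    }
}
```

ExtractTexts("A dog is a dog of course!", "dog", 10, false) returns one snippet, "A <strong>dog</strong> is a <strong>dog</strong>": the second occurrence falls inside the right-hand context of the first, and the highlighting step wraps both. ExtractTexts("All dogs go to dog heaven", "dog", 10, true) returns "go to <strong>dog</strong> heaven", since "dogs" is skipped in whole-word mode.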

Regarding "c# - Highlight words in a regex match", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/53051733/
