regex - Get the line number of a Text.RegularExpressions.Regex match

Tags: regex powershell logging regex-group

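I use PowerShell to parse a directory of log files and extract all XML entries from them. That works fine. However, since a log file can contain many of these XML snippets, I would also like to put the line number of each match into the name of the XML file I write, so that I can open the log file and jump to that specific line for root-cause analysis.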

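I believe there is a field "Index", which I take to be a character count and which should lead me to the line number. However, "Index" seems to include something that Measure-Object -Character does not, because the value of Index is larger than the size reported by Measure-Object -Character. For example, $m.Groups[0].Captures[0].Index is 9963166, but the largest Measure-Object -Character value for a whole file in the log directory is 9838833, so I think Index also counts the newlines.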

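So the question is probably:
If a match gives me "Index" as a property, how do I know how many newlines that "Index" spans? Do I have to take the first "Index" characters from the file, count the newlines in them, and derive the line number from that? Probably.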

$tag = 'data_json'
$input_dir = $absolute_root_dir + $specific_dir
$output_dir = $input_dir  + 'ParsedDataFiles\'
$OFS = "`r`n"
$nice_specific_dir = $specific_dir.Replace('\','_')
$nice_specific_dir = $nice_specific_dir.Replace(':','_')
$regex = New-Object Text.RegularExpressions.Regex "<$tag>(.+?)<\/$tag>", ('singleline', 'multiline')
New-Item -ItemType Directory -Force -Path $output_dir
Get-ChildItem -Path $input_dir -Name -File | % {   
    $output_file = $output_dir + $nice_specific_dir + $_ + '.'
    $content = Get-Content ($input_dir + $_)
    $i = 0
    foreach($m in $regex.Matches($content)) {        
        $outputfile_xml = $output_file + $i++ + '.xml'
        $outputfile_txt = $output_file + $i++ + '.txt'
        $xml = [xml] ("<" + $tag+ ">" + $m.Groups[1].Value + "</" + $tag + ">")
        $xml.Save($outputfile_xml)
        $j = 0
        $xml.data_json.Messages.source.item | % { $_.SortOrder + ", " + $_.StartOn + ", " + $_.EndOn + ", " + $_.Id } | sort | %  { 
            (($j++).ToString() + ", " + $_ )   | Out-File $outputfile_txt -Append
        }
    }
}

Best answer

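Note: If your regex matches are guaranteed not to span multiple lines, that is, if the matched text is guaranteed to sit on a single line, consider the simpler Select-String-based solution shown in js2010's answer. In general, though, a method/expression-based solution such as the one in this answer will perform better.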

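Your first problem is that you are using Get-Content without -Raw, which reads the input file as an array of lines rather than as a single, multiline string.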

When you pass this array to $regex.Matches(), PowerShell stringifies it by joining the elements with spaces (by default).

Therefore, read your input file with Get-Content -Raw, which ensures that it is read as a single, multiline string with the newlines intact:

# Read entire file as single string
$content = Get-Content -Raw ($input_dir + $_)
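
To make the difference concrete, here is a minimal sketch that reads the same file both ways. The temp-file name and its three sample lines are assumptions made up for illustration; they are not part of the original question.

```powershell
# Hypothetical sample file in the temp directory (not from the original post).
$path = Join-Path ([IO.Path]::GetTempPath()) 'get-content-raw-demo.log'
Set-Content -Path $path -Value 'first', 'second', 'third'

$lines = Get-Content $path        # array of 3 strings, newlines stripped
$raw   = Get-Content -Raw $path   # one string, newlines preserved

# Casting the array to a string joins the elements with $OFS (a space if $OFS is unset).
$joined = [string] $lines

# Count the newlines that survive in each form.
$joinedNewlines = [regex]::Matches($joined, '\n').Count
$rawNewlines    = [regex]::Matches($raw, '\n').Count

"Array elements: $($lines.Count); newlines after stringification: $joinedNewlines"
"Newlines in the -Raw string: $rawNewlines"

Remove-Item $path
```

Because the stringified array no longer contains the original newlines, any character index obtained from it cannot be mapped back to a line number, which is why the -Raw form is needed here.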

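After matching against the multiline string, you can infer the line number by counting the lines in the substring that extends up to the character index at which each match was found, using .Substring() and Measure-Object -Line: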

Here is a simplified, self-contained example (see the bottom section if you also want to determine the column number):
# Sample multi-line input.
# Note: The <title> elements are at lines 3 and 6, respectively.
$content = @'
<catalog>
  <book id="bk101">
    <title>De Profundis</title>
  </book>
  <book id="bk102">
    <title>Pygmalion</title>
  </book>
</catalog>
'@

# Regex that finds all <title> elements.
# Inline option ('(?...)') 's' makes '.' match newlines too
$regex = [regex] '(?s)<title>.+?</title>'

foreach ($m in $regex.Matches($content)) {
  $lineNumber = ($content.Substring(0, $m.Index + 1) | Measure-Object -Line).Lines
  "Found '$($m.Value)' at index $($m.Index), line $lineNumber"
}

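Note the + 1 in $m.Index + 1, which is needed to ensure that the substring does not end in a newline, because Measure-Object -Line ignores such a trailing newline. By including at least one additional (non-newline) character, namely the < of the matched element, the line count is always correct, even if the matched element starts in the first column.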

The above yields:

Found '<title>De Profundis</title>' at index 34, line 3
Found '<title>Pygmalion</title>' at index 96, line 6
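
The trailing-newline behavior that motivates the + 1 can be checked in isolation. The following is a small sketch with made-up sample strings; based on the behavior described above, the first count is expected to be 2 and the second 3.

```powershell
# Two hypothetical prefixes of a document whose match starts at column 1 of line 3.
$withTrailingNewline = "line 1`nline 2`n"     # what Substring(0, $m.Index) would return
$withExtraCharacter  = "line 1`nline 2`n<"    # what Substring(0, $m.Index + 1) would return

$countTrailing = ($withTrailingNewline | Measure-Object -Line).Lines
$countExtra    = ($withExtraCharacter  | Measure-Object -Line).Lines

"Prefix ending in a newline: $countTrailing line(s)"
"Prefix ending in '<':       $countExtra line(s)"
```

Without the extra character, a match at the very start of a line would be reported one line too early.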

If you also want the column number (the 1-based index of the character at which the match starts on the line where it was found):

Determining the line and column numbers of regex matches in a multiline string:
# Sample multi-line input.
# Note: The <title> elements are at lines 3 and 6, columns 5 and 7, respectively.
$content = @'
<catalog>
  <book id="bk101">
    <title>De Profundis</title>
  </book>
  <book id="bk102">
      <title>Pygmalion</title>
  </book>
</catalog>
'@

# Regex that finds all <title> elements, along with the
# string that precedes them on the same line:
# Due to use of capture groups, each match $m will contain:
#  * the matched element: $m.Groups[2].Value
#  * the preceding string on the same line: $m.Groups[1].Value
# Inline options ('(?...)'):
#   * 's' makes '.' match newlines too
#   * 'm' makes '^' and '$' match the starts and ends of *individual lines*
$regex = [regex] '(?sm)(^[^\n]*)(<title>.+?</title>)'

foreach ($m in $regex.Matches($content)) {
  $lineNumber = ($content.Substring(0, $m.Index + 1) | Measure-Object -Line).Lines
  $columnNumber = 1 + $m.Groups[1].Value.Length
  "Found '$($m.Groups[2].Value)' at line $lineNumber, column $columnNumber."
}

The above yields:
Found '<title>De Profundis</title>' at line 3, column 5.
Found '<title>Pygmalion</title>' at line 6, column 7.
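
Tying this back to the question, the computed line number can be embedded in the output file name. This is only a sketch: the log name app.log, the inline sample content, and the file-name pattern are assumptions chosen for illustration, and nothing is written to disk.

```powershell
$tag     = 'data_json'
$logName = 'app.log'   # hypothetical log file name

# Hypothetical log content with two <data_json> entries, starting at lines 2 and 4.
$content = "log line 1`n<data_json><a>1</a></data_json>`nlog line 3`n<data_json>`n<a>2</a>`n</data_json>`n"

# 's' makes '.' match newlines too, so an entry may span several lines.
$regex = [regex] "(?s)<$tag>(.+?)</$tag>"

$outputNames = foreach ($m in $regex.Matches($content)) {
  $lineNumber = ($content.Substring(0, $m.Index + 1) | Measure-Object -Line).Lines
  # Build a name such as 'app.log.line2.xml' (the pattern is an assumption).
  '{0}.line{1}.xml' -f $logName, $lineNumber
}

$outputNames
```

In the original script, a name built this way could take the place of the counter-based $outputfile_xml value.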

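Note: For simplicity, both solutions above count lines from the start of the string in every iteration.
In most cases this will probably still perform well enough; if not, see the variant in the performance benchmarks below, where the line count is computed iteratively, so that a given iteration only counts the lines between the previous match and the current one.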
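
As a rough illustration of the iterative idea, the sketch below keeps a running line number and only scans the text between the previous match and the current one. Note that it swaps Measure-Object -Line for a plain newline count via a string-length difference, which is a simplification chosen here and not the benchmarked variant; the variable names are likewise assumptions.

```powershell
# Same sample document as above, joined with LF newlines.
$content = @(
  '<catalog>'
  '  <book id="bk101">'
  '    <title>De Profundis</title>'
  '  </book>'
  '  <book id="bk102">'
  '    <title>Pygmalion</title>'
  '  </book>'
  '</catalog>'
) -join "`n"

$regex = [regex] '(?s)<title>.+?</title>'

$lineNumber = 1   # line number of the character at $position
$position   = 0   # index up to which newlines have already been counted

$results = foreach ($m in $regex.Matches($content)) {
  # Only look at the text between the previous position and this match's start.
  $segment = $content.Substring($position, $m.Index - $position)
  # Number of newlines in the segment = length lost when the newlines are removed.
  $lineNumber += $segment.Length - $segment.Replace("`n", '').Length
  $position = $m.Index
  [pscustomobject]@{ Line = $lineNumber; Value = $m.Value }
}

$results | ForEach-Object { "Found '$($_.Value)' at line $($_.Line)" }
```

Because $position is set to the start of each match, newlines inside a multiline match are counted when the next match is processed, so later line numbers stay correct.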

Optional reading: performance comparison of line-counting approaches:

sln's answer suggests using a regex for the line counting as well.

It may be interesting to compare the performance of those approaches with that of the .Substring() plus Measure-Object -Line approach above.

The following tests are based on the Time-Command function (a helper function that is not built into PowerShell).

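The sample results are from PowerShell Core 7.0.0-preview.3 on macOS 10.14.6, averaged over 100 runs. The absolute numbers will vary with the execution environment, but the relative ranking of the approaches (the Factor column) seems to be similar across platforms and PowerShell versions:
  • With 1,000 lines and 1 match on the last line: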

    Factor Secs (100-run avg.) Command
    ------ ------------------- -------
    1.00   0.001               # .Substring() + Measure-Object -Line, count iteratively…
    1.07   0.001               # .Substring() + Measure-Object -Line, count from start…
    2.22   0.002               # Repeating capture with nested newline capturing…
    6.12   0.006               # Prefix capture group + Measure-Object -Line…
    6.72   0.007               # Prefix capture group + newline-matching regex…
    7.24   0.007               # Prefix Capture group + -split…
    
  • With 20,000 lines and 20 evenly spaced matches, starting from line 10,000:

    Factor Secs (100-run avg.) Command
    ------ ------------------- -------
    1.00   0.014               # .Substring() + Measure-Object -Line, count iteratively…
    2.92   0.042               # Repeating capture with nested newline capturing…
    7.50   0.107               # .Substring() + Measure-Object -Line, count from start…
    8.39   0.119               # Prefix capture group + Measure-Object -Line…
    9.50   0.135               # Prefix capture group + newline-matching regex…
    9.94   0.141               # Prefix Capture group + -split…
    

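Notes and conclusions:
  • Prefix capture group refers to (variations of) "Way1" from sln's answer, whereas Repeating capture ... refers to "Way2".
  • Note: For Way2, the (adapted) regex (?:.*(\r?\n))*?.*?(match_me) is used below, which is sln's improved version that was added later in a comment. The version still shown in the body of that answer (as of this writing), ^(?:.*((?:\r?\n)?))*?(match_me), cannot handle multiple matches in a loop.
  • The .Substring() + Measure-Object -Line approach from this answer is the fastest in all cases. However, when there are many matches to loop over, that only holds if the line counting is performed iteratively, between matches (.Substring() + Measure-Object -Line, count iteratively…), whereas the solutions above, for simplicity, count lines from the start for every match (# .Substring() + Measure-Object -Line, count from start…).
  • With the Way1 approach (Prefix capture group), the specific method used to count the newlines in the prefix match makes little difference, although Measure-Object -Line is the fastest there too.

Here is the source code for the tests; it is easy to experiment with the match count, the total number of input lines, and so on, by modifying the variables near the bottom: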

    # The script blocks with the various approaches.
    $sbs =
      { # .Substring() + Measure-Object -Line, count from start
        foreach ($m in [regex]::Matches($txt, 'found')) {
          # !! Measure-Object -Line ignores a trailing \n, so if the match is at the
          # !! start of a line, we need to include at least 1 additional character for the line to register.
          $lineNo = ($txt.Substring(0, $m.Index + 1) | Measure-Object -Line).Lines
          "Found at line $lineNo (substring() + Measure-Object -Line, counted from start every time)."
        }
      },
      { # .Substring() + Measure-Object -Line, count iteratively
        $lineNo = 0; $startNdx = 0
        foreach ($m in [regex]::Matches($txt, 'found')) {
          # !! Measure-Object -Line ignores a trailing \n, so if the match is at the
          # !! start of a line, we need to include at least 1 additional character for the line to register.
          $lineNo += ($txt.Substring($startNdx, $m.Index + 1 - $startNdx) | Measure-Object -Line).Lines
          "Found at line $lineNo (substring() + Measure-Object -Line, counted iteratively)."
          $startNdx = $m.Index + $m.Value.Length
          # !! Decrement the cumulative line number to compensate for counting starting at 1 also for subsequent matches.
          --$lineNo
        }
      },
      { # Prefix capture group + Measure-Object -Line
        $lineNo = 0
        foreach ($m in [regex]::Matches($txt, '(?s)(.*?)found')) {
          # !! Measure-Object -Line ignores a trailing \n, so if the match is at the
          # !! start of a line, we need to include at least 1 additional character for the line to register.
          $lineNo += ($m.Groups[1].Value + '.' | Measure-Object -Line).Lines
          "Found at line $lineNo (prefix capture group + Substring() + Measure-Object -Line)."
          # !! Decrement the cumulative line number to compensate for counting starting at 1 also for subsequent matches.
          --$lineNo
        }
      },
      { # Prefix capture group + newline-matching regex
        $lineNo = 0
        foreach ($m in [regex]::Matches($txt, '(?s)(.*?)found')) {
          $lineNo += 1 + [regex]::Matches($m.Groups[1].Value, '\r?\n').Count
          "Found at line $lineNo (prefix capture group + newline-matching regex)."
          # !! Decrement the cumulative line number to compensate for counting starting at 1 also for subsequent matches.
          --$lineNo
        }
      },
      { # Prefix Capture group + -split
        $lineNo = 0
        foreach ($m in [regex]::Matches($txt, '(?s)(.*?)found')) {
          $lineNo += ($m.Groups[1].Value -split '\r?\n').Count
          "Found at line $lineNo (prefix capture group + -split for counting)."
          # !! Decrement the cumulative line number to compensate for counting starting at 1 also for subsequent matches.
          --$lineNo
        }
      },
      { # Repeating capture with nested newline capturing
        $lineNo = 0
        foreach ($m in [regex]::Matches($txt, '(?:.*(\r?\n))*?.*?found')) {
          $lineNo += 1 + $m.Groups[1].Captures.Count
          "Found at line $lineNo (repeating prefix capture group with newline capture)."
          # !! Decrement the cumulative line number to compensate for counting starting at 1 also for subsequent matches.
          --$lineNo
        }
      }
    
    # Set this to 1 for debugging:
    #   * runs the script blocks only once
    #   * with 3 matching strings in the input.
    #   * shows output so that the expected functionality (number of matches, line numbers) can be verified.
    $debug = 0
    
    $matchCount = if ($debug) { 3 } else {
      20 # Set how many matching strings should be present in the input string.
    }
    
    # Sample input:
    # Create N lines that are 60 chars. wide, with the string to find on the last line...
    $n = 1e3 # Set the number of lines per match.
    $txt = ((1..($n-1)).foreach('ToString', '0' * 60) -join "`n") + "`n  found`n"
    # ...and multiply the original string according to how many matches should be present.
    $txt = $txt * $matchCount
    
    $runsToAverage = if ($debug) { 1 } else {
      100   # Set how many test runs to report average timing for.
    }
    $showOutput = [bool] $debug
    
    # Run the tests.
    Time-Command -Count $runsToAverage -OutputToHost:$showOutput $sbs
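
If the Time-Command helper is not at hand, a rough timing can be obtained with the built-in Measure-Command cmdlet instead. The sketch below is not a replacement for the benchmark above: Get-AverageSeconds is a hypothetical helper written for this illustration, it produces no Factor column, and only one of the approaches is timed.

```powershell
# Hypothetical stand-in for Time-Command: average wall-clock seconds over N runs.
function Get-AverageSeconds {
  param(
    [Parameter(Mandatory)] [scriptblock] $ScriptBlock,
    [int] $Count = 10
  )
  $total = 0.0
  foreach ($run in 1..$Count) {
    # Measure-Command runs the script block, discards its output, and returns a TimeSpan.
    $total += (Measure-Command -Expression $ScriptBlock).TotalSeconds
  }
  $total / $Count
}

# Sample input: 999 filler lines of 60 characters, then one line containing 'found'.
$txt = (('x' * 60 + "`n") * 999) + "  found`n"

$avg = Get-AverageSeconds -Count 5 -ScriptBlock {
  foreach ($m in [regex]::Matches($txt, 'found')) {
    ($txt.Substring(0, $m.Index + 1) | Measure-Object -Line).Lines
  }
}

'{0:N4} seconds on average' -f $avg
```

Timing the other script blocks the same way and comparing the averages gives a coarse ranking of the approaches.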
    

Regarding "regex - Get the line number of a Text.RegularExpressions.Regex match", the corresponding question on Stack Overflow is: https://stackoverflow.com/questions/57763789/
