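I see there is a field named "Index", which I think is a character count and which should probably lead me to the line number. However, "Index" seems to count something that Measure-Object -Character does not, because the value of Index is larger than the size reported by Measure-Object -Character. For example, $m.Groups[0].Captures[0].Index is 9963166, whereas Measure-Object -Character reports at most 9838833 for an entire file in the log directory, so I think Index counts newlines as well.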


$tag = 'data_json'
$input_dir = $absolute_root_dir + $specific_dir
$output_dir = $input_dir  + 'ParsedDataFiles\'
$OFS = "`r`n"
$nice_specific_dir = $specific_dir.Replace('\','_')
$nice_specific_dir = $nice_specific_dir.Replace(':','_')
$regex = New-Object Text.RegularExpressions.Regex "<$tag>(.+?)<\/$tag>", ('singleline', 'multiline')
New-Item -ItemType Directory -Force -Path $output_dir
Get-ChildItem -Path $input_dir -Name -File | % {   
    $output_file = $output_dir + $nice_specific_dir + $_ + '.'
    $content = Get-Content ($input_dir + $_)
    $i = 0
    foreach($m in $regex.Matches($content)) {        
        $outputfile_xml = $output_file + $i++ + '.xml'
        $outputfile_txt = $output_file + $i++ + '.txt'
        $xml = [xml] ("<" + $tag+ ">" + $m.Groups[1].Value + "</" + $tag + ">")
        $j = 0
        $xml.data_json.Messages.source.item | % { $_.SortOrder + ", " + $_.StartOn + ", " + $_.EndOn + ", " + $_.Id } | sort | %  { 
            (($j++).ToString() + ", " + $_ )   | Out-File $outputfile_txt -Append
        }
    }
}


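Note: If what your regex matches is guaranteed never to span multiple lines, i.e. if the matched text is guaranteed to be on a single line, consider a simpler, Select-String-based solution as shown in js2010's answer; generally, though, a method/expression-based solution as in this answer will perform better.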
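As a rough illustration of the Select-String route for single-line matches (this is not js2010's actual code; the temporary file and the sample text are made up for this sketch), each MatchInfo object reports a LineNumber, and each entry in its Matches collection reports an Index within that line:

```powershell
# Hypothetical sample file, created only for this illustration.
$file = [IO.Path]::GetTempFileName()
[IO.File]::WriteAllText($file, "<catalog>`n  <book>`n    <title>A</title>`n")

# -AllMatches reports every match on a line, not just the first one.
$results = @(
  Select-String -Path $file -Pattern '<title>.+?</title>' -AllMatches |
    ForEach-Object {
      foreach ($m in $_.Matches) {
        # $_ is still the MatchInfo object; $m.Index is 0-based within the line.
        "line $($_.LineNumber), column $($m.Index + 1)"
      }
    }
)

Remove-Item $file
$results
```

This only works when a match cannot span lines, because Select-String tests each line separately.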



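Your first problem is that Get-Content without -Raw returns the file as an array of lines rather than as a single string; when that array is passed to $regex.Matches(), PowerShell joins the lines into one string using $OFS, so the reported indices refer to that joined string. Therefore, read your input file with Get-Content -Raw, which ensures that it is read as a single, multi-line string with the newlines intact: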

# Read entire file as single string
$content = Get-Content -Raw ($input_dir + $_)
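To see why a match index can exceed the count reported by Measure-Object -Character, here is a minimal sketch using a small temporary file (the file and its contents are assumptions made for illustration). Without -Raw the newlines are stripped from each line, so they are not counted; with -Raw they stay part of the string and shift every later index:

```powershell
# Hypothetical sample file, created only for this illustration.
$file = [IO.Path]::GetTempFileName()
[IO.File]::WriteAllText($file, "one`ntwo`nthree`n")

$lines = Get-Content $file        # array of 3 strings, newlines removed
$raw   = Get-Content -Raw $file   # one string, newlines preserved

# Characters summed over the individual lines (no newlines included).
$charsWithoutNewlines = ($lines | Measure-Object -Character).Characters

# Index of 'three' in the raw string; the two preceding newlines count.
$indexInRaw = $raw.IndexOf('three')

Remove-Item $file
"Lines: $($lines.Count); chars without newlines: $charsWithoutNewlines; raw length: $($raw.Length); index of 'three': $indexInRaw"
```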

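Once you match against the multi-line string, you can infer the line number of each match by counting the lines in the substring that extends up to the character index at which the match was found, using .Substring() and Measure-Object -Line: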

# Sample multi-line input.
# Note: The <title> elements are at lines 3 and 6, respectively.
$content = @'
<catalog>
  <book id="bk101">
    <title>De Profundis</title>
  </book>
  <book id="bk102">
    <title>Pygmalion</title>
  </book>
</catalog>
'@

# Regex that finds all <title> elements.
# Inline option ('(?...)') 's' makes '.' match newlines too
$regex = [regex] '(?s)<title>.+?</title>'

foreach ($m in $regex.Matches($content)) {
  # Count the lines up to and including the first character of the match.
  $lineNumber = ($content.Substring(0, $m.Index + 1) | Measure-Object -Line).Lines
  "Found '$($m.Value)' at index $($m.Index), line $lineNumber"
}

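Note the + 1 in $m.Index + 1, which is needed to ensure that the substring does not end in a newline, because Measure-Object -Line ignores such a trailing newline. By including at least one additional (non-newline) character, namely the < of the matched element, the line count is always correct, even if the matched element starts in the very first column.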

The above yields:

Found '<title>De Profundis</title>' at index 34, line 3
Found '<title>Pygmalion</title>' at index 96, line 6
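The effect of the + 1 described above can be checked in isolation. In this small sketch (the sample string is made up), the match starts in the first column of line 2, so the substring that ends just before it ends in a newline and is counted as only one line:

```powershell
# 'second' starts at the very beginning of line 2.
$text = "first`nsecond`nthird"
$idx  = $text.IndexOf('second')

# Substring ends in a newline, which Measure-Object -Line ignores.
$withoutPlusOne = ($text.Substring(0, $idx) | Measure-Object -Line).Lines

# Including the first character of the match makes line 2 register.
$withPlusOne = ($text.Substring(0, $idx + 1) | Measure-Object -Line).Lines

"Without + 1: $withoutPlusOne; with + 1: $withPlusOne"
```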


If you also want the column number (the 1-based position of the character that starts the match on its line), additionally capture the text that precedes the match on the same line:

# Sample multi-line input.
# Note: The <title> elements are at lines 3 and 6, columns 5 and 7, respectively.
$content = @'
<catalog>
  <book id="bk101">
    <title>De Profundis</title>
  </book>
  <book id="bk102">
      <title>Pygmalion</title>
  </book>
</catalog>
'@

# Regex that finds all <title> elements, along with the
# string that precedes them on the same line:
# Due to use of capture groups, each match $m will contain:
#  * the matched element: $m.Groups[2].Value
#  * the preceding string on the same line: $m.Groups[1].Value
# Inline options ('(?...)'):
#   * 's' makes '.' match newlines too
#   * 'm' makes '^' and '$' match the starts and ends of *individual lines*
$regex = [regex] '(?sm)(^[^\n]*)(<title>.+?</title>)'

foreach ($m in $regex.Matches($content)) {
  $lineNumber = ($content.Substring(0, $m.Index + 1) | Measure-Object -Line).Lines
  # The column is 1 plus the length of the text that precedes the element on its line.
  $columnNumber = 1 + $m.Groups[1].Value.Length
  "Found '$($m.Groups[2].Value)' at line $lineNumber, column $columnNumber."
}

The above yields:
Found '<title>De Profundis</title>' at line 3, column 5.
Found '<title>Pygmalion</title>' at line 6, column 7.
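As a possible alternative to capturing the line prefix with the regex (this is a different technique from the one shown above, sketched here with a made-up sample string), the column can also be derived from the match index alone by locating the last newline before the match with String.LastIndexOf():

```powershell
# Made-up sample: the <title> element is on line 3, column 5.
$content = "<a>`n  <b>`n    <title>X</title>`n"
$m = [regex]::Match($content, '<title>.+?</title>')

# Index of the last newline at or before the match start (-1 if there is none).
# The [char] overload performs a plain ordinal search.
$lastNewline = $content.LastIndexOf([char] "`n", $m.Index)

# 1-based column: distance from that newline to the match start.
$column = $m.Index - $lastNewline
$line   = ($content.Substring(0, $m.Index + 1) | Measure-Object -Line).Lines

"Found '$($m.Value)' at line $line, column $column."
```

If the match is on the first line, LastIndexOf() returns -1 and the column works out to $m.Index + 1.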



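sln's answer suggests using regular expressions for the line counting as well.

It may be interesting to compare these approaches, along with the .Substring() plus Measure-Object -Line approach above, in terms of performance.

The following tests are based on the Time-Command function.

The sample results are from PowerShell Core 7.0.0-preview.3 on macOS 10.14.6, averaged over 100 runs; the absolute numbers will vary with the execution environment, but the relative ranking of the approaches (Factor column) seems to be similar across platforms and PowerShell versions: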
  • With 1,000 lines and 1 match on the last line:

    Factor Secs (100-run avg.) Command
    ------ ------------------- -------
    1.00   0.001               # .Substring() + Measure-Object -Line, count iteratively…
    1.07   0.001               # .Substring() + Measure-Object -Line, count from start…
    2.22   0.002               # Repeating capture with nested newline capturing…
    6.12   0.006               # Prefix capture group + Measure-Object -Line…
    6.72   0.007               # Prefix capture group + newline-matching regex…
    7.24   0.007               # Prefix Capture group + -split…
  • With 20,000 lines and 20 evenly spaced matches:

    Factor Secs (100-run avg.) Command
    ------ ------------------- -------
    1.00   0.014               # .Substring() + Measure-Object -Line, count iteratively…
    2.92   0.042               # Repeating capture with nested newline capturing…
    7.50   0.107               # .Substring() + Measure-Object -Line, count from start…
    8.39   0.119               # Prefix capture group + Measure-Object -Line…
    9.50   0.135               # Prefix capture group + newline-matching regex…
    9.94   0.141               # Prefix Capture group + -split…

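  • Prefix capture group refers to (variations of) "Way1" from sln's answer, whereas Repeating capture with nested newline capturing refers to "Way2".
  • Note: For Way2, an adapted form of the regex (?:.*(\r?\n))*?.*?(match_me) is used below, which is an improved version that sln added later in a comment. The version still shown in the body of that answer (as of this writing), ^(?:.*((?:\r?\n)?))*?(match_me), cannot handle multiple matches in a loop.
  • The .Substring() + Measure-Object -Line approach from this answer is the fastest in all cases. With many matches to loop over, however, that holds only if the lines are counted iteratively, between matches (.Substring() + Measure-Object -Line, count iteratively…), whereas the solution above, for simplicity, counts lines from the start of the input for every match (.Substring() + Measure-Object -Line, count from start…).
  • With the Way1 approach (Prefix capture group), the specific method used to count the newlines in the prefix match makes little difference, although Measure-Object -Line is the fastest there as well.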

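  • Here is the source code for the tests; it is easy to experiment with the match count, the total number of input lines, and so on, by modifying the variables near the bottom: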

    # The script blocks with the various approaches.
    $sbs =
      { # .Substring() + Measure-Object -Line, count from start
        foreach ($m in [regex]::Matches($txt, 'found')) {
          # !! Measure-Object -Line ignores a trailing \n, so if the match is at the
          # !! start of a line, we need to include at least 1 additional character for the line to register.
          $lineNo = ($txt.Substring(0, $m.Index + 1) | Measure-Object -Line).Lines
          "Found at line $lineNo (substring() + Measure-Object -Line, counted from start every time)."
        }
      },
      { # .Substring() + Measure-Object -Line, count iteratively
        $lineNo = 0; $startNdx = 0
        foreach ($m in [regex]::Matches($txt, 'found')) {
          # !! Measure-Object -Line ignores a trailing \n, so if the match is at the
          # !! start of a line, we need to include at least 1 additional character for the line to register.
          $lineNo += ($txt.Substring($startNdx, $m.Index + 1 - $startNdx) | Measure-Object -Line).Lines
          "Found at line $lineNo (substring() + Measure-Object -Line, counted iteratively)."
          $startNdx = $m.Index + $m.Value.Length
          # !! Decrement the cumulative line number to compensate for counting starting at 1 also for subsequent matches.
          --$lineNo
        }
      },
      { # Prefix capture group + Measure-Object -Line
        $lineNo = 0
        foreach ($m in [regex]::Matches($txt, '(?s)(.*?)found')) {
          # !! Measure-Object -Line ignores a trailing \n, so if the match is at the
          # !! start of a line, we need to include at least 1 additional character for the line to register.
          $lineNo += ($m.Groups[1].Value + '.' | Measure-Object -Line).Lines
          "Found at line $lineNo (prefix capture group + Substring() + Measure-Object -Line)."
          # !! Decrement the cumulative line number to compensate for counting starting at 1 also for subsequent matches.
          --$lineNo
        }
      },
      { # Prefix capture group + newline-matching regex
        $lineNo = 0
        foreach ($m in [regex]::Matches($txt, '(?s)(.*?)found')) {
          $lineNo += 1 + [regex]::Matches($m.Groups[1].Value, '\r?\n').Count
          "Found at line $lineNo (prefix capture group + newline-matching regex)."
          # !! Decrement the cumulative line number to compensate for counting starting at 1 also for subsequent matches.
          --$lineNo
        }
      },
      { # Prefix Capture group + -split
        $lineNo = 0
        foreach ($m in [regex]::Matches($txt, '(?s)(.*?)found')) {
          $lineNo += ($m.Groups[1].Value -split '\r?\n').Count
          "Found at line $lineNo (prefix capture group + -split for counting)."
          # !! Decrement the cumulative line number to compensate for counting starting at 1 also for subsequent matches.
          --$lineNo
        }
      },
      { # Repeating capture with nested newline capturing
        $lineNo = 0
        foreach ($m in [regex]::Matches($txt, '(?:.*(\r?\n))*?.*?found')) {
          $lineNo += 1 + $m.Groups[1].Captures.Count
          "Found at line $lineNo (repeating prefix capture group with newline capture)."
          # !! Decrement the cumulative line number to compensate for counting starting at 1 also for subsequent matches.
          --$lineNo
        }
      }

    # Set this to 1 for debugging:
    #   * runs the script blocks only once
    #   * with 3 matching strings in the input.
    #   * shows output so that the expected functionality (number of matches, line numbers) can be verified.
    $debug = 0

    $matchCount = if ($debug) { 3 } else {
      20 # Set how many matching strings should be present in the input string.
    }

    # Sample input:
    # Create N lines that are 60 chars. wide, with the string to find on the last line...
    $n = 1e3 # Set the number of lines per match.
    $txt = ((1..($n-1)).foreach('ToString', '0' * 60) -join "`n") + "`n  found`n"
    # ...and multiply the original string according to how many matches should be present.
    $txt = $txt * $matchCount

    $runsToAverage = if ($debug) { 1 } else {
      100   # Set how many test runs to report average timing for.
    }
    $showOutput = [bool] $debug

    # Run the tests.
    Time-Command -Count $runsToAverage -OutputToHost:$showOutput $sbs
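
Time-Command is not a built-in command, so if it is not available, a single approach can be timed roughly with the built-in Measure-Command cmdlet instead. The following is only a sketch under stated assumptions: the synthetic input (5 chunks of 1,000 lines) and the $found list are made up for illustration, it times a single run rather than an average, and it covers only the iterative .Substring() + Measure-Object -Line approach:

```powershell
# Synthetic input (an assumption for this sketch): 5 chunks of 1,000 lines,
# each chunk ending in a line that contains 'found'.
$chunk = ((1..999).ForEach({ 'x' * 60 }) -join "`n") + "`n  found`n"
$txt = $chunk * 5

# Collects the line numbers so the result can be verified after timing.
$found = [System.Collections.Generic.List[int]]::new()

$elapsed = Measure-Command {
  $lineNo = 0; $startNdx = 0
  foreach ($m in [regex]::Matches($txt, 'found')) {
    # Count only the lines between the previous match and this one.
    $lineNo += ($txt.Substring($startNdx, $m.Index + 1 - $startNdx) | Measure-Object -Line).Lines
    $found.Add($lineNo)
    $startNdx = $m.Index + $m.Value.Length
    # The next substring starts on the current line again, so avoid counting it twice.
    --$lineNo
  }
}

"Lines: $($found -join ', '); elapsed: $($elapsed.TotalMilliseconds) ms"
```

A single timing like this is noisy; averaging over many runs, as the tests above do, gives more reliable relative rankings.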

