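regex - Parsing a fixed-length field file with PowerShell and Regex: how do I replace empty capture groups with zero?

Tags: regex performance powershell parsing

I'm using a PowerShell script and Regex to convert huge (> 1GB) fixed-field-length text files into importable tab-delimited files. The code is very fast. I need to change some of the captured fields (say the 4th, 6th, and 7th) to 0 if they are empty after trimming. Is there a very fast way to do this, perhaps as part of the regex capture, without slowing the process down significantly?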

DATA

ID         FIRST_NAME              LAST_NAME          COLUMN_NM_TOO_LON5THCOLUMN
10000000001MINNIE                  MOUSE              COLUMN VALUE LONGSTARTS 


PROGRAM

$proc_yyyymm = '201912'
$match_data_regex = '^(.{10})(.{10})(.{30})(.{30})(.{30})(.{4})(.{8})(.{10})(.{1})(.{15})(.{12})'

while ($line = $stream_in.ReadLine()) {

   if ($line -match $match_data_regex) {
      $new_line = "$proc_yyyymm`t" + ($Matches[1..($Matches.Count-1)].Trim() -join "`t")
      $stream_out.WriteLine($new_line)
   }
}

Best answer

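After making a few adjustments to the code for demonstration purposes...

  • truncating the regex to match the sample data
  • changing the output delimiter (now $delimiter) to , so the results are easy to see
  • using a StringReader and a StringWriter for input and output, respectively

...and given...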

    $text = @'
    ID         FIRST_NAME              LAST_NAME          COLUMN_NM_TOO_LON5THCOLUMN
    10000000001MINNIE                  MOUSE              COLUMN VALUE LONGSTARTS   
    10000000002PLUTO                                      COLUMN VALUE LONGSTARTS   
    '@
    

...the technique you suggested, adjusting the matched text at specific indices, would look like this...

    $proc_yyyymm = '201912'
    $match_regex = '^(.{11})(.{24})(.{19})(.{17})(.{9})'
    
    $delimiter = ','
    $indicesToNormalizeToZero = ,2
    
    $stream_in = New-Object -TypeName 'System.IO.StringReader' -ArgumentList $text
    $stream_out = New-Object -TypeName 'System.IO.StringWriter'
    
    while ($line = $stream_in.ReadLine()) {
        if ($line -match $match_regex) {
            $trimmedMatches = $Matches[1..($Matches.Count-1)].Trim()
            foreach ($index in $indicesToNormalizeToZero)
            {
                if ($trimmedMatches[$index] -eq '')
                {
                    $trimmedMatches[$index] = '0'
                }
            }
    
            $new_line = "$proc_yyyymm$delimiter" + ($trimmedMatches -join $delimiter)
            $stream_out.WriteLine($new_line)
        }
    }
    
    $stream_out.ToString()
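
Applied to the question's original 11-field layout, the same technique only needs a different index list. The following is a minimal sketch, assuming the question's regex and a made-up record: `$widths` and `$values` are hypothetical and exist only to build a test line. Note that fields 4, 6, and 7 (counted from 1) become indices 3, 5, and 6 in the 0-based `$trimmedMatches` array.

```powershell
# Assumption: the 11-field layout from the question's regex; the values are made up.
$widths = 10, 10, 30, 30, 30, 4, 8, 10, 1, 15, 12
$values = 'A', 'B', 'C', '', 'E', '', '', 'H', 'I', 'J', 'K'

# Pad each value to its field width and concatenate into one fixed-length record
$line = -join (0..10 | ForEach-Object { $values[$_].PadRight($widths[$_]) })

$match_regex = '^(.{10})(.{10})(.{30})(.{30})(.{30})(.{4})(.{8})(.{10})(.{1})(.{15})(.{12})'

# Fields 4, 6 and 7 (1-based) are indices 3, 5 and 6 in the 0-based array
$indicesToNormalizeToZero = 3, 5, 6

if ($line -match $match_regex) {
    $trimmedMatches = $Matches[1..($Matches.Count - 1)].Trim()
    foreach ($index in $indicesToNormalizeToZero) {
        if ($trimmedMatches[$index] -eq '') {
            $trimmedMatches[$index] = '0'
        }
    }

    $new_line = $trimmedMatches -join "`t"
}

$new_line
```

The per-line cost is three extra comparisons, so this should add little overhead relative to the regex match itself, though that is worth measuring on real data.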
    

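An alternative is to use the [Regex]::Replace() method. It is useful when you need to perform a custom transformation on a match that can't be expressed with a regex substitution. Admittedly, it may not be a good fit here: because you are matching an entire line rather than individual fields, the code that handles the match needs to know which field is which.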

    $proc_yyyymm = '201912'
    $match_regex = [Regex] '^(.{11})(.{24})(.{19})(.{17})(.{9})'
    $match_evaluator = {
        param($match)
    
        # The first element of Groups contains the entire matched text; skip it
        $fields = $match.Groups `
            | Select-Object -Skip 1 `
            | ForEach-Object -Process {
                $field = $_.Value.Trim()
                if ($groupsToNormalizeToZero -contains $_.Name -and $field -eq '')
                {
                    $field = '0'
                }
    
                return $field
            }
    
        return "$proc_yyyymm$delimiter" + ($fields -join $delimiter)
    }
    
    $delimiter = ','
    # Replace with a HashSet/Hashtable for better lookup performance
    $groupsToNormalizeToZero = ,'3'
    
    $stream_in = New-Object -TypeName 'System.IO.StringReader' -ArgumentList $text
    $stream_out = New-Object -TypeName 'System.IO.StringWriter'
    
    while ($line = $stream_in.ReadLine()) {
        $new_line = $match_regex.Replace($line, $match_evaluator)
    
        # The original input string is returned if there was no match
        if (-not [Object]::ReferenceEquals($line, $new_line)) {
            $stream_out.WriteLine($new_line)
        }
    }
    
    $stream_out.ToString()
    
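$match_evaluator is a MatchEvaluator delegate. It is called for each successful match found in the input text passed to Replace() and returns whatever you want the replacement text to be. Inside it I'm doing the same kind of index-specific transformation, comparing the group name (which is its index as a [String]) to a known list ($groupsToNormalizeToZero). You could use named groups instead, although I found that this changes the order of $match.Groups. There may be a better application of [Regex]::Replace() here, but it isn't occurring to me right now.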
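
The comment in the code above suggests swapping the lookup list for a HashSet or Hashtable. Here is a minimal sketch of that idea, assuming numbered groups and the sample layout. `$regex`, `$line`, and `$result` are illustrative names, and the record is built with PadRight() so that its length is exact.

```powershell
# Assumption: the five-field sample layout (11 + 24 + 19 + 17 + 9 characters)
$regex = [Regex] '^(.{11})(.{24})(.{19})(.{17})(.{9})'

# Group names are strings, so a HashSet[string] gives constant-time lookups
$groupsToNormalizeToZero = New-Object -TypeName 'System.Collections.Generic.HashSet[string]'
[void] $groupsToNormalizeToZero.Add('3')

# Build a sample record with an empty third field
$line = -join @(
    '10000000002'.PadRight(11)
    'PLUTO'.PadRight(24)
    ''.PadRight(19)
    'COLUMN VALUE LONG'.PadRight(17)
    'STARTS'.PadRight(9)
)

$match = $regex.Match($line)

# Start at 1 because Groups[0] holds the entire matched text
$fields = for ($i = 1; $i -lt $match.Groups.Count; $i++) {
    $group = $match.Groups[$i]
    $field = $group.Value.Trim()
    if ($field -eq '' -and $groupsToNormalizeToZero.Contains($group.Name)) {
        '0'
    } else {
        $field
    }
}

$result = $fields -join ','
$result
```

With only a handful of groups the difference from an array lookup is probably negligible. The set matters more when many fields need normalizing.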

As an alternative to using a regex, since the field lengths are known, you can extract the fields directly from $line with the Substring() method.

    $proc_yyyymm = '201912'
    $delimiter = ','
    
    $stream_in = New-Object -TypeName 'System.IO.StringReader' -ArgumentList $text
    $stream_out = New-Object -TypeName 'System.IO.StringWriter'
    
    while ($line = $stream_in.ReadLine()) {
        $id =                $line.Substring( 0, 11).Trim()
        $firstName =         $line.Substring(11, 24).Trim()
        $lastName =          $line.Substring(35, 19).Trim()
        $columnNameTooLong = $line.Substring(54, 17).Trim()
        $fifthColumn =       $line.Substring(71,  9).Trim()
    
        if ($lastName -eq '')
        {
            $lastName = '0'
        }
    
        $new_line = $proc_yyyymm,$id,$firstName,$lastName,$columnNameTooLong,$fifthColumn -join $delimiter
        $stream_out.WriteLine($new_line)
    }
    
    $stream_out.ToString()
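
One behavioral difference from the regex versions: Substring() throws an ArgumentOutOfRangeException when a line is shorter than the requested offset plus length, whereas -match simply fails to match and the line is skipped. Below is a minimal sketch of a length guard, assuming 80-character records. `$recordLength`, `$lines`, and `$lastNames` are hypothetical names used only for illustration.

```powershell
# Assumption: every valid record is 11 + 24 + 19 + 17 + 9 = 80 characters long
$recordLength = 80

# One truncated line and one full-length record with MOUSE in the third field
$lines = @(
    'too short'
    ('10000000001'.PadRight(35) + 'MOUSE').PadRight($recordLength)
)

$lastNames = foreach ($line in $lines) {
    # Skip lines that Substring() could not slice safely
    if ($line.Length -lt $recordLength) {
        continue
    }

    $line.Substring(35, 19).Trim()
}

$lastNames
```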
    

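Better yet, since the length of each line is known, you can avoid ReadLine()'s newline checks and the subsequent String allocation by reading each line as a block of Chars and extracting the fields from that.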

    function ExtractField($chars, $startIndex, $length, $normalizeIfFirstCharWhitespace = $false)
    {
        # If the first character of a field is whitespace, assume the
        # entire field is as well to avoid a String allocation and Trim()
        if ($normalizeIfFirstCharWhitespace -and [Char]::IsWhiteSpace($chars[$startIndex])) {
            return '0'
        } else {
            # Create a String from the span of Chars at known boundaries and trim it
            return (New-Object -TypeName 'String' -ArgumentList ($chars, $startIndex, $length)).Trim()
        }
    }
    
    $proc_yyyymm = '201912'
    $delimiter = ','
    
    $stream_in = New-Object -TypeName 'System.IO.StringReader' -ArgumentList $text
    $stream_out = New-Object -TypeName 'System.IO.StringWriter'
    
    $lineLength = 82 # Assumes the last line ends with an \r\n and not EOF
    $lineChars = New-Object -TypeName 'Char[]' -ArgumentList $lineLength
    
    while (($lastReadCount = $stream_in.ReadBlock($lineChars, 0, $lineLength)) -gt 0)
    {
        $id                = ExtractField $lineChars  0 11
        $firstName         = ExtractField $lineChars 11 24
        $lastName          = ExtractField $lineChars 35 19 $true
        $columnNameTooLong = ExtractField $lineChars 54 17
        $fifthColumn       = ExtractField $lineChars 71  9
    
        # Are all these method calls better or worse than a single WriteLine() and object allocation(s)?
        $stream_out.Write($proc_yyyymm)
        $stream_out.Write($delimiter)
        $stream_out.Write($id)
        $stream_out.Write($delimiter)
        $stream_out.Write($firstName)
        $stream_out.Write($delimiter)
        $stream_out.Write($lastName)
        $stream_out.Write($delimiter)
        $stream_out.Write($columnNameTooLong)
        $stream_out.Write($delimiter)
        $stream_out.WriteLine($fifthColumn)
    }
    
    $stream_out.ToString()
    

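Since @HAL9256's answer confirms that PowerShell functions are very slow, a way to do the same thing without redundant code and without a function is to define a collection of field descriptors and loop over it, extracting each field from the appropriate offset...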

    $proc_yyyymm = '201912'
    $delimiter = ','
    
    $stream_in = New-Object -TypeName 'System.IO.StringReader' -ArgumentList $text
    $stream_out = New-Object -TypeName 'System.IO.StringWriter'
    
    $lineLength = 82 # Assumes the last line ends with an \r\n and not EOF
    $lineChars = New-Object -TypeName 'Char[]' -ArgumentList $lineLength
    
    # This could also be done with 'Offset,Length,NormalizeIfEmpty' | ConvertFrom-Csv
    # The Offset property could be omitted in favor of calculating it in the loop
    # based on the Length, however this way A) avoids the extra variable/addition,
    # B) allows fields to be ignored if desired, and C) allows fields to be output
    # in a different order than the input.
    $fieldDescriptors = @(
        @{ Offset =  0; Length = 11; NormalizeIfEmpty = $false },
        @{ Offset = 11; Length = 24; NormalizeIfEmpty = $false },
        @{ Offset = 35; Length = 19; NormalizeIfEmpty = $true  },
        @{ Offset = 54; Length = 17; NormalizeIfEmpty = $false },
        @{ Offset = 71; Length =  9; NormalizeIfEmpty = $false }
    ) | ForEach-Object -Process { [PSCustomObject] $_ }
    
    while (($lastReadCount = $stream_in.ReadBlock($lineChars, 0, $lineLength)) -gt 0)
    {
        $stream_out.Write($proc_yyyymm)
    
        foreach ($fieldDescriptor in $fieldDescriptors)
        {
            # If the first character of a field is whitespace, assume the
            # entire field is as well to avoid a String allocation and Trim()
            # If space is the only possible whitespace character,
            # $lineChars[$fieldDescriptor.Offset] -eq [Char] ' ' may be faster than IsWhiteSpace()
            $fieldText = if ($fieldDescriptor.NormalizeIfEmpty `
                -and [Char]::IsWhiteSpace($lineChars[$fieldDescriptor.Offset])
            ) {
                '0'
            } else {
                # Create a String from the span of Chars at known boundaries and trim it
                (
                    New-Object -TypeName 'String' -ArgumentList (
                        $lineChars, $fieldDescriptor.Offset, $fieldDescriptor.Length
                    )
                ).Trim()
            }
    
            $stream_out.Write($delimiter)
            $stream_out.Write($fieldText)
        }
    
        $stream_out.WriteLine()
    }
    
    $stream_out.ToString()
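
The comment in the code above mentions ConvertFrom-Csv as another way to build $fieldDescriptors. A minimal sketch of that variant follows, assuming the same five-field layout. ConvertFrom-Csv produces string properties, so each one is cast explicitly here. Otherwise the string 'false' would be truthy in the -and test.

```powershell
# Field layout as CSV text; ConvertFrom-Csv returns objects whose properties are strings
$fieldDescriptors = @'
Offset,Length,NormalizeIfEmpty
0,11,false
11,24,false
35,19,true
54,17,false
71,9,false
'@ | ConvertFrom-Csv | ForEach-Object -Process {
    # Cast to the types the extraction loop expects
    [PSCustomObject] @{
        Offset           = [Int32] $_.Offset
        Length           = [Int32] $_.Length
        NormalizeIfEmpty = [Boolean]::Parse($_.NormalizeIfEmpty)
    }
}

$fieldDescriptors | Format-Table -Property Offset, Length, NormalizeIfEmpty
```

The conversion runs once before the read loop, so its cost should not matter for large files.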
    

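I had assumed that direct string extraction would be faster than regex, but I don't know that to be $true in general, let alone as it pertains to PowerShell. Only testing would reveal that.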
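
A rough way to run such a test is Measure-Command. The sketch below is a hypothetical micro-benchmark, not a definitive measurement. The iteration count and the in-memory sample record are assumptions, and results on a real multi-gigabyte file will also depend on I/O.

```powershell
# Assumption: one 80-character record in the sample layout, processed repeatedly
$line = -join @(
    '10000000002'.PadRight(11)
    'PLUTO'.PadRight(24)
    ''.PadRight(19)
    'COLUMN VALUE LONG'.PadRight(17)
    'STARTS'.PadRight(9)
)
$iterations = 20000
$regex = '^(.{11})(.{24})(.{19})(.{17})(.{9})'

# Time the -match approach
$regexTime = Measure-Command -Expression {
    for ($i = 0; $i -lt $iterations; $i++) {
        if ($line -match $regex) {
            $fields = $Matches[1..5].Trim()
        }
    }
}

# Time the Substring() approach on the same record
$substringTime = Measure-Command -Expression {
    for ($i = 0; $i -lt $iterations; $i++) {
        $fields = @(
            $line.Substring( 0, 11).Trim()
            $line.Substring(11, 24).Trim()
            $line.Substring(35, 19).Trim()
            $line.Substring(54, 17).Trim()
            $line.Substring(71,  9).Trim()
        )
    }
}

'Regex:     {0:N0} ms' -f $regexTime.TotalMilliseconds
'Substring: {0:N0} ms' -f $substringTime.TotalMilliseconds
```

Running each variant several times and comparing the spread gives a more trustworthy picture than a single run.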

All of the above solutions produce the following output...

    201912,ID,FIRST_NAME,LAST_NAME,COLUMN_NM_TOO_LON,5THCOLUMN
    201912,10000000001,MINNIE,MOUSE,COLUMN VALUE LONG,STARTS
    201912,10000000002,PLUTO,0,COLUMN VALUE LONG,STARTS
    

Regarding "regex - Parsing a fixed-length field file with PowerShell and Regex: how do I replace empty capture groups with zero?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/59292588/
