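regex - Parsing a fixed-length field file with PowerShell and Regex: how do I replace empty capture groups with zero?

Tags: regex performance powershell parsing

I'm using a PowerShell script with a regex to convert huge (> 1 GB) fixed-length-field text files into importable tab-delimited files. The code is very fast. I need to change some of the captured fields (say the 4th, 6th, and 7th) to 0 if they are empty after trimming. Is there a super-fast way to do this, ideally as part of the regex capture, without slowing the process down significantly?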


10000000001MINNIE                  MOUSE              COLUMN VALUE LONGSTARTS 


$proc_yyyymm = '201912'
$match_regex = '^(.{10})(.{10})(.{30})(.{30})(.{30})(.{4})(.{8})(.{10})(.{1})(.{15})(.{12})'

while ($line = $stream_in.ReadLine()) {

   if ($line -match $match_regex) {
      $new_line = "$proc_yyyymm`t" + ($Matches[1..($Matches.Count-1)].Trim() -join "`t")
      $stream_out.WriteLine($new_line)
   }
}



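  • Truncated the regex to match the sample data
  • Changed the output delimiter (now $delimiter) to , so the results are easy to see
  • Used StringReader and StringWriter for input and output, respectively

  • Given...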

    $text = @'
    ID         FIRST_NAME              LAST_NAME          COLUMN_NM_TOO_LON5THCOLUMN
    10000000001MINNIE                  MOUSE              COLUMN VALUE LONGSTARTS   
    10000000002PLUTO                                      COLUMN VALUE LONGSTARTS   
    '@


    $proc_yyyymm = '201912'
    $match_regex = '^(.{11})(.{24})(.{19})(.{17})(.{9})'
    $delimiter = ','
    # Zero-based positions within $trimmedMatches, so 2 is the third field
    $indicesToNormalizeToZero = ,2
    $stream_in = New-Object -TypeName 'System.IO.StringReader' -ArgumentList $text
    $stream_out = New-Object -TypeName 'System.IO.StringWriter'
    while ($line = $stream_in.ReadLine()) {
        if ($line -match $match_regex) {
            $trimmedMatches = $Matches[1..($Matches.Count-1)].Trim()
            foreach ($index in $indicesToNormalizeToZero) {
                if ($trimmedMatches[$index] -eq '') {
                    $trimmedMatches[$index] = '0'
                }
            }
            $new_line = "$proc_yyyymm$delimiter" + ($trimmedMatches -join $delimiter)
            $stream_out.WriteLine($new_line)
        }
    }

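    An alternative is to use the [Regex]::Replace() method. This is useful when you need to perform a custom transformation on a match that cannot be expressed as a regex substitution. Admittedly, it may be a poor fit here because you are matching an entire line rather than individual fields, so within the match you need to know which field is which.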

    $proc_yyyymm = '201912'
    $match_regex = [Regex] '^(.{11})(.{24})(.{19})(.{17})(.{9})'
    $match_evaluator = {
        param($match)
        # The first element of Groups contains the entire matched text; skip it
        $fields = $match.Groups `
            | Select-Object -Skip 1 `
            | ForEach-Object -Process {
                $field = $_.Value.Trim()
                if ($groupsToNormalizeToZero -contains $_.Name -and $field -eq '') {
                    $field = '0'
                }
                return $field
            }
        return "$proc_yyyymm$delimiter" + ($fields -join $delimiter)
    }
    $delimiter = ','
    # Replace with a HashSet/Hashtable for better lookup performance
    $groupsToNormalizeToZero = ,'3'
    $stream_in = New-Object -TypeName 'System.IO.StringReader' -ArgumentList $text
    $stream_out = New-Object -TypeName 'System.IO.StringWriter'
    while ($line = $stream_in.ReadLine()) {
        $new_line = $match_regex.Replace($line, $match_evaluator)
        # The original input string is returned if there was no match
        if (-not [Object]::ReferenceEquals($line, $new_line)) {
            $stream_out.WriteLine($new_line)
        }
    }
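    $match_evaluator is a MatchEvaluator delegate that Replace() calls for each successful match found in the input text; it returns whatever you want the replacement text to be. Inside it, I'm doing the same kind of index-specific transformation, comparing the group name (its index as a [String]) against a known list ($groupsToNormalizeToZero). You could use named groups instead, although I found that this changes the order of $match.Groups. There may be a better application of [Regex]::Replace() here, but none comes to mind at the moment.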
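The code comment above suggests swapping the lookup list for a HashSet. A minimal sketch of that variant follows, assuming the same five-field layout; the sample line is built with PadRight() purely for illustration, and the explicit [MatchEvaluator] cast is a precaution rather than something the original code uses.

```powershell
# Hypothetical sample line; PadRight() makes the fixed field widths explicit
$line = '10000000002'.PadRight(11) + 'PLUTO'.PadRight(24) + ''.PadRight(19) +
    'COLUMN VALUE LONG'.PadRight(17) + 'STARTS'.PadRight(9)

$proc_yyyymm = '201912'
$delimiter = ','
$match_regex = [Regex] '^(.{11})(.{24})(.{19})(.{17})(.{9})'

# HashSet[String].Contains() is a hash lookup rather than a linear -contains scan
$groupsToNormalizeToZero = New-Object -TypeName 'System.Collections.Generic.HashSet[String]'
[void] $groupsToNormalizeToZero.Add('3')

$match_evaluator = [System.Text.RegularExpressions.MatchEvaluator] {
    param($match)
    # Groups[0] is the whole match, so start at 1
    $fields = for ($i = 1; $i -lt $match.Groups.Count; $i++) {
        $field = $match.Groups[$i].Value.Trim()
        if ($field -eq '' -and $groupsToNormalizeToZero.Contains([String] $i)) {
            $field = '0'
        }
        $field
    }
    "$proc_yyyymm$delimiter" + ($fields -join $delimiter)
}

$new_line = $match_regex.Replace($line, $match_evaluator)
$new_line   # 201912,10000000002,PLUTO,0,COLUMN VALUE LONG,STARTS
```

With only one or two entries the difference between an array and a HashSet is probably negligible; it would matter more if many groups needed normalizing.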

    As an alternative to using a regex, since the field lengths are known, you can extract the fields directly from $line with the Substring() method.

    $proc_yyyymm = '201912'
    $delimiter = ','
    $stream_in = New-Object -TypeName 'System.IO.StringReader' -ArgumentList $text
    $stream_out = New-Object -TypeName 'System.IO.StringWriter'
    while ($line = $stream_in.ReadLine()) {
        # Substring(startIndex, length): each start is the sum of the preceding field lengths
        $id =                $line.Substring( 0, 11).Trim()
        $firstName =         $line.Substring(11, 24).Trim()
        $lastName =          $line.Substring(35, 19).Trim()
        $columnNameTooLong = $line.Substring(54, 17).Trim()
        $fifthColumn =       $line.Substring(71,  9).Trim()
        if ($lastName -eq '') {
            $lastName = '0'
        }
        $new_line = $proc_yyyymm,$id,$firstName,$lastName,$columnNameTooLong,$fifthColumn -join $delimiter
        $stream_out.WriteLine($new_line)
    }


    function ExtractField($chars, $startIndex, $length, $normalizeIfFirstCharWhitespace = $false) {
        # If the first character of a field is whitespace, assume the
        # entire field is as well to avoid a String allocation and Trim()
        if ($normalizeIfFirstCharWhitespace -and [Char]::IsWhiteSpace($chars[$startIndex])) {
            return '0'
        } else {
            # Create a String from the span of Chars at known boundaries and trim it
            return (New-Object -TypeName 'String' -ArgumentList ($chars, $startIndex, $length)).Trim()
        }
    }
    $proc_yyyymm = '201912'
    $delimiter = ','
    $stream_in = New-Object -TypeName 'System.IO.StringReader' -ArgumentList $text
    $stream_out = New-Object -TypeName 'System.IO.StringWriter'
    $lineLength = 82 # Assumes the last line ends with an \r\n and not EOF
    $lineChars = New-Object -TypeName 'Char[]' -ArgumentList $lineLength
    while (($lastReadCount = $stream_in.ReadBlock($lineChars, 0, $lineLength)) -gt 0) {
        $id                = ExtractField $lineChars  0 11
        $firstName         = ExtractField $lineChars 11 24
        $lastName          = ExtractField $lineChars 35 19 $true
        $columnNameTooLong = ExtractField $lineChars 54 17
        $fifthColumn       = ExtractField $lineChars 71  9
        # Are all these method calls better or worse than a single WriteLine() and object allocation(s)?
        $stream_out.Write($proc_yyyymm)
        $stream_out.Write($delimiter)
        $stream_out.Write($id)
        $stream_out.Write($delimiter)
        $stream_out.Write($firstName)
        $stream_out.Write($delimiter)
        $stream_out.Write($lastName)
        $stream_out.Write($delimiter)
        $stream_out.Write($columnNameTooLong)
        $stream_out.Write($delimiter)
        $stream_out.WriteLine($fifthColumn)
    }
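The comment in the loop above asks whether many small Write() calls beat one joined WriteLine(). One way to find out on your own machine is a rough Measure-Command comparison like the sketch below. The field values and the iteration count are made up for illustration, and the timings will vary by PowerShell version and hardware, so treat the numbers as a hint rather than a verdict.

```powershell
# Hypothetical, already-extracted field values for one output line
$fields = '201912', '10000000002', 'PLUTO', '0', 'COLUMN VALUE LONG', 'STARTS'
$delimiter = ','
$iterations = 10000   # arbitrary workload size for the comparison

$writerA = New-Object -TypeName 'System.IO.StringWriter'
$timeA = Measure-Command -Expression {
    for ($n = 0; $n -lt $iterations; $n++) {
        # One Write() per field and delimiter, no intermediate joined String
        for ($i = 0; $i -lt $fields.Count; $i++) {
            if ($i -gt 0) { $writerA.Write($delimiter) }
            $writerA.Write($fields[$i])
        }
        $writerA.WriteLine()
    }
}

$writerB = New-Object -TypeName 'System.IO.StringWriter'
$timeB = Measure-Command -Expression {
    for ($n = 0; $n -lt $iterations; $n++) {
        # One joined String per line and a single WriteLine() call
        $writerB.WriteLine($fields -join $delimiter)
    }
}

'Per-field Write():  {0:N0} ms' -f $timeA.TotalMilliseconds
'Joined WriteLine(): {0:N0} ms' -f $timeB.TotalMilliseconds
```

Both loops produce identical text, so whichever is faster can be swapped in without changing the output.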

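    Since @HAL9256's answer confirms that PowerShell functions are very slow, a way to do the same thing without redundant code and without functions is to define a collection of field descriptors and loop over it to extract each field from the appropriate offset...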

    $proc_yyyymm = '201912'
    $delimiter = ','
    $stream_in = New-Object -TypeName 'System.IO.StringReader' -ArgumentList $text
    $stream_out = New-Object -TypeName 'System.IO.StringWriter'
    $lineLength = 82 # Assumes the last line ends with an \r\n and not EOF
    $lineChars = New-Object -TypeName 'Char[]' -ArgumentList $lineLength
    # This could also be done with 'Offset,Length,NormalizeIfEmpty' | ConvertFrom-Csv
    # The Offset property could be omitted in favor of calculating it in the loop
    # based on the Length, however this way A) avoids the extra variable/addition,
    # B) allows fields to be ignored if desired, and C) allows fields to be output
    # in a different order than the input.
    $fieldDescriptors = @(
        @{ Offset =  0; Length = 11; NormalizeIfEmpty = $false },
        @{ Offset = 11; Length = 24; NormalizeIfEmpty = $false },
        @{ Offset = 35; Length = 19; NormalizeIfEmpty = $true  },
        @{ Offset = 54; Length = 17; NormalizeIfEmpty = $false },
        @{ Offset = 71; Length =  9; NormalizeIfEmpty = $false }
    ) | ForEach-Object -Process { [PSCustomObject] $_ }
    while (($lastReadCount = $stream_in.ReadBlock($lineChars, 0, $lineLength)) -gt 0) {
        $stream_out.Write($proc_yyyymm)
        foreach ($fieldDescriptor in $fieldDescriptors) {
            # If the first character of a field is whitespace, assume the
            # entire field is as well to avoid a String allocation and Trim()
            # If space is the only possible whitespace character,
            # $lineChars[$fieldDescriptor.Offset] -eq [Char] ' ' may be faster than IsWhiteSpace()
            $fieldText = if ($fieldDescriptor.NormalizeIfEmpty `
                -and [Char]::IsWhiteSpace($lineChars[$fieldDescriptor.Offset])
            ) {
                '0'
            } else {
                # Create a String from the span of Chars at known boundaries and trim it
                (
                    New-Object -TypeName 'String' -ArgumentList (
                        $lineChars, $fieldDescriptor.Offset, $fieldDescriptor.Length
                    )
                ).Trim()
            }
            $stream_out.Write($delimiter)
            $stream_out.Write($fieldText)
        }
        $stream_out.WriteLine()
    }

    201912,10000000002,PLUTO,0,COLUMN VALUE LONG,STARTS
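The comment in the field-descriptor code mentions ConvertFrom-Csv as another way to build the descriptors. The sketch below shows one way that might look, under the same five-field layout. ConvertFrom-Csv returns every property as a String, so the values are converted explicitly here; otherwise the text 'false' would count as true in an if test. The sample line and the use of Substring() instead of ReadBlock() are simplifications for illustration.

```powershell
# Descriptor table as CSV text; the first string is the header row
$fieldDescriptors = @(
    'Offset,Length,NormalizeIfEmpty',
    '0,11,false',
    '11,24,false',
    '35,19,true',
    '54,17,false',
    '71,9,false'
) | ConvertFrom-Csv | ForEach-Object -Process {
    # Convert the String properties to the types the loop expects
    [PSCustomObject] @{
        Offset           = [Int32] $_.Offset
        Length           = [Int32] $_.Length
        NormalizeIfEmpty = [Boolean]::Parse($_.NormalizeIfEmpty)
    }
}

# Hypothetical sample line; PadRight() makes the fixed field widths explicit
$line = '10000000002'.PadRight(11) + 'PLUTO'.PadRight(24) + ''.PadRight(19) +
    'COLUMN VALUE LONG'.PadRight(17) + 'STARTS'.PadRight(9)

$fields = foreach ($fieldDescriptor in $fieldDescriptors) {
    $fieldText = $line.Substring($fieldDescriptor.Offset, $fieldDescriptor.Length).Trim()
    if ($fieldDescriptor.NormalizeIfEmpty -and $fieldText -eq '') { '0' } else { $fieldText }
}

$new_line = '201912,' + ($fields -join ',')
$new_line   # 201912,10000000002,PLUTO,0,COLUMN VALUE LONG,STARTS
```

Keeping the layout as CSV text makes it easy to move the table into a separate file later and load it with Import-Csv.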

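    Regarding "regex - Parsing a fixed-length field file with PowerShell and Regex: how do I replace empty capture groups with zero?", a similar question can be found on Stack Overflow.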
