regex - Is there a way to optimize my PowerShell function for removing pattern matches from a large file?

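I have a large text file (~20k lines, ~80 characters each). I also have a largish array (~1500 items) of objects containing patterns I wish to remove from the large text file. Note that if a pattern from the array appears on a line of the input file, I want to remove the entire line, not just the pattern.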

The input file is CSV-ish, with lines similar to:

A;AAA-BBB;XXX;XX000029;WORD;WORD-WORD-1;00001;STRING;2015-07-01;;010;

The patterns in the array that I search for in each line of the input file look like the

XX000029

part of the line above.

My somewhat naive function to achieve this currently looks like this:

function Remove-IdsFromFile {
  param(
    [Parameter(Mandatory=$true,Position=0)]
    [string]$BigFile,
    [Parameter(Mandatory=$true,Position=1)]
    [Object[]]$IgnorePatterns
  )

  try{
    $FileContent = Get-Content $BigFile
  }catch{
    Write-Error $_
  }

  $IgnorePatterns | ForEach-Object {
    $IgnoreId = $_.IgnoreId
    $FileContent = $FileContent | Where-Object { $_ -notmatch $IgnoreId }
    Write-Host $FileContent.count
  }
  $FileContent | Set-Content "CleansedBigFile.txt"
}

This works, but it is slow.

How can I make it faster?

Best answer

function Remove-IdsFromFile {
    param(
        [Parameter(Mandatory=$true,Position=0)]
        [string]$BigFile,
        [Parameter(Mandatory=$true,Position=1)]
        [Object[]]$IgnorePatterns
    )

    # Create the pattern matches
    $regex = ($IgnorePatterns | ForEach-Object{[regex]::Escape($_)}) -join "|"

    If(Test-Path $BigFile){
        $reader = New-Object System.IO.StreamReader($BigFile)

        $line=$reader.ReadLine()
        while ($line -ne $null)
        {
            # Check if the line should be output to file
            If($line -notmatch $regex){$line | Add-Content "CleansedBigFile.txt"}

            # Attempt to read the next line.
            $line=$reader.ReadLine()
        }

        $reader.Close()

    } Else {
        Write-Error "Cannot locate: $BigFile"
    }
}

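StreamReader is one of the preferred ways to read large text files. We also join the patterns into a single regex string to match against. For that pattern string we use [regex]::Escape() as a precaution in case any regex control characters are present. I have to guess here, since we only see one sample pattern.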

If $IgnorePatterns can easily be cast to strings, this should work fine. A small sample of what $regex would look like:

XX000029|XX000028|XX000027
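In the question, each array item is an object whose ID is read via $_.IgnoreId, so passing the whole object to [regex]::Escape() may not produce the ID string. A minimal sketch of building the same alternation from such objects follows; the sample objects are made up for illustration, and only the IgnoreId property name comes from the question's code.

```powershell
# Hypothetical sample objects; the IgnoreId property name is taken from the question.
$IgnorePatterns = @(
    [pscustomobject]@{ IgnoreId = 'XX000029' }
    [pscustomobject]@{ IgnoreId = 'XX000028' }
    [pscustomobject]@{ IgnoreId = 'XX000027' }
)

# Escape each ID string (not the wrapping object) and join into one alternation.
$regex = ($IgnorePatterns | ForEach-Object { [regex]::Escape($_.IgnoreId) }) -join '|'

$regex   # XX000029|XX000028|XX000027
```

Alternatively, the caller could pass an array of plain ID strings and leave the function body unchanged.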

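If $IgnorePatterns is populated from a database you may have less control over this, but since we are using regex you might be able to shrink the pattern set by actually using regex features (rather than one big alternation like my example above). For example, you could reduce it to XX00002[7-9].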
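Before swapping in a hand-reduced pattern, it may be worth checking that it agrees with the original alternation. The sketch below compares the two patterns on a few IDs; the two extra IDs (XX000026 and XX000030) are invented here purely as negative cases.

```powershell
# IDs to check: three that should match, plus two made-up ones that should not.
$ids = 'XX000027', 'XX000028', 'XX000029', 'XX000026', 'XX000030'

$alternation = 'XX000029|XX000028|XX000027'
$reduced     = 'XX00002[7-9]'

# Collect every ID on which the two patterns disagree.
$mismatches = @($ids | Where-Object { ($_ -match $alternation) -ne ($_ -match $reduced) })

$mismatches.Count   # 0 means both patterns agree on this sample
```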

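I don't know whether the regex itself will provide a performance boost with 1500 alternatives. The StreamReader is supposed to be the focus here. However, I did muddy the waters by using Add-Content for the output, which doesn't win any awards for speed either (a StreamWriter could be used in its place).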
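If the regex turns out to be the bottleneck, one alternative (a different technique from the one in this answer) is a hash set lookup on the ID field. This sketch assumes the ID always sits in the fourth semicolon-separated field, which the question's single sample line suggests but does not guarantee; unlike the regex, it will not catch an ID that appears elsewhere in the line. The sample lines and IDs are illustrative.

```powershell
# Case-insensitive set, to mirror the default case-insensitivity of -match.
$ignoreSet = [System.Collections.Generic.HashSet[string]]::new([System.StringComparer]::OrdinalIgnoreCase)
foreach ($id in 'XX000029', 'XX000028') { [void]$ignoreSet.Add($id) }

# Illustrative input lines in the question's format.
$lines = @(
    'A;AAA-BBB;XXX;XX000029;WORD;WORD-WORD-1;00001;STRING;2015-07-01;;010;'
    'A;AAA-BBB;XXX;XX000031;WORD;WORD-WORD-1;00001;STRING;2015-07-01;;010;'
)

# Assumption: the ID is the fourth field (index 3). Keep lines whose ID is not in the set.
$kept = @($lines | Where-Object { -not $ignoreSet.Contains($_.Split(';')[3]) })

$kept.Count   # 1
```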

Reader and Writer

I still need to test this to be sure it works, but it uses only a StreamReader and a StreamWriter. If it does work better, I will replace the code above with it.

function Remove-IdsFromFile {
    param(
        [Parameter(Mandatory=$true,Position=0)]
        [string]$BigFile,
        [Parameter(Mandatory=$true,Position=1)]
        [Object[]]$IgnorePatterns
    )

    # Create the pattern matches
    $regex = ($IgnorePatterns | ForEach-Object{[regex]::Escape($_)}) -join "|"

    If(Test-Path $BigFile){
        # Prepare the StreamReader
        $reader = New-Object System.IO.StreamReader($BigFile)

        #Prepare the StreamWriter
        $writer = New-Object System.IO.StreamWriter("CleansedBigFile.txt")

        $line=$reader.ReadLine()
        while ($line -ne $null)
        {
            # Check if the line should be output to file
            If($line -notmatch $regex){$writer.WriteLine($line)}

            # Attempt to read the next line. 
            $line=$reader.ReadLine()
        }

        # Don't cross the streams!
        $reader.Close()
        $writer.Close()

    } Else {
        Write-Error "Cannot locate: $BigFile"
    }
}

You may want to add some error handling around the streams, but it does appear to work.
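One possible form of that error handling is a try/finally that disposes both streams even if reading or writing throws. The sketch below is not part of the original answer: the function name Copy-UnmatchedLines and its parameters are hypothetical. It also resolves the paths first, because .NET resolves relative paths against the process working directory, which can differ from the current PowerShell location.

```powershell
# Hypothetical helper; a sketch of the same read/filter/write loop with guaranteed cleanup.
function Copy-UnmatchedLines {
    param(
        [Parameter(Mandatory=$true)][string]$InputPath,
        [Parameter(Mandatory=$true)][string]$OutputPath,
        [Parameter(Mandatory=$true)][string]$Regex
    )

    # Resolve to full file system paths before handing them to .NET.
    $inFull  = (Resolve-Path -LiteralPath $InputPath -ErrorAction Stop).ProviderPath
    $outFull = $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($OutputPath)

    $reader = $null
    $writer = $null
    try {
        $reader = New-Object System.IO.StreamReader($inFull)
        $writer = New-Object System.IO.StreamWriter($outFull)

        # ReadLine() returns $null at end of file.
        while ($null -ne ($line = $reader.ReadLine())) {
            if ($line -notmatch $Regex) { $writer.WriteLine($line) }
        }
    }
    finally {
        # Runs on success and on failure, so file handles are always released.
        if ($writer) { $writer.Dispose() }
        if ($reader) { $reader.Dispose() }
    }
}
```

It could be called as Copy-UnmatchedLines -InputPath $BigFile -OutputPath "CleansedBigFile.txt" -Regex $regex, with $regex built as in the functions above.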

Source: the original question on Stack Overflow: https://stackoverflow.com/questions/31674667/
