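I am using a PowerShell script with a regex to convert huge (> 1 GB) fixed-field-length text files into importable tab-delimited files. The code is very fast. I need to change some of the captured fields (say the 4th, 6th, and 7th) to 0 if they are empty after trimming. Is there a very fast way to do this as part of the regex capture, without slowing the process down much?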
DATA
ID         FIRST_NAME              LAST_NAME          COLUMN_NM_TOO_LON5THCOLUMN
10000000001MINNIE                  MOUSE              COLUMN VALUE LONGSTARTS   
PROGRAM
$proc_yyyymm = '201912'
$match_data_regex = '^(.{10})(.{10})(.{30})(.{30})(.{30})(.{4})(.{8})(.{10})(.{1})(.{15})(.{12})'
while ($line = $stream_in.ReadLine()) {
    if ($line -match $match_data_regex) {
        $new_line = "$proc_yyyymm`t" + ($Matches[1..($Matches.Count-1)].Trim() -join "`t")
        $stream_out.WriteLine($new_line)
    }
}
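Best answer
After making a few adjustments to your code for demonstration purposes (the delimiter, stored in $delimiter, is changed to , so that the results are easy to see, and StringReader and StringWriter are used for input and output, respectively), and given the following fixed-width input...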
$text = @'
ID         FIRST_NAME              LAST_NAME          COLUMN_NM_TOO_LON5THCOLUMN
10000000001MINNIE                  MOUSE              COLUMN VALUE LONGSTARTS   
10000000002PLUTO                                      COLUMN VALUE LONGSTARTS   
'@
...your suggested approach of adjusting the matched text at specific indices would look like this...
$proc_yyyymm = '201912'
$match_regex = '^(.{11})(.{24})(.{19})(.{17})(.{9})'
$delimiter = ','
$indicesToNormalizeToZero = ,2
$stream_in = New-Object -TypeName 'System.IO.StringReader' -ArgumentList $text
$stream_out = New-Object -TypeName 'System.IO.StringWriter'
while ($line = $stream_in.ReadLine()) {
    if ($line -match $match_regex) {
        $trimmedMatches = $Matches[1..($Matches.Count-1)].Trim()
        foreach ($index in $indicesToNormalizeToZero)
        {
            if ($trimmedMatches[$index] -eq '')
            {
                $trimmedMatches[$index] = '0'
            }
        }
        $new_line = "$proc_yyyymm$delimiter" + ($trimmedMatches -join $delimiter)
        $stream_out.WriteLine($new_line)
    }
}
$stream_out.ToString()
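The demonstration above reads from a StringReader. Against a real file, the same loop could run over a StreamReader and StreamWriter instead. The following is only a sketch under stated assumptions: $inputPath and $outputPath are hypothetical names (temporary files are created so the snippet runs on its own), the two sample records are made up to fit the 11/24/19/17/9 layout, and the loop condition compares against $null because an empty line would otherwise end a `while ($line = ...)` loop early.

```powershell
# Hypothetical paths; temporary files keep the sketch self-contained
$inputPath  = [System.IO.Path]::GetTempFileName()
$outputPath = [System.IO.Path]::GetTempFileName()

# Two made-up records laid out as 11/24/19/17/9-character fields
$layout = '{0,-11}{1,-24}{2,-19}{3,-17}{4,-9}'
$sampleLines = @(
    ($layout -f '10000000001', 'MINNIE', 'MOUSE', 'COLUMN VALUE LONG', 'STARTS'),
    ($layout -f '10000000002', 'PLUTO',  '',      'COLUMN VALUE LONG', 'STARTS')
)
[System.IO.File]::WriteAllLines($inputPath, $sampleLines)

$proc_yyyymm = '201912'
$match_regex = '^(.{11})(.{24})(.{19})(.{17})(.{9})'
$delimiter = "`t"
$indicesToNormalizeToZero = ,2

$stream_in  = New-Object -TypeName 'System.IO.StreamReader' -ArgumentList $inputPath
$stream_out = New-Object -TypeName 'System.IO.StreamWriter' -ArgumentList $outputPath
try {
    # Comparing against $null keeps an empty line from ending the loop
    while ($null -ne ($line = $stream_in.ReadLine())) {
        if ($line -match $match_regex) {
            $trimmedMatches = $Matches[1..($Matches.Count-1)].Trim()
            foreach ($index in $indicesToNormalizeToZero) {
                if ($trimmedMatches[$index] -eq '') {
                    $trimmedMatches[$index] = '0'
                }
            }
            $stream_out.WriteLine("$proc_yyyymm$delimiter" + ($trimmedMatches -join $delimiter))
        }
    }
} finally {
    # Dispose flushes the writer and releases both file handles
    $stream_in.Dispose()
    $stream_out.Dispose()
}

$result = [System.IO.File]::ReadAllLines($outputPath)
Remove-Item -Path $inputPath, $outputPath
$result
```

The try/finally is a design choice for this sketch: if the loop throws partway through a large file, the writer is still flushed and closed.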
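An alternative is to use the [Regex]::Replace() method. This is useful when you need to perform a custom transformation on a match that cannot be expressed with a regex substitution. Admittedly, it may be a poor fit here: because you are matching the entire line rather than individual fields, you need to know inside the match which field is which.
$proc_yyyymm = '201912'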
$match_regex = [Regex] '^(.{11})(.{24})(.{19})(.{17})(.{9})'
$match_evaluator = {
    param($match)
    # The first element of Groups contains the entire matched text; skip it
    $fields = $match.Groups `
        | Select-Object -Skip 1 `
        | ForEach-Object -Process {
            $field = $_.Value.Trim()
            if ($groupsToNormalizeToZero -contains $_.Name -and $field -eq '')
            {
                $field = '0'
            }
            return $field
        }
    return "$proc_yyyymm$delimiter" + ($fields -join $delimiter)
}
$delimiter = ','
# Replace with a HashSet/Hashtable for better lookup performance
$groupsToNormalizeToZero = ,'3'
$stream_in = New-Object -TypeName 'System.IO.StringReader' -ArgumentList $text
$stream_out = New-Object -TypeName 'System.IO.StringWriter'
while ($line = $stream_in.ReadLine()) {
    $new_line = $match_regex.Replace($line, $match_evaluator)
    # The original input string is returned if there was no match
    if (-not [Object]::ReferenceEquals($line, $new_line)) {
        $stream_out.WriteLine($new_line)
    }
}
$stream_out.ToString()
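The comment in the code above suggests a HashSet or Hashtable in place of the array for faster lookups. Here is a minimal sketch of that substitution outside the evaluator. It assumes group names are compared as strings, as above, and that Group.Name is available in your .NET version; the sample line is made up for illustration.

```powershell
# A HashSet[String] offers hashed Contains() lookups,
# whereas -contains scans an array element by element
$groupsToNormalizeToZero = New-Object -TypeName 'System.Collections.Generic.HashSet[String]'
[Void] $groupsToNormalizeToZero.Add('3')

$match_regex = [Regex] '^(.{11})(.{24})(.{19})(.{17})(.{9})'
# Made-up record with an empty third field
$line = '{0,-11}{1,-24}{2,-19}{3,-17}{4,-9}' -f '10000000002', 'PLUTO', '', 'COLUMN VALUE LONG', 'STARTS'
$match = $match_regex.Match($line)

# Skip group 0 (the whole match), trim each field, and zero the listed empty ones
$fields = foreach ($group in ($match.Groups | Select-Object -Skip 1)) {
    $field = $group.Value.Trim()
    if ($groupsToNormalizeToZero.Contains($group.Name) -and $field -eq '') {
        $field = '0'
    }
    $field
}
$joined = $fields -join ','
$joined
```

With only a handful of groups the difference is likely negligible; the set matters more if many fields need normalizing.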
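$match_evaluator is a MatchEvaluator delegate that Replace() invokes for each successful match found in the input text, and it returns whatever you want the replacement text to be. Inside it, I am doing the same kind of index-specific transformation, comparing the group name (its index as a [String]) against a known list ($groupsToNormalizeToZero); you could use named groups instead, although I found that this changes the ordering of $match.Groups. There may be a better application of [Regex]::Replace() here, but it is not coming to me right now.
As an alternative to using a regex, since the field lengths are known, you can extract the fields directly from $line using the Substring() method.
$proc_yyyymm = '201912'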
$delimiter = ','
$stream_in = New-Object -TypeName 'System.IO.StringReader' -ArgumentList $text
$stream_out = New-Object -TypeName 'System.IO.StringWriter'
while ($line = $stream_in.ReadLine()) {
    $id = $line.Substring( 0, 11).Trim()
    $firstName = $line.Substring(11, 24).Trim()
    $lastName = $line.Substring(35, 19).Trim()
    $columnNameTooLong = $line.Substring(54, 17).Trim()
    $fifthColumn = $line.Substring(71, 9).Trim()
    if ($lastName -eq '')
    {
        $lastName = '0'
    }
    $new_line = $proc_yyyymm,$id,$firstName,$lastName,$columnNameTooLong,$fifthColumn -join $delimiter
    $stream_out.WriteLine($new_line)
}
$stream_out.ToString()
Better yet, since the length of each line is known, you can avoid ReadLine()'s newline checks and subsequent String allocation by reading each line as a block of Chars and extracting the fields from that.
function ExtractField($chars, $startIndex, $length, $normalizeIfFirstCharWhitespace = $false)
{
    # If the first character of a field is whitespace, assume the
    # entire field is as well to avoid a String allocation and Trim()
    if ($normalizeIfFirstCharWhitespace -and [Char]::IsWhiteSpace($chars[$startIndex])) {
        return '0'
    } else {
        # Create a String from the span of Chars at known boundaries and trim it
        return (New-Object -TypeName 'String' -ArgumentList ($chars, $startIndex, $length)).Trim()
    }
}
$proc_yyyymm = '201912'
$delimiter = ','
$stream_in = New-Object -TypeName 'System.IO.StringReader' -ArgumentList $text
$stream_out = New-Object -TypeName 'System.IO.StringWriter'
$lineLength = 82 # Assumes the last line ends with an \r\n and not EOF
$lineChars = New-Object -TypeName 'Char[]' -ArgumentList $lineLength
while (($lastReadCount = $stream_in.ReadBlock($lineChars, 0, $lineLength)) -gt 0)
{
    $id = ExtractField $lineChars 0 11
    $firstName = ExtractField $lineChars 11 24
    $lastName = ExtractField $lineChars 35 19 $true
    $columnNameTooLong = ExtractField $lineChars 54 17
    $fifthColumn = ExtractField $lineChars 71 9
    # Are all these method calls better or worse than a single WriteLine() and object allocation(s)?
    $stream_out.Write($proc_yyyymm)
    $stream_out.Write($delimiter)
    $stream_out.Write($id)
    $stream_out.Write($delimiter)
    $stream_out.Write($firstName)
    $stream_out.Write($delimiter)
    $stream_out.Write($lastName)
    $stream_out.Write($delimiter)
    $stream_out.Write($columnNameTooLong)
    $stream_out.Write($delimiter)
    $stream_out.WriteLine($fifthColumn)
}
$stream_out.ToString()
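Since @HAL9256's answer confirms that PowerShell functions are very slow, a way to do the same thing without redundant code and without functions is to define a collection of field descriptors and loop over it, extracting each field from the appropriate offset...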
$proc_yyyymm = '201912'
$delimiter = ','
$stream_in = New-Object -TypeName 'System.IO.StringReader' -ArgumentList $text
$stream_out = New-Object -TypeName 'System.IO.StringWriter'
$lineLength = 82 # Assumes the last line ends with an \r\n and not EOF
$lineChars = New-Object -TypeName 'Char[]' -ArgumentList $lineLength
# This could also be done with 'Offset,Length,NormalizeIfEmpty' | ConvertFrom-Csv
# The Offset property could be omitted in favor of calculating it in the loop
# based on the Length, however this way A) avoids the extra variable/addition,
# B) allows fields to be ignored if desired, and C) allows fields to be output
# in a different order than the input.
$fieldDescriptors = @(
    @{ Offset = 0; Length = 11; NormalizeIfEmpty = $false },
    @{ Offset = 11; Length = 24; NormalizeIfEmpty = $false },
    @{ Offset = 35; Length = 19; NormalizeIfEmpty = $true },
    @{ Offset = 54; Length = 17; NormalizeIfEmpty = $false },
    @{ Offset = 71; Length = 9; NormalizeIfEmpty = $false }
) | ForEach-Object -Process { [PSCustomObject] $_ }
while (($lastReadCount = $stream_in.ReadBlock($lineChars, 0, $lineLength)) -gt 0)
{
    $stream_out.Write($proc_yyyymm)
    foreach ($fieldDescriptor in $fieldDescriptors)
    {
        # If the first character of a field is whitespace, assume the
        # entire field is as well to avoid a String allocation and Trim()
        # If space is the only possible whitespace character,
        # $lineChars[$fieldDescriptor.Offset] -eq [Char] ' ' may be faster than IsWhiteSpace()
        $fieldText = if ($fieldDescriptor.NormalizeIfEmpty `
            -and [Char]::IsWhiteSpace($lineChars[$fieldDescriptor.Offset])
        ) {
            '0'
        } else {
            # Create a String from the span of Chars at known boundaries and trim it
            (
                New-Object -TypeName 'String' -ArgumentList (
                    $lineChars, $fieldDescriptor.Offset, $fieldDescriptor.Length
                )
            ).Trim()
        }
        $stream_out.Write($delimiter)
        $stream_out.Write($fieldText)
    }
    $stream_out.WriteLine()
}
$stream_out.ToString()
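The comments in the code above mention that the Offset property could be calculated from the lengths instead of being written out by hand. A minimal sketch of that calculation follows, assuming the fields are contiguous and listed in input order; $fieldLengths and $normalizeIndices are hypothetical names introduced only for this illustration.

```powershell
# Field widths in input order (the same layout as the examples above)
$fieldLengths = 11, 24, 19, 17, 9
# Zero-based positions of the fields to normalize to '0' when empty
$normalizeIndices = ,2

$offset = 0
$fieldDescriptors = for ($i = 0; $i -lt $fieldLengths.Count; $i++) {
    # Each descriptor starts where the previous field ended
    [PSCustomObject] @{
        Offset           = $offset
        Length           = $fieldLengths[$i]
        NormalizeIfEmpty = $normalizeIndices -contains $i
    }
    $offset += $fieldLengths[$i]
}

# After the loop, $offset holds the record width without the line terminator
$fieldDescriptors
```

This trades the flexibility noted in the comments (skipping or reordering fields) for less chance of a typo in a hand-written offset.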
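I would assume that direct string extraction is faster than a regex, but I don't know that to be $true in general, let alone as it pertains to PowerShell; only testing would reveal that.
All of the above solutions produce the following output...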
201912,ID,FIRST_NAME,LAST_NAME,COLUMN_NM_TOO_LON,5THCOLUMN
201912,10000000001,MINNIE,MOUSE,COLUMN VALUE LONG,STARTS
201912,10000000002,PLUTO,0,COLUMN VALUE LONG,STARTS
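Since only testing can settle which approach is faster, here is a rough timing sketch using Stopwatch. The assumptions are significant: the 100,000-line input is synthetic, built in memory from one made-up record, and only the -match and Substring() variants are compared. Timings vary by machine and PowerShell version, so no results are claimed here.

```powershell
# Synthetic input: one made-up record repeated 100,000 times
$line = '{0,-11}{1,-24}{2,-19}{3,-17}{4,-9}' -f '10000000002', 'PLUTO', '', 'COLUMN VALUE LONG', 'STARTS'
$lines = (,$line) * 100000
$match_regex = '^(.{11})(.{24})(.{19})(.{17})(.{9})'

# Variant 1: regex capture, then trim and normalize the third field
$stopwatch = [System.Diagnostics.Stopwatch]::StartNew()
$regexCount = 0
foreach ($current in $lines) {
    if ($current -match $match_regex) {
        $fields = $Matches[1..($Matches.Count-1)].Trim()
        if ($fields[2] -eq '') { $fields[2] = '0' }
        $regexCount++
    }
}
$regexElapsed = $stopwatch.Elapsed

# Variant 2: Substring() at known offsets
$stopwatch.Restart()
$substringCount = 0
foreach ($current in $lines) {
    $id = $current.Substring(0, 11).Trim()
    $firstName = $current.Substring(11, 24).Trim()
    $lastName = $current.Substring(35, 19).Trim()
    $columnNameTooLong = $current.Substring(54, 17).Trim()
    $fifthColumn = $current.Substring(71, 9).Trim()
    if ($lastName -eq '') { $lastName = '0' }
    $substringCount++
}
$substringElapsed = $stopwatch.Elapsed

'Regex:     {0:N0} ms' -f $regexElapsed.TotalMilliseconds
'Substring: {0:N0} ms' -f $substringElapsed.TotalMilliseconds
```

For a meaningful comparison, run it several times and against a sample of the real file, since disk I/O may dominate the parsing cost.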
This question and answer, on using PowerShell and a regex to parse a fixed-length field file and replace empty capture groups with zero, comes from Stack Overflow: https://stackoverflow.com/questions/59292588/