c# - Replace column comma separators in a CSV file and handle fields with single quotes around the value

Tags: c# powershell

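A system generates a CSV file that I have no control over.

In two of the columns, a value may be enclosed in a pair of single quotes if the data itself contains commas.

Sample data (4 columns)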

123,'abc,def,ghf',ajajaj,1 
345,abdf,'abc,def,ghi',2
556,abdf,def,3
999,'a,b,d','d,e,f',4

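The result I want, using PowerShell:

The commas that are not part of the data, i.e. the ones that separate fields, are replaced with a specified delimiter (pipe-star, `|*`, in the case below). Commas between a pair of single quotes stay as commas.

Result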
123|*'abc,def,ghf'|*ajajaj|*1 
345|*abdf|*'abc,def,ghi'|*2
556|*abdf|*def|*3
999|*'a,b,d'|*'d,e,f'|*4

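If possible, I would like to do this with a regular expression in PowerShell or C# .NET, but I don't know how.

Best answer

Although I think this creates an oddly formatted CSV file, in PowerShell you can use switch together with the -Regex and -File parameters. This is probably the fastest way to process large files, and it takes only a few lines of code: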

# create a regex that finds commas unless they are inside single quotes
$commaUnlessQuoted = ",(?=([^']*'[^']*')*[^']*$)"

$result = switch -Regex -File 'D:\test.csv' {
    # the trailing -replace "'" also removes the single quotes, as requested in the comments
    default { $_ -replace $commaUnlessQuoted, '|*' -replace "'" }
}

# output to console
$result

# output to new (sort-of) CSV file
$result | Set-Content -Path 'D:\testoutput.csv'
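
To try the pattern without touching any files, the rows from the question can be held in an array, since -replace works on each element of an array. This is only a sketch: the variable names $sample, $keepQuotes and $stripQuotes are made up for illustration. It also shows that leaving off the trailing -replace "'" keeps the single quotes, which matches the result originally asked for.

```powershell
# Same pattern as above: a comma followed by an even number of single quotes up to the end of the line
$commaUnlessQuoted = ",(?=([^']*'[^']*')*[^']*$)"

# The four sample rows from the question, held in memory instead of read from a file
$sample = @(
    "123,'abc,def,ghf',ajajaj,1"
    "345,abdf,'abc,def,ghi',2"
    "556,abdf,def,3"
    "999,'a,b,d','d,e,f',4"
)

# Variant 1: replace only the separators and keep the single quotes
$keepQuotes = $sample -replace $commaUnlessQuoted, '|*'

# Variant 2: replace the separators, then strip the single quotes
$stripQuotes = $sample -replace $commaUnlessQuoted, '|*' -replace "'"

$keepQuotes
$stripQuotes
```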

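Update

As mklement0 pointed out, the code above gets the job done, but it builds the complete updated data as an array in memory before anything is written to the output file.
If that is a problem (the file is too large to fit in available memory), you can change the code to read and replace one line at a time and write that line to the output file immediately.

The next approach uses almost no memory, but of course performs more write operations on disk.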
# make sure this is an absolute path for .NET
$outputFile = 'D:\output.csv'
$inputFile  = 'D:\input.csv'

# create a regex that finds commas unless they are inside single quotes
$commaUnlessQuoted = ",(?=([^']*'[^']*')*[^']*$)"

# create a StreamWriter object. Uses UTF8Encoding without BOM (Byte Order Mark) by default.
# if you need a different encoding for the output file, use for instance
# $writer = [System.IO.StreamWriter]::new($outputFile, $false, [System.Text.Encoding]::Unicode)
$writer = [System.IO.StreamWriter]::new($outputFile)
try {
    switch -Regex -File $inputFile {
        default {
            # the trailing -replace "'" also removes the single quotes, as requested in the comments
            $line = $_ -replace $commaUnlessQuoted, '|*' -replace "'"
            $writer.WriteLine($line)
            # if you want, uncomment the next line to show on console
            # $line
        }
    }
}
finally {
    # flush and close the output file, even if an error occurred above
    $writer.Dispose()
}
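
A pipeline-based variant of the same idea is sketched below. It is not part of the original answer: it wraps the switch in a script block so that each converted line is passed on to Set-Content as it is produced, rather than being collected in a variable first. The temp-folder paths and the two demo rows are assumptions made to keep the sketch self-contained. Note that the default encoding of Set-Content differs between Windows PowerShell and PowerShell 7, so add -Encoding if that matters for your file.

```powershell
# Hypothetical paths for illustration; a temp folder keeps the sketch self-contained
$inputFile  = Join-Path ([System.IO.Path]::GetTempPath()) 'input-demo.csv'
$outputFile = Join-Path ([System.IO.Path]::GetTempPath()) 'output-demo.csv'

# Write two of the sample rows so there is something to convert
Set-Content -Path $inputFile -Value "123,'abc,def,ghf',ajajaj,1", "556,abdf,def,3"

$commaUnlessQuoted = ",(?=([^']*'[^']*')*[^']*$)"

# Invoking the switch inside a script block sends its output down the pipeline,
# so Set-Content receives the converted lines one at a time
& {
    switch -Regex -File $inputFile {
        default { $_ -replace $commaUnlessQuoted, '|*' -replace "'" }
    }
} | Set-Content -Path $outputFile

# Show what was written
Get-Content -Path $outputFile
```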

Result:

123|*abc,def,ghf|*ajajaj|*1 
345|*abdf|*abc,def,ghi|*2
556|*abdf|*def|*3
999|*a,b,d|*d,e,f|*4


Regex details:
,                 Match the character “,” literally
(?=               Assert that the regex below can be matched, starting at this position (positive lookahead)
   (              Match the regular expression below and capture its match into backreference number 1
      [^']        Match any character that is NOT a “'”
         *        Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
      '           Match the character “'” literally
      [^']        Match any character that is NOT a “'”
         *        Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
      '           Match the character “'” literally
   )*             Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
   [^']           Match any character that is NOT a “'”
      *           Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
   $              Assert position at the end of the string (or before the line break at the end of the string, if any)
)
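
In other words, the lookahead only accepts a comma when an even number of single quotes follows it on the rest of the line. That relies on every quoted field being properly closed. The sketch below uses a made-up row containing an apostrophe (O'Brien is not from the question's data) to show what happens when that assumption fails, plus one possible guard that counts the quotes before converting a line.

```powershell
$commaUnlessQuoted = ",(?=([^']*'[^']*')*[^']*$)"

# Hypothetical row: the apostrophe in O'Brien is an unpaired single quote
$row = "123,O'Brien,abc,4"

# The comma before the apostrophe has an odd number of quotes ahead of it,
# so the lookahead fails there and that comma is left untouched
$converted = $row -replace $commaUnlessQuoted, '|*'
$converted    # 123,O'Brien|*abc|*4

# A possible guard: rows with an odd number of single quotes cannot be paired up
$quoteCount = [regex]::Matches($row, "'").Count
$isBalanced = ($quoteCount % 2) -eq 0
if (-not $isBalanced) {
    Write-Warning "Unbalanced single quotes, row needs manual review: $row"
}
```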

Regarding this question, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/60006585/
