c# - Replace the comma column separators in a CSV file while handling fields wrapped in single quotes

Tags: c# powershell

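A system generates a CSV file whose format I have no control over.

If the data itself contains a comma, the value is wrapped in a pair of single quotes. This can happen in two of the columns.

Sample data (4 columns)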

123,'abc,def,ghf',ajajaj,1 
345,abdf,'abc,def,ghi',2
556,abdf,def,3
999,'a,b,d','d,e,f',4

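The result I want, using PowerShell:

The commas that are not part of the data, that is, the commas that separate the fields, are replaced with a specified delimiter (pipe-star, |*, in the example below). Commas between a pair of single quotes stay as commas.

Result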
123|*'abc,def,ghf'|*ajajaj|*1 
345|*abdf|*'abc,def,ghi'|*2
556|*abdf|*def|*3
999|*'a,b,d'|*'d,e,f'|*4

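If possible, I would like to do this with a regular expression in PowerShell or C# .NET, but I don't know how.

Best answer

Although I think this creates an oddly formatted CSV file, in PowerShell you can use switch together with the -Regex and -File parameters. This is probably the fastest way to handle large files, and it takes only a few lines of code: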

# create a regex that finds commas unless they are inside single quotes
$commaUnlessQuoted = ",(?=([^']*'[^']*')*[^']*$)"

$result = switch -Regex -File 'D:\test.csv' {
    # the trailing -replace "'" also removes the single quotes, as requested in the comments
    default { $_ -replace "$commaUnlessQuoted", '|*' -replace "'" }
}

# output to console
$result

# output to new (sort-of) CSV file
$result | Set-Content -Path 'D:\testoutput.csv'

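Update

As mklement0 pointed out, the code above does the job, but it collects all the updated lines in an in-memory array before anything is written to the output file.
If that is a problem (the file is too large to fit in the available memory), you can change the code to read a line, do the replacement, and write that line to the output file immediately.

The next approach uses hardly any memory, but of course performs more write operations on disk.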
# make sure these are absolute paths; .NET does not follow PowerShell's current location
$outputFile = 'D:\output.csv'
$inputFile  = 'D:\input.csv'

# create a regex that finds commas unless they are inside single quotes
$commaUnlessQuoted = ",(?=([^']*'[^']*')*[^']*$)"

# create a StreamWriter object. Uses UTF8Encoding without BOM (Byte Order Mark) by default.
# if you need a different encoding for the output file, use for instance
# $writer = [System.IO.StreamWriter]::new($outputFile, $false, [System.Text.Encoding]::Unicode)
$writer = [System.IO.StreamWriter]::new($outputFile)
switch -Regex -File $inputFile {
    default {
        # the trailing -replace "'" also removes the single quotes, as requested in the comments
        $line = $_ -replace "$commaUnlessQuoted", '|*' -replace "'"
        $writer.WriteLine($line)
        # if you want, uncomment the next line to show on console
        # $line
    }
}

# dispose of the StreamWriter when done; this flushes the buffer and closes the output file
$writer.Dispose()

Result:

123|*abc,def,ghf|*ajajaj|*1 
345|*abdf|*abc,def,ghi|*2
556|*abdf|*def|*3
999|*a,b,d|*d,e,f|*4
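
If an exception is thrown while the file is being processed, the writer in the streaming version above is never disposed and the output file can stay locked. The following is a minimal sketch of one way to guard against that, assuming the same regex and the same |* delimiter; the function name Convert-QuotedCsvDelimiter and its parameter names are made up for illustration and are not part of the original answer.

```powershell
# Hypothetical helper (name and parameters are assumptions, not from the answer).
# Streams the input line by line and always releases the output file.
function Convert-QuotedCsvDelimiter {
    param(
        [Parameter(Mandatory)] [string] $InputFile,   # absolute path to the source file
        [Parameter(Mandatory)] [string] $OutputFile   # absolute path to the file to create
    )

    # same pattern as in the answer: a comma followed by an even number of single quotes
    $commaUnlessQuoted = ",(?=([^']*'[^']*')*[^']*$)"

    $writer = [System.IO.StreamWriter]::new($OutputFile)
    try {
        switch -File $InputFile {
            default {
                # swap the separator commas for |* and then strip the single quotes
                $writer.WriteLine(($_ -replace $commaUnlessQuoted, '|*' -replace "'"))
            }
        }
    }
    finally {
        # runs even if reading or writing fails, so the file handle is released
        $writer.Dispose()
    }
}

# Example call (paths are placeholders):
# Convert-QuotedCsvDelimiter -InputFile 'D:\input.csv' -OutputFile 'D:\output.csv'
```

The try/finally block is the only functional difference from the code above. Wrapping it in a function is just a convenience for reuse.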


Regex details:
,                 Match the character “,” literally
(?=               Assert that the regex below can be matched, starting at this position (positive lookahead)
   (              Match the regular expression below and capture its match into backreference number 1
      [^']        Match any character that is NOT a “'”
         *        Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
      '           Match the character “'” literally
      [^']        Match any character that is NOT a “'”
         *        Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
      '           Match the character “'” literally
   )*             Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
   [^']           Match any character that is NOT a “'”
      *           Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
   $              Assert position at the end of the string (or before the line break at the end of the string, if any)
)
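
To see the lookahead at work without touching any files, the pattern can be tried on the sample rows from the question held in an array. This is only a small sketch: the variable names $sampleLines and $converted are illustrative, and the single quotes are deliberately kept so it is visible which commas were replaced.

```powershell
# same pattern as above
$commaUnlessQuoted = ",(?=([^']*'[^']*')*[^']*$)"

# the four sample rows from the question, in memory instead of in a file
$sampleLines = @(
    "123,'abc,def,ghf',ajajaj,1"
    "345,abdf,'abc,def,ghi',2"
    "556,abdf,def,3"
    "999,'a,b,d','d,e,f',4"
)

# replace only the separator commas; the quotes are left in place here
$converted = $sampleLines | ForEach-Object { $_ -replace $commaUnlessQuoted, '|*' }
$converted
# 123|*'abc,def,ghf'|*ajajaj|*1
# 345|*abdf|*'abc,def,ghi'|*2
# 556|*abdf|*def|*3
# 999|*'a,b,d'|*'d,e,f'|*4

# a stray apostrophe makes the quote count odd, so the comma before it is not replaced
"1,it's,2" -replace $commaUnlessQuoted, '|*'
# 1,it's|*2
```

The last line shows the main limitation: a comma is treated as a separator only when an even number of single quotes follows it, so the pattern relies on every quote in a line being part of a matching pair.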

A similar question about replacing the comma column separators in a CSV file while handling fields wrapped in single quotes can be found on Stack Overflow: https://stackoverflow.com/questions/60006585/
