powershell - Convert source files to UTF-8 without BOM

Tags: powershell utf-8 character-encoding

I'm trying to convert all source files in a target folder to UTF-8 (without BOM) encoding.
I use the following PowerShell script:

$MyPath = "D:\my projects\etc\"
Get-ChildItem $MyPath\* -Include *.h, *.cpp, *.c | Foreach-Object {
    $content = Get-Content $_.FullName  
    $Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding $False
    [System.IO.File]::WriteAllLines($_.FullName, $content, $Utf8NoBomEncoding)    
}
cmd /c pause | out-null

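This works fine if the files are not yet UTF-8. However, if a file is already BOM-less UTF-8, all national characters are converted to unknown symbols (for example, if I run the script a second time). How can I change the script to fix this?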

Best answer

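As Ansgar Wiechers points out in a comment, the problem is that Windows PowerShell, in the absence of a BOM, defaults to interpreting files as "ANSI"-encoded, i.e., using the encoding implied by the legacy system locale (ANSI code page), as reflected in [System.Text.Encoding]::Default in .NET Framework (but not in .NET Core).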
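
To see this misinterpretation in isolation, here is a minimal sketch (not part of the original answer) that decodes the UTF-8 bytes of a short Cyrillic word as Windows-1251, which is roughly what happens when Windows PowerShell reads a BOM-less UTF-8 file on a system whose ANSI code page is 1251. The sample word and the variable names are illustrative assumptions; the word is built from code points so that the snippet itself stays pure ASCII.

```powershell
# Build the sample word from code points so that this script is ASCII-only
# and cannot itself be misread.
$sample = -join ([char[]] (0x41F, 0x440, 0x438, 0x432, 0x435, 0x442))

# The bytes as they would be stored in a BOM-less UTF-8 file
# (2 bytes per Cyrillic letter).
$utf8Bytes = [Text.Encoding]::UTF8.GetBytes($sample)

# Decoding those bytes as Windows-1251 (assumed here to be the ANSI code page)
# turns every single byte into a separate character.
$misread = [Text.Encoding]::GetEncoding(1251).GetString($utf8Bytes)

# Decoding them as UTF-8 recovers the original text.
$correct = [Text.Encoding]::UTF8.GetString($utf8Bytes)

'{0} chars originally, {1} chars when misread' -f $sample.Length, $misread.Length
```

Writing $misread back out as UTF-8 is what produces the "unknown symbols" on the second run of the script in the question.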

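Given that, and given that (based on your follow-up comments) the BOM-less files among your input files are a mix of Windows-1251-encoded files and UTF-8 files, you must examine their content to determine their specific encoding: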

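  • Read each file with -Encoding Utf8 and test whether the resulting string contains the Unicode REPLACEMENT CHARACTER (U+FFFD). If it does, the file is not UTF-8, because this special character signals that byte sequences were encountered that are invalid as UTF-8.
  • If the file is not valid UTF-8, simply read it again without specifying -Encoding, which causes Windows PowerShell to interpret it as Windows-1251-encoded, because that is the encoding (code page) implied by your system locale.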

    $MyPath = "D:\my projects\etc"
    Get-ChildItem $MyPath\* -Include *.h, *.cpp, *.c | Foreach-Object {
        # Note:
        #  * the use of -Encoding Utf8 to first try to read the file as UTF-8.
        #  * the use of -Raw to read the entire file as a *single string*.
        $content = Get-Content -Raw -Encoding Utf8 $_.FullName  
    
        # If the replacement char. is found in the content, the implication
        # is that the file is NOT UTF-8, so read it again *without -Encoding*,
        # which interprets the files as "ANSI" encoded (Windows-1251, in your case).
        if ($content.Contains([char] 0xfffd)) {
          $content = Get-Content -Raw $_.FullName  
        }
    
        # Note the use of WriteAllText() in lieu of WriteAllLines()
        # and that no explicit encoding object is passed, given that
        # .NET *defaults* to BOM-less UTF-8.
        # CAVEAT: There's a slight risk of data loss if writing back to the input
        #         file is interrupted.
        [System.IO.File]::WriteAllText($_.FullName, $content)    
    }
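
The U+FFFD test can also be tried on in-memory byte arrays, without touching any files. The following is a sketch under the assumption that the legacy encoding is Windows-1251; the sample word and variable names are illustrative, not from the original answer.

```powershell
$cp1251 = [Text.Encoding]::GetEncoding(1251)   # assumed legacy (ANSI) encoding
$sample = -join ([char[]] (0x41F, 0x440, 0x438, 0x432, 0x435, 0x442))

# The same text, encoded in the two ways the input files might use.
$ansiBytes = $cp1251.GetBytes($sample)
$utf8Bytes = [Text.Encoding]::UTF8.GetBytes($sample)

# [Text.Encoding]::UTF8 decodes leniently: byte sequences that are invalid
# as UTF-8 are replaced with U+FFFD rather than raising an error.
$ansiLooksInvalid = [Text.Encoding]::UTF8.GetString($ansiBytes).Contains([char] 0xFFFD)
$utf8LooksInvalid = [Text.Encoding]::UTF8.GetString($utf8Bytes).Contains([char] 0xFFFD)

"1251 bytes flagged: $ansiLooksInvalid; UTF-8 bytes flagged: $utf8LooksInvalid"
```

Note that the test is a heuristic: a pure-ASCII file passes it under either encoding (harmlessly, since the bytes are identical), and a legacy-encoded file whose bytes happen to form valid UTF-8 would go undetected.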
    

A faster alternative is to use [IO.File]::ReadAllText() with a UTF-8 encoding object that throws an exception when it encounters bytes that are invalid as UTF-8 (PSv5+ syntax):
    $utf8EncodingThatThrows = [Text.UTF8Encoding]::new($false, $true)
    
    # ...
    
      try {
         $content = [IO.File]::ReadAllText($_.FullName, $utf8EncodingThatThrows)
      } catch [Text.DecoderFallbackException] {         
         $content = [IO.File]::ReadAllText($_.FullName, [Text.Encoding]::Default)
      }
    
    # ...
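
For experimentation, the try / catch fragment above can be wrapped in a small function and exercised against a temporary file. This is only a sketch: the function name Read-TextWithFallback and its parameters are hypothetical, and the fallback encoding is passed in explicitly (Windows-1251 is assumed in the demo) rather than taken from [Text.Encoding]::Default.

```powershell
# Hypothetical helper: try strict UTF-8 first, then a caller-supplied fallback.
function Read-TextWithFallback {
  param(
    [Parameter(Mandatory)] [string] $LiteralPath,   # should be a full path
    [Parameter(Mandatory)] [Text.Encoding] $FallbackEncoding
  )
  # 1st argument: $false = no BOM when encoding;
  # 2nd argument: $true  = throw on invalid bytes when decoding.
  $strictUtf8 = [Text.UTF8Encoding]::new($false, $true)
  try {
    [IO.File]::ReadAllText($LiteralPath, $strictUtf8)
  } catch [Text.DecoderFallbackException] {
    [IO.File]::ReadAllText($LiteralPath, $FallbackEncoding)
  }
}

# Demo: store the same word once as Windows-1251 and once as UTF-8.
$cp1251 = [Text.Encoding]::GetEncoding(1251)
$sample = -join ([char[]] (0x41F, 0x440, 0x438, 0x432, 0x435, 0x442))
$tmp = [IO.Path]::GetTempFileName()
try {
  [IO.File]::WriteAllBytes($tmp, $cp1251.GetBytes($sample))
  $fromAnsi = Read-TextWithFallback -LiteralPath $tmp -FallbackEncoding $cp1251

  [IO.File]::WriteAllBytes($tmp, [Text.Encoding]::UTF8.GetBytes($sample))
  $fromUtf8 = Read-TextWithFallback -LiteralPath $tmp -FallbackEncoding $cp1251
} finally {
  Remove-Item -LiteralPath $tmp
}
```

Both $fromAnsi and $fromUtf8 should then contain the original word. Passing the fallback encoding as a parameter keeps the helper usable in PowerShell Core, where [Text.Encoding]::Default is UTF-8.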
    

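Adapting the solutions above to PowerShell Core / .NET Core:

  • PowerShell Core defaults to (BOM-less) UTF-8, so simply omitting -Encoding does not work for reading ANSI-encoded files.
  • Similarly, [System.Text.Encoding]::Default invariably reports UTF-8 in .NET Core.

Therefore, you must manually determine the active system locale's ANSI code page and obtain the corresponding encoding object: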
    $ansiEncoding = [Text.Encoding]::GetEncoding(
      [int] (Get-ItemPropertyValue HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage ACP)
    )
    

You then need to pass this encoding explicitly, either to Get-Content via -Encoding (Get-Content -Raw -Encoding $ansiEncoding $_.FullName) or to the .NET methods ([IO.File]::ReadAllText($_.FullName, $ansiEncoding)).

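The original form of the answer, for input files that are all UTF-8-encoded already:

So, if some of your UTF-8-encoded files are (already) BOM-less, you must explicitly instruct Get-Content to treat them as UTF-8, using -Encoding Utf8. Otherwise, they are misinterpreted if they contain characters outside the 7-bit ASCII range: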
    $MyPath = "D:\my projects\etc"
    Get-ChildItem $MyPath\* -Include *.h, *.cpp, *.c | Foreach-Object {
        # Note:
        #  * the use of -Encoding Utf8 to ensure the correct interpretation of the input file
        #  * the use of -Raw to read the entire file as a *single string*.
        $content = Get-Content -Raw -Encoding Utf8 $_.FullName  
    
        # Note the use of WriteAllText() in lieu of WriteAllLines()
        # and that no explicit encoding object is passed, given that
        # .NET *defaults* to BOM-less UTF-8.
        # CAVEAT: There's a slight risk of data loss if writing back to the input
        #         file is interrupted.
        [System.IO.File]::WriteAllText($_.FullName, $content)    
    }
    

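Note: In your scenario, BOM-less UTF-8 files do not need to be rewritten, but doing so is benign and simplifies the code. The alternative is to test whether the first 3 bytes of each file are the UTF-8 BOM and to skip files that lack it: $hasUtf8Bom = "$(Get-Content -Encoding Byte -First 3 $_.FullName)" -eq '239 187 191' (Windows PowerShell) or $hasUtf8Bom = "$(Get-Content -AsByteStream -First 3 $_.FullName)" -eq '239 187 191' (PowerShell Core).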
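
If a single BOM test that behaves the same in both PowerShell editions is preferred, the first three bytes can be read via .NET directly. The following is a minimal sketch; the function name Test-Utf8Bom is hypothetical and not part of the original answer.

```powershell
# Hypothetical helper: $true if the file starts with the UTF-8 BOM (EF BB BF).
function Test-Utf8Bom {
  param([Parameter(Mandatory)] [string] $LiteralPath)   # should be a full path
  $stream = [IO.File]::OpenRead($LiteralPath)
  try {
    $buffer = [byte[]]::new(3)
    $count = $stream.Read($buffer, 0, 3)   # fewer than 3 for very short files
  } finally {
    $stream.Dispose()
  }
  $count -eq 3 -and $buffer[0] -eq 0xEF -and $buffer[1] -eq 0xBB -and $buffer[2] -eq 0xBF
}

# Demo: the same temporary file, written with and without a BOM.
$tmp = [IO.Path]::GetTempFileName()
try {
  [IO.File]::WriteAllText($tmp, 'abc', [Text.UTF8Encoding]::new($true))
  $withBom = Test-Utf8Bom -LiteralPath $tmp

  [IO.File]::WriteAllText($tmp, 'abc')   # .NET default: BOM-less UTF-8
  $withoutBom = Test-Utf8Bom -LiteralPath $tmp
} finally {
  Remove-Item -LiteralPath $tmp
}
```

Inside the ForEach-Object loop, such a test could be used to skip files that have no BOM, e.g. if (-not (Test-Utf8Bom $_.FullName)) { return }.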

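As an aside: if there are input files with a non-UTF-8 encoding (e.g., UTF-16), the solution still works as long as those files have a BOM, because PowerShell (quietly) gives a BOM precedence over the encoding specified via -Encoding.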

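Note that using -Raw / WriteAllText() to read / write each file as a whole (as a single string) not only speeds up processing a bit, but also ensures that the following characteristics of each input file are preserved:

  • the specific newline style (CRLF (Windows) vs. LF-only (Unix))
  • whether or not the last line has a trailing newline.

By contrast, not using -Raw (reading line by line) and using .WriteAllLines() does not preserve these characteristics: you invariably get platform-appropriate newlines (in Windows PowerShell, always CRLF), and you invariably get a trailing newline.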
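
The difference between the two approaches can be checked with a temporary file that uses LF-only newlines and has no trailing newline. This is an illustrative sketch; the file content and variable names are assumptions, not part of the original answer.

```powershell
$tmp = [IO.Path]::GetTempFileName()
try {
  # Input: LF-only newlines, no trailing newline.
  $original = "line 1`nline 2"
  [IO.File]::WriteAllText($tmp, $original)

  # Whole-file round trip: -Raw plus WriteAllText() leaves the text unchanged.
  $raw = Get-Content -Raw -LiteralPath $tmp
  [IO.File]::WriteAllText($tmp, $raw)
  $afterRaw = [IO.File]::ReadAllText($tmp)

  # Line-by-line round trip: WriteAllLines() re-creates the newlines itself.
  $lines = Get-Content -LiteralPath $tmp
  [IO.File]::WriteAllLines($tmp, [string[]] $lines)
  $afterLines = [IO.File]::ReadAllText($tmp)
} finally {
  Remove-Item -LiteralPath $tmp
}

$nl = [Environment]::NewLine
"Raw round trip unchanged:       $($afterRaw -ceq $original)"
"Lines round trip adds newline:  $($afterLines -ceq "line 1${nl}line 2${nl}")"
```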

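Note that the cross-platform PowerShell Core edition sensibly defaults to UTF-8 when reading a file without a BOM, and it also creates BOM-less UTF-8 files by default; creating a UTF-8 file with a BOM requires explicit opt-in with -Encoding utf8BOM.

Therefore, a PowerShell Core solution is much simpler: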

    # PowerShell Core only.
    
    $MyPath = "D:\my projects\etc"
    Get-ChildItem $MyPath\* -Include *.h, *.cpp, *.c | Foreach-Object {
        # * Read the file at hand (UTF8 files both with and without BOM are 
        #   read correctly).
        # * Simply rewrite it with the *default* encoding, which in 
        #   PowerShell Core is BOM-less UTF-8.
        # Note the (...) around the Get-Content call, which is necessary in order
        # to write back to the *same* file in the same pipeline.
        # CAVEAT: There's a slight risk of data loss if writing back to the input
        #         file is interrupted.
        (Get-Content -Raw $_.FullName) | Set-Content -NoNewline $_.FullName
    }
    

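Faster solution based on .NET types

The solutions above work, but Get-Content and Set-Content are relatively slow, so using .NET types to read and rewrite the files performs better.

As above, no encoding needs to be specified explicitly in the following solution (not even in Windows PowerShell), because, commendably, .NET itself has defaulted to BOM-less UTF-8 since its inception (while still recognizing a UTF-8 BOM, if present):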
    $MyPath = "D:\my projects\etc"
    Get-ChildItem $MyPath\* -Include *.h, *.cpp, *.c | Foreach-Object {
      # CAVEAT: There's a slight risk of data loss if writing back to the input
      #         file is interrupted.
      [System.IO.File]::WriteAllText(
        $_.FullName,
        [System.IO.File]::ReadAllText($_.FullName)
      )   
    }
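
The caveat in the code comments (data loss if the write is interrupted) can be reduced by writing the converted text to a temporary file first and only then replacing the original. The following is a sketch of that idea rather than part of the original answer: the function name Convert-FileToUtf8NoBom and the ".tmp" suffix are hypothetical, the move is not guaranteed to be atomic, and file attributes or permissions of the original may not carry over.

```powershell
# Hypothetical helper: rewrite a file as BOM-less UTF-8 via a temporary file.
function Convert-FileToUtf8NoBom {
  param([Parameter(Mandatory)] [string] $LiteralPath)   # should be a full path
  # Assumption: no other file with this name exists next to the original.
  $tempPath = "$LiteralPath.tmp"
  # ReadAllText() honors a BOM if present and otherwise assumes UTF-8;
  # WriteAllText() without an encoding argument writes BOM-less UTF-8.
  [IO.File]::WriteAllText($tempPath, [IO.File]::ReadAllText($LiteralPath))
  # Replace the original only after the new content has been written in full.
  Move-Item -LiteralPath $tempPath -Destination $LiteralPath -Force
}

# Demo: a file written with a BOM comes out without one.
$tmp = [IO.Path]::GetTempFileName()
try {
  [IO.File]::WriteAllText($tmp, 'abc', [Text.UTF8Encoding]::new($true))
  $bytesBefore = [IO.File]::ReadAllBytes($tmp)
  Convert-FileToUtf8NoBom -LiteralPath $tmp
  $bytesAfter = [IO.File]::ReadAllBytes($tmp)
} finally {
  Remove-Item -LiteralPath $tmp
}
"Bytes before: $($bytesBefore.Length); bytes after: $($bytesAfter.Length)"
```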
    

Regarding "powershell - Convert source files to UTF-8 without BOM", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/54534765/
