windows - Decode PowerShell output possibly containing non-ASCII Unicode characters into a Python string

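I need to decode the standard output of PowerShell, called from Python, into a Python string.

My ultimate goal is to get the names of the network adapters on Windows as a list of strings. My current function looks like this and works well on Windows 10 with the English language: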

import subprocess

def get_interfaces():
    # Ask PowerShell for the adapter names in "Name : value" list format.
    ps = subprocess.Popen(['powershell', 'Get-NetAdapter', '|', 'select Name', '|', 'fl'], stdout=subprocess.PIPE)
    stdout, _ = ps.communicate(timeout=10)  # stderr is not captured
    interfaces = []
    for i in stdout.split(b'\r\n'):
        if not i.strip():
            continue
        if i.find(b':') < 0:
            continue
        # Split on the first colon only, so values containing ':' survive.
        name, value = [j.strip() for j in i.split(b':', 1)]
        if name == b'Name':
            interfaces.append(value.decode('ascii'))  # This fails for other users
    return interfaces

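Other users use different languages, so value.decode('ascii') fails for some of them. For example, one user reported that changing to decode('ISO 8859-2') works well for him (so it is not UTF-8). How can I know which encoding to use to decode the stdout bytes returned by the call to PowerShell?

Update

After some experiments I am even more confused. The code page in my console, as returned by chcp, is 437. I changed the network adapter name to one containing non-ASCII and non-cp437 characters. In an interactive PowerShell session, running Get-NetAdapter | select Name | fl displayed the name correctly, including its non-cp437 characters. When I called PowerShell from Python, the non-ASCII characters were converted to the closest ASCII characters (e.g. ā to a, ž to z) and .decode('ascii') worked nicely. Could this behavior (and correspondingly the solution) be Windows version dependent? I am on Windows 10, but users could be on older versions of Windows, down to Windows 7.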

Best answer

The output character encoding may depend on the specific command, e.g.:

#!/usr/bin/env python3
import subprocess
import sys

encoding = 'utf-32'
cmd = r'''$env:PYTHONIOENCODING = "%s"; py -3 -c "print('\u270c')"''' % encoding
data = subprocess.check_output(["powershell", "-C", cmd])
print(sys.stdout.encoding)
print(data)
print(ascii(data.decode(encoding)))

Output

cp437
b"\xff\xfe\x00\x00\x0c'\x00\x00\r\x00\x00\x00\n\x00\x00\x00"
'\u270c\r\n'

The ✌ (U+270C) character is received successfully.

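The character encoding of the child script is set with the PYTHONIOENCODING environment variable inside the PowerShell session. I chose utf-32 as the output encoding so that, for the demonstration, it differs from both the Windows ANSI and OEM code pages.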
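As a quick check on the bytes shown in the output above, here is a minimal sketch that decodes them without involving PowerShell at all. The variable names are mine, chosen for illustration.

```python
# Bytes captured from the child process in the example output above.
data = b"\xff\xfe\x00\x00\x0c'\x00\x00\r\x00\x00\x00\n\x00\x00\x00"

# The first four bytes are the UTF-32 little-endian byte-order mark;
# the 'utf-32' codec consumes it and decodes the rest accordingly.
text = data.decode('utf-32')
print(ascii(text))
```

The byte 0x27 in the payload happens to be an apostrophe, which is why the bytes literal shows `\x0c'` for U+270C.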

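Note that the stdout encoding of the parent Python script is the OEM code page (cp437 in this case), because the script is run from the Windows console. If you redirect the output of the parent Python script to a file or pipe, then Python 3 uses the ANSI code page (e.g., cp1252) by default.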
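To see why the choice of code page matters, the following sketch prints what the current process reports and then decodes one and the same byte under two code pages. The values printed by the first two lines depend on the Python version and on how the script is launched, so treat them as something to inspect rather than as fixed results.

```python
import locale
import sys

# What Python picked for this process's stdout (console vs. redirected may differ).
print(sys.stdout.encoding)
# The locale's preferred encoding (the ANSI code page on Windows).
print(locale.getpreferredencoding(False))

# The same byte means different characters under different code pages.
raw = b'\xe9'
as_oem = raw.decode('cp437')    # OEM code page used in the example
as_ansi = raw.decode('cp1252')  # a common ANSI code page
print(ascii(as_oem), ascii(as_ansi))
```

Guessing the wrong one of the two silently yields a different character instead of an error, which is why pinning the encoding explicitly is the safer route.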

To decode PowerShell output that might contain characters that cannot be decoded in the current OEM code page, you could set [Console]::OutputEncoding temporarily (inspired by @eryksun's comments):

#!/usr/bin/env python3
import io
import sys
from subprocess import Popen, PIPE

char = ord('✌')
filename = 'U+{char:04x}.txt'.format(**vars())
with Popen(["powershell", "-C", '''
    $old = [Console]::OutputEncoding
    [Console]::OutputEncoding = [Text.Encoding]::UTF8
    echo $([char]0x{char:04x}) | fl
    echo $([char]0x{char:04x}) | tee {filename}
    [Console]::OutputEncoding = $old'''.format(**vars())],
           stdout=PIPE) as process:
    print(sys.stdout.encoding)
    for line in io.TextIOWrapper(process.stdout, encoding='utf-8-sig'):
        print(ascii(line))
print(ascii(open(filename, encoding='utf-16').read()))

Output

cp437
'\u270c\n'
'\u270c\n'
'\u270c\n'

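Both fl and tee use [Console]::OutputEncoding for stdout (the default behavior is as if | Write-Output were appended to the pipelines). tee uses utf-16 to save the text to a file. The output shows that ✌ (U+270C) is decoded successfully.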
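Applied to the adapter-name problem from the question, the same idea could look roughly like the sketch below. PS_SCRIPT and parse_names are hypothetical names introduced here, and get_interfaces is only a possible replacement for the question's function, not a tested drop-in; it assumes the script really does emit UTF-8 once [Console]::OutputEncoding is set.

```python
import subprocess

# PowerShell script: switch the console output encoding to UTF-8 for this
# call only, list the adapter names, then restore the previous encoding.
PS_SCRIPT = '''
    $old = [Console]::OutputEncoding
    [Console]::OutputEncoding = [Text.Encoding]::UTF8
    Get-NetAdapter | select Name | fl
    [Console]::OutputEncoding = $old
'''

def parse_names(raw):
    """Extract the values of 'Name : value' lines from raw UTF-8 bytes."""
    text = raw.decode('utf-8-sig')  # tolerate an optional UTF-8 BOM
    names = []
    for line in text.splitlines():
        # Split on the first colon only, so names containing ':' survive.
        key, sep, value = line.partition(':')
        if sep and key.strip() == 'Name':
            names.append(value.strip())
    return names

def get_interfaces():
    """Hypothetical replacement for the question's function (Windows only)."""
    raw = subprocess.check_output(['powershell', '-C', PS_SCRIPT], timeout=10)
    return parse_names(raw)
```

Keeping the parsing in its own function means it can be exercised with canned bytes on any platform, while only get_interfaces needs a Windows machine.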

$OutputEncoding controls how text is encoded when it is piped to the next external program in the middle of a pipeline:

#!/usr/bin/env python3
import subprocess

cmd = r'''
  $OutputEncoding = [Console]::OutputEncoding = New-Object System.Text.UTF8Encoding
  py -3 -c "import os; os.write(1, '\U0001f60a'.encode('utf-8')+b'\n')" |
  py -3 -c "import os; print(os.read(0, 512))"
'''
subprocess.check_call(["powershell", "-C", cmd])

Output

b'\xf0\x9f\x98\x8a\r\n'

That is correct: b'\xf0\x9f\x98\x8a'.decode('utf-8') == u'\U0001f60a'. With the default $OutputEncoding (ascii) we would get b'????\r\n'.
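One way to picture where the four question marks could come from is to emulate the round trip in Python. This is only an illustration under two assumptions of mine: that the bytes are first misread with an OEM code page such as cp437, and that the result is then re-encoded as ASCII for the next process. It is not PowerShell's actual code.

```python
emoji_utf8 = '\U0001f60a'.encode('utf-8')  # four bytes: f0 9f 98 8a

# Assumed step 1: the bytes are misread as four cp437 characters.
misread = emoji_utf8.decode('cp437')
# Assumed step 2: each of them is unencodable in ASCII and becomes '?'.
reencoded = misread.encode('ascii', errors='replace')
print(reencoded)

# With UTF-8 on both sides the round trip is lossless.
roundtrip = emoji_utf8.decode('utf-8').encode('utf-8')
print(roundtrip == emoji_utf8)
```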

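Notes:

b'\n' is replaced with b'\r\n' despite the use of a binary API such as os.read/os.write (msvcrt.setmode(sys.stdout.fileno(), os.O_BINARY) has no effect here)

b'\r\n' is appended if the piped output does not end with a newline: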

#!/usr/bin/env python3
from subprocess import check_output

cmd = '''py -3 -c "print('no newline in the input', end='')"'''
cat = '''py -3 -c "import os; os.write(1, os.read(0, 512))"'''  # pass as is
piped = check_output(['powershell', '-C', '{cmd} | {cat}'.format(**vars())])
no_pipe = check_output(['powershell', '-C', '{cmd}'.format(**vars())])
print('piped:   {piped}\nno pipe: {no_pipe}'.format(**vars()))

Output:

piped:   b'no newline in the input\r\n'
no pipe: b'no newline in the input'

The newline is appended to the piped output.
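Given these newline quirks, it is usually simplest to normalize line endings while decoding. The sketch below uses canned bytes (a UTF-8 BOM, U+270C and a CRLF, chosen by me as an example) in place of real PowerShell output.

```python
import io

# Example bytes: UTF-8 BOM, U+270C, then the CRLF that the pipeline adds.
raw = b'\xef\xbb\xbf\xe2\x9c\x8c\r\n'

# 'utf-8-sig' drops the BOM; the default newline=None turns '\r\n' into '\n'.
lines = list(io.TextIOWrapper(io.BytesIO(raw), encoding='utf-8-sig'))
print([ascii(line) for line in lines])

# For a single value it is often enough to strip the trailing line ending.
value = raw.decode('utf-8-sig').rstrip('\r\n')
print(ascii(value))
```

Wrapping process.stdout in io.TextIOWrapper, as in the earlier example, gives the same normalization for a live pipe.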

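If we ignore lone surrogates, then setting UTF8Encoding allows all Unicode characters, including non-BMP characters, to be passed through pipes. Text mode can be used in Python if $env:PYTHONIOENCODING = "utf-8:ignore" is configured.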
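The effect of the ignore error handler on lone surrogates can be checked directly in Python. This small sketch uses a made-up string with an unpaired surrogate in the middle.

```python
text = 'a\ud83db'  # contains a lone (unpaired) surrogate

try:
    text.encode('utf-8')  # strict mode rejects lone surrogates
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The 'ignore' handler, as in PYTHONIOENCODING="utf-8:ignore", drops it.
ignored = text.encode('utf-8', errors='ignore')
# A properly paired non-BMP character encodes fine in either mode.
non_bmp = '\U0001f60a'.encode('utf-8')
print(strict_ok, ignored, non_bmp)
```

So the ignore handler trades a possible UnicodeEncodeError for silently dropped code units, which is acceptable only if lone surrogates carry no meaning in your data.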

In interactive powershell running Get-NetAdapter | select Name | fl displayed correctly the name even its non-cp437 character.

If stdout is not redirected, then the Unicode API is used to print characters to the console: any [BMP] Unicode character can be displayed if the console (TrueType) font supports it.

When I called powershell from python non-ascii characters were converted to closest ascii characters (e.g. ā to a, ž to z) and .decode(ascii) worked nicely.

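This might be due to System.Text.InternalDecoderBestFitFallback being set for [Console]::OutputEncoding: if a Unicode character cannot be encoded in the given encoding, it is passed to the fallback, which substitutes either a best-fit character or '?' for the original character.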
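To get a feel for what a best-fit substitution does, here is a rough Python stand-in. It is not the .NET mapping table: approximate_best_fit is a hypothetical helper that merely decomposes characters and drops combining marks, which happens to reproduce the ā to a and ž to z cases from the question.

```python
import unicodedata

def approximate_best_fit(text, encoding='ascii'):
    """Rough stand-in for a best-fit fallback; NOT the .NET mapping table."""
    out = []
    for ch in text:
        try:
            ch.encode(encoding)
            out.append(ch)  # encodable as is
            continue
        except UnicodeEncodeError:
            pass
        # Decompose (e.g. 'ā' -> 'a' + combining macron) and keep what fits.
        base = ''.join(c for c in unicodedata.normalize('NFKD', ch)
                       if not unicodedata.combining(c))
        try:
            base.encode(encoding)
            out.append(base or '?')
        except UnicodeEncodeError:
            out.append('?')  # nothing close enough: fall back to '?'
    return ''.join(out)

print(approximate_best_fit('āž ✌'))
```

Because the substitution is lossy, a name that decodes cleanly as ASCII after such a fallback may no longer match the real adapter name.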

Could this behavior (and correspondingly solution) be Windows version dependent? I am on Windows 10, but users could be on older Windows down to Windows 7.

If we ignore bugs in cp65001 and the list of new encodings that are supported in later versions, then the behavior should be the same.

Source: the Stack Overflow question "windows - Decode PowerShell output possibly containing non-ASCII Unicode characters into a Python string": https://stackoverflow.com/questions/33936074/

