python - Encoding error when opening an Excel file with xlrd

I am trying to open an Excel file (.xls) using xlrd. This is a summary of the code I am using:

import xlrd
workbook = xlrd.open_workbook('thefile.xls')

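This works for most files, but not for files I get from one particular organization. The error I get when I try to open an Excel file from that organization is shown below.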

Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/app/.heroku/python/lib/python2.7/site-packages/xlrd/__init__.py", line 435, in open_workbook
    ragged_rows=ragged_rows,
  File "/app/.heroku/python/lib/python2.7/site-packages/xlrd/book.py", line 116, in open_workbook_xls
    bk.parse_globals()
  File "/app/.heroku/python/lib/python2.7/site-packages/xlrd/book.py", line 1180, in parse_globals
    self.handle_writeaccess(data)
  File "/app/.heroku/python/lib/python2.7/site-packages/xlrd/book.py", line 1145, in handle_writeaccess
    strg = unpack_unicode(data, 0, lenlen=2)
  File "/app/.heroku/python/lib/python2.7/site-packages/xlrd/biffh.py", line 303, in unpack_unicode
    strg = unicode(rawstrg, 'utf_16_le')
  File "/app/.heroku/python/lib/python2.7/encodings/utf_16_le.py", line 16, in decode
    return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x40 in position 104: truncated data

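It looks as if xlrd is trying to open an Excel file that is encoded in something other than UTF-16. How can I avoid this error? Was the file written in a flawed way, or is there just one specific character causing the problem? If I open and re-save the Excel file, xlrd opens it without any problem.

I have tried opening the workbook with different encoding overrides, but that does not work either.

The file I am trying to open is available here: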

https://dl.dropboxusercontent.com/u/6779408/Stackoverflow/AEPUsageHistoryDetail_RequestID_00183816.xls

Issue reported here: https://github.com/python-excel/xlrd/issues/128

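Best answer

What are they using to generate that file?

They are using some Java Excel API (see below), possibly on an IBM mainframe or similar.

From the stack trace, the write access information cannot be decoded as Unicode because of the @ characters.

For more on the write access information in the XLS file format, see section 5.112 WRITEACCESS or page 277.

This field contains the username of the user who saved the file.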

import xlrd
# xlrd.dump prints the record dump to stdout and returns None
xlrd.dump('thefile.xls')

Running xlrd.dump on the original file gives

   36: 005c WRITEACCESS len = 0070 (112)
   40:      d1 81 a5 81 40 c5 a7 83 85 93 40 c1 d7 c9 40 40  ????@?????@???@@
   56:      40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40  @@@@@@@@@@@@@@@@
   72:      40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40  @@@@@@@@@@@@@@@@
   88:      40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40  @@@@@@@@@@@@@@@@
  104:      40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40  @@@@@@@@@@@@@@@@
  120:      40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40  @@@@@@@@@@@@@@@@
  136:      40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40  @@@@@@@@@@@@@@@@

After re-saving with Excel, or in my case LibreOffice Calc, the write access information is overwritten with something like

 36: 005c WRITEACCESS len = 0070 (112)
 40:      04 00 00 43 61 6c 63 20 20 20 20 20 20 20 20 20  ?~~Calc         
 56:      20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20                  
 72:      20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20                  
 88:      20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20                  
104:      20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20                  
120:      20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20                  
136:      20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20

Since the spaces are encoded as 40, I believe the encoding is EBCDIC, and when we decode d1 81 a5 81 40 c5 a7 83 85 93 40 c1 d7 c9 40 40 as EBCDIC we get Java Excel API.
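The EBCDIC guess can be checked with Python's built-in cp037 codec. This is a minimal sketch for Python 3; the variable names are illustrative, and the bytes are copied from the first row of the dump above.

```python
# First 16 bytes of the WRITEACCESS payload, copied from the dump.
raw = bytes.fromhex("d1 81 a5 81 40 c5 a7 83 85 93 40 c1 d7 c9 40 40")

# cp037 is Python's codec name for EBCDIC code page 037.
producer = raw.decode("cp037")
print(repr(producer))  # 'Java Excel API  '

# The padding byte 0x40 is a space in EBCDIC but "@" in ASCII,
# which is why the dump shows rows of @ characters.
print(repr(b"\x40".decode("cp037")))  # ' '
print(repr(b"\x40".decode("ascii")))  # '@'
```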

So yes, the file is written in a flawed way: in BIFF8 and later this field should be a Unicode string, and in BIFF3 to BIFF5 it should be a byte string in the encoding given by the CODEPAGE record

 152: 0042 CODEPAGE len = 0002 (2)
 156:      12 52                                            ?R

1252 is Windows CP-1252 (Latin I) (BIFF4-BIFF5), which is not EBCDIC_037.

The fact that xlrd tried to use Unicode means it determined the file version to be BIFF8.
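To make the BIFF8 failure concrete, here is a simplified sketch (Python 3) of how a BIFF8 Unicode string is laid out: a 2-byte little-endian character count, a 1-byte flags field, then the characters. The helper name try_unpack_biff8_string is hypothetical and this is not xlrd's code; it only illustrates why the EBCDIC payload cannot be read as such a string while the re-saved one can.

```python
import struct

def try_unpack_biff8_string(payload):
    """Return the decoded string, or None if payload is not a valid
    BIFF8 Unicode string (simplified, illustrative reading)."""
    if len(payload) < 3:
        return None
    (nchars,) = struct.unpack_from("<H", payload, 0)  # character count
    flags = payload[2]
    pos = 3
    if flags & 0x08:   # rich-text run count follows the flags
        pos += 2
    if flags & 0x04:   # phonetic block size follows the flags
        pos += 4
    if flags & 0x01:   # bit 0 set: 16-bit characters
        width, codec = 2, "utf_16_le"
    else:              # bit 0 clear: 8-bit characters
        width, codec = 1, "latin_1"
    raw = payload[pos:pos + nchars * width]
    if len(raw) != nchars * width:
        return None    # record is shorter than the declared length
    try:
        return raw.decode(codec)
    except UnicodeDecodeError:
        return None

# Payload from the original file: EBCDIC text padded with 0x40.
broken = bytes.fromhex("d1 81 a5 81 40 c5 a7 83 85 93 40 c1 d7 c9 40 40") + b"\x40" * 96
# Payload after re-saving with LibreOffice Calc.
resaved = b"\x04\x00\x00Calc" + b" " * 105

print(try_unpack_biff8_string(broken))   # None
print(try_unpack_biff8_string(resaved))  # Calc
```

In the broken payload the first two bytes d1 81 are read as a count of 33233 characters and the flags byte a5 selects 16-bit characters, so the 112-byte record is far too short. That is consistent with the "truncated data" message in the traceback.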

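In this case, you have two options

1. Fix the file before opening it with xlrd. You can use dump to check the file for non-standard output and, if that is the case, overwrite the write access information with xlutils.save or another library.
2. Patch xlrd to handle your special case: in handle_writeaccess, add a try block and set strg to an empty string when unpack_unicode fails.

The following snippet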

def handle_writeaccess(self, data):
    # Patched version of the method in xlrd/book.py
    DEBUG = 0
    if self.biff_version < 80:
        if not self.encoding:
            self.raw_user_name = True
            self.user_name = data
            return
        strg = unpack_string(data, 0, self.encoding, lenlen=1)
    else:
        try:
            strg = unpack_unicode(data, 0, lenlen=2)
        except UnicodeDecodeError:
            # Malformed WRITEACCESS record: fall back to an empty user name
            strg = ""
    if DEBUG:
        fprintf(self.logfile, "WRITEACCESS: %d bytes; raw=%s %r\n", len(data), self.raw_user_name, strg)
    strg = strg.rstrip()
    self.user_name = strg

together with

workbook = xlrd.open_workbook('thefile.xls', encoding_override="cp1252")

seems to open the file successfully.

Without the encoding override it complains: ERROR *** codepage 21010 -> encoding 'unknown_codepage_21010' -> LookupError: unknown encoding: unknown_codepage_21010
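The 21010 in that message appears to come straight from the CODEPAGE record shown in the dump. A small sketch for Python 3, assuming the two payload bytes 12 52 are read as a little-endian 16-bit integer (the variable names are illustrative):

```python
import struct

# Payload of the CODEPAGE record, copied from the dump.
codepage_payload = bytes.fromhex("12 52")

(as_little_endian,) = struct.unpack("<H", codepage_payload)
print(as_little_endian)  # 21010, the value in the error message

# For comparison, the integer 1252 stored as a little-endian
# 16-bit value would be the bytes e4 04.
print(struct.pack("<H", 1252).hex())  # e404
```

So the producer seems to have written the decimal digits of 1252 as hex nibbles instead of the integer 1252, which would explain why the encoding override is needed.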

Source: Stack Overflow, https://stackoverflow.com/questions/28334966/
