This code is a simplification of code in a Django app that receives an uploaded zip file via HTTP multipart POST and does read-only processing of the data inside:

    #!/usr/bin/env python

    import csv, sys, StringIO, traceback, zipfile
    try:
        import io
    except ImportError:
        sys.stderr.write('Could not import the `io` module.\n')

    def get_zip_file(filename, method):
        if method == 'direct':
            return zipfile.ZipFile(filename)
        elif method == 'StringIO':
            data = file(filename).read()
            return zipfile.ZipFile(StringIO.StringIO(data))
        elif method == 'BytesIO':
            data = file(filename).read()
            return zipfile.ZipFile(io.BytesIO(data))


    def process_zip_file(filename, method, open_defaults_file):
        zip_file = get_zip_file(filename, method)
        items_file = zip_file.open('items.csv')
        csv_file = csv.DictReader(items_file)

        try:
            for idx, row in enumerate(csv_file):
                image_filename = row['image1']

                if open_defaults_file:
                    z = zip_file.open('defaults.csv')
                    z.close()

            sys.stdout.write('Processed %d items.\n' % idx)
        except zipfile.BadZipfile:
            sys.stderr.write('Processing failed on item %d\n\n%s'
                             % (idx, traceback.format_exc()))


    process_zip_file(sys.argv[1], sys.argv[2], int(sys.argv[3]))

Pretty simple. We open the zip file and one or two CSV files inside it.

What's strange is that if I run this with a large zip file (~13 MB) and have it instantiate the `ZipFile` from a `StringIO.StringIO` or an `io.BytesIO` (perhaps anything other than a plain filename? I had similar problems in the Django app when trying to create a `ZipFile` from a `TemporaryUploadedFile`, or even from a file object created by calling `os.tmpfile()` and `shutil.copyfileobj()`) and have it open two CSV files rather than just one, then it fails towards the end of processing. Here is the output I see on a Linux system:

    $ ./test_zip_file.py ~/data.zip direct 1
    Processed 250 items.

    $ ./test_zip_file.py ~/data.zip StringIO 1
    Processing failed on item 242

    Traceback (most recent call last):
      File "./test_zip_file.py", line 26, in process_zip_file
        for idx, row in enumerate(csv_file):
      File ".../python2.7/csv.py", line 104, in next
        row = self.reader.next()
      File ".../python2.7/zipfile.py", line 523, in readline
        return io.BufferedIOBase.readline(self, limit)
      File ".../python2.7/zipfile.py", line 561, in peek
        chunk = self.read(n)
      File ".../python2.7/zipfile.py", line 581, in read
        data = self.read1(n - len(buf))
      File ".../python2.7/zipfile.py", line 641, in read1
        self._update_crc(data, eof=eof)
      File ".../python2.7/zipfile.py", line 596, in _update_crc
        raise BadZipfile("Bad CRC-32 for file %r" % self.name)
    BadZipfile: Bad CRC-32 for file 'items.csv'

    $ ./test_zip_file.py ~/data.zip BytesIO 1
    Processing failed on item 242

    Traceback (most recent call last):
      File "./test_zip_file.py", line 26, in process_zip_file
        for idx, row in enumerate(csv_file):
      File ".../python2.7/csv.py", line 104, in next
        row = self.reader.next()
      File ".../python2.7/zipfile.py", line 523, in readline
        return io.BufferedIOBase.readline(self, limit)
      File ".../python2.7/zipfile.py", line 561, in peek
        chunk = self.read(n)
      File ".../python2.7/zipfile.py", line 581, in read
        data = self.read1(n - len(buf))
      File ".../python2.7/zipfile.py", line 641, in read1
        self._update_crc(data, eof=eof)
      File ".../python2.7/zipfile.py", line 596, in _update_crc
        raise BadZipfile("Bad CRC-32 for file %r" % self.name)
    BadZipfile: Bad CRC-32 for file 'items.csv'

    $ ./test_zip_file.py ~/data.zip StringIO 0
    Processed 250 items.

    $ ./test_zip_file.py ~/data.zip BytesIO 0
    Processed 250 items.

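Incidentally, the code fails under the same conditions on my OS X system, but in a different way. Instead of raising the `BadZipfile` exception, it seems to read corrupted data and gets very confused.

This all suggests to me that I am doing something in this code that you are not supposed to do, e.g. calling `zipfile.open` on one file while another file within the same zip file object is still open? This doesn't seem to be a problem when using `ZipFile(filename)`, but perhaps it is a problem when passing `ZipFile` a file-like object, because of some implementation detail in the `zipfile` module?

Perhaps I missed something in the `zipfile` docs? Or maybe it isn't documented yet? Or (least likely) is it a bug in the `zipfile` module?
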
Best answer

I might have just found the problem and the solution, but unfortunately I had to replace Python's `zipfile` module with a hacked copy of my own (called `myzipfile` here).

    $ diff -u ~/run/lib/python2.7/zipfile.py myzipfile.py
    --- /home/msabramo/run/lib/python2.7/zipfile.py 2010-12-22 17:02:34.000000000 -0800
    +++ myzipfile.py 2011-04-11 11:51:59.000000000 -0700
    @@ -5,6 +5,7 @@
     import binascii, cStringIO, stat
     import io
     import re
    +import copy
     
     try:
         import zlib # We may need its compression method
    @@ -877,7 +878,7 @@
                 # Only open a new file for instances where we were not
                 # given a file object in the constructor
                 if self._filePassed:
    -                zef_file = self.fp
    +                zef_file = copy.copy(self.fp)
                 else:
                     zef_file = open(self.filename, 'rb')

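The problem in the standard `zipfile` module is that when it is passed a file object (rather than a filename), it uses that same passed-in file object for every call to the `open` method. This means that `tell` and `seek` are called on the same file, so the file position is shared between all members opened from the zip file, and multiple `open` calls end up stepping all over each other. In contrast, when passed a filename, `open` opens a new file object each time. My fix is for the case where a file object is passed in: instead of using that file object directly, I create a copy of it.

This change to `zipfile` fixes the problems I was seeing:

    $ ./test_zip_file.py ~/data.zip StringIO 1
    Processed 250 items.

    $ ./test_zip_file.py ~/data.zip BytesIO 1
    Processed 250 items.

    $ ./test_zip_file.py ~/data.zip direct 1
    Processed 250 items.
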
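To make the shared file position concrete, here is a minimal sketch that does not use `zipfile` at all. It only shows, under simplified assumptions, what happens when two readers share one file object compared with each reader having its own object over the same bytes. The names `payload` and `read_interleaved` are made up for this illustration, and the code is written for Python 3 rather than the Python 2.7 used in the question.

```python
import io

# Hypothetical stand-in for an archive: 10 bytes of "A" followed by 10 of "B".
payload = b"AAAAAAAAAABBBBBBBBBB"


def read_interleaved(first, second):
    """Read 10 bytes via `first`, with a seek/read on `second` in between."""
    chunks = [first.read(5)]      # start reading the "A" region
    second.seek(10)               # another reader jumps to the "B" region
    second.read(2)
    chunks.append(first.read(5))  # continue what should still be the "A" region
    return b"".join(chunks)


# One shared object: both readers move the same file position.
shared = io.BytesIO(payload)
shared_result = read_interleaved(shared, shared)

# Two independent objects over the same bytes: each keeps its own position.
independent_result = read_interleaved(io.BytesIO(payload), io.BytesIO(payload))

print(shared_result)       # b'AAAAABBBBB'
print(independent_result)  # b'AAAAAAAAAA'
```

With the shared object, the second half of the read comes from the wrong region, which is the same kind of corruption that would make a CRC check fail partway through a member.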
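I don't know, though, whether this change has any other negative effects on `zipfile`...

EDIT: I just found this in the Python docs, which I had somehow overlooked before. At http://docs.python.org/library/zipfile.html#zipfile.ZipFile.open , it says:
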
> Note: If the ZipFile was created by passing in a file-like object as the first argument to the constructor, then the object returned by `open()` shares the ZipFile’s file pointer. Under these circumstances, the object returned by `open()` should not be used after any additional operations are performed on the ZipFile object. If the ZipFile was created by passing in a string (the filename) as the first argument to the constructor, then `open()` will create a new file object that will be held by the ZipExtFile, allowing it to operate independently of the ZipFile.
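
One way to stay within the rule quoted above, without patching `zipfile`, is to read a member completely before performing any further operation on the `ZipFile`. The following is only a sketch of that idea, not the fix used in this answer. It is written for Python 3, and `build_demo_zip`, `count_items` and the archive contents are invented for the example.

```python
import csv
import io
import zipfile


def build_demo_zip():
    """Build a small in-memory zip; the member contents are made up."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        rows = "".join("img%d.png\n" % i for i in range(250))
        zf.writestr("items.csv", "image1\n" + rows)
        zf.writestr("defaults.csv", "key,value\ncolor,red\n")
    buf.seek(0)
    return buf


def count_items(fileobj):
    """Count rows in items.csv, opening defaults.csv once per row."""
    with zipfile.ZipFile(fileobj) as zf:
        # Read the whole member up front, so no partially read member is
        # still in use when the ZipFile is touched again below.
        items_text = zf.read("items.csv").decode("utf-8")
        count = 0
        for row in csv.DictReader(io.StringIO(items_text)):
            with zf.open("defaults.csv") as defaults:
                defaults.read()
            count += 1
    return count


print("Processed %d items." % count_items(build_demo_zip()))
```

The trade-off is memory: the whole member is held in memory at once, which is acceptable for a CSV file of modest size but may not suit very large members.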

Regarding the zip question 'Strange "BadZipfile: Bad CRC-32" problem', a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/5624669/