python - Reading a binary file using memoryview

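In the code below I read a large file with a special structure: among other things, it contains two blocks that need to be processed at the same time. Instead of seeking back and forth in the file, I load these two blocks wrapped in memoryview calls: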

with open(abs_path, 'rb') as bsa_file:
    # ...
    # load the file record block to parse later
    file_records_block = memoryview(bsa_file.read(file_records_block_size))
    # load the file names block
    file_names_block = memoryview(bsa_file.read(total_file_name_length))
    # close the file
file_records_index = names_record_index = 0
for folder_record in folder_records:
    name_size = struct.unpack_from('B', file_records_block, file_records_index)[0]
    # discard null terminator below
    folder_path = struct.unpack_from('%ds' % (name_size - 1),
        file_records_block, file_records_index + 1)[0]
    file_records_index += name_size + 1
    for __ in xrange(folder_record.files_count):
        file_name_len = 0
        for b in file_names_block[names_record_index:]:
            if b != '\x00': file_name_len += 1
            else: break
        file_name = unicode(struct.unpack_from('%ds' % file_name_len,
            file_names_block,names_record_index)[0])
        names_record_index += file_name_len + 1

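The file is parsed correctly, but since this is my first time using the memoryview interface, I am not sure I am doing it right. As the code shows, file_names_block consists of null-terminated C strings.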

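Is my trick file_names_block[names_record_index:] using memoryview magic, or am I creating some n^2 slices? Would I need to use islice here?
As shown, I just look for the null byte manually and then proceed to unpack_from. But I read in "How to split a byte string into separate bytes in python" that I can use cast() (docs?) on the memoryview. Is there any way to use that (or another trick) to split the view on a byte? Could I just call split('\x00')? Would that preserve memory efficiency?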

I would appreciate insight into the one right way to do this (in Python 2).

Best answer

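A memoryview gives you no advantage when it comes to null-terminated strings, because it has no facilities for anything other than fixed-width data. You may as well use bytes.split() here: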

file_names_block = bsa_file.read(total_file_name_length)
file_names = file_names_block.split(b'\00')
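As a minimal sketch of how the split result could be consumed, the example below uses a made-up names block and made-up per-folder file counts (standing in for folder_record.files_count); neither comes from a real BSA file. Note that split() leaves one empty string after the final terminator, which is dropped here.

```python
from itertools import islice

# Hypothetical names block: three NUL-terminated file names back to back.
file_names_block = b'a.nif\x00b.dds\x00c.wav\x00'

# The block ends with a terminator, so the last split element is empty.
file_names = file_names_block.split(b'\x00')[:-1]

# Hypothetical files_count values, one per folder record.
files_per_folder = [2, 1]

# Hand the names out folder by folder without tracking an index by hand.
names = iter(file_names)
grouped = [list(islice(names, count)) for count in files_per_folder]
```

The split does copy each name into a new bytes object, but the original loop created one string per name as well, so the extra cost is roughly the list that holds them.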

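Slicing a memoryview uses no extra memory (beyond the new view's own parameters), so file_names_block[names_record_index:] does not copy the remaining bytes. If you use a cast, however, new native objects are created for the parsed memory region as soon as you access elements in the sequence. Note also that memoryview.cast() only exists in Python 3.3 and later, so it is not available in Python 2.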
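One way to convince yourself that a slice is a view rather than a copy is to mutate the underlying buffer and look at the slice afterwards. The sketch below uses a small made-up bytearray (a bytearray rather than bytes only so that it can be modified in place); the names are illustrative.

```python
# Hypothetical buffer; bytearray so the contents can be changed in place.
data = bytearray(b'abcdefgh')
view = memoryview(data)

tail = view[2:]    # a new view object, no bytes are copied
sub = tail[:3]     # slicing a slice is still just a view

# Change the underlying buffer after the slices were taken.
data[2] = ord('X')

# The slice reflects the change, so it shares memory with the buffer.
snapshot = sub.tobytes()
```

Only tobytes() at the end allocates a new string, and only for the three bytes that were asked for.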

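You can still use a memoryview for parsing file_records_block; those strings are prefixed with a length, which gives you the opportunity to use slicing. Just keep slicing the memoryview as you process the folder_path values; there is no need to keep an index: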

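for folder_record in folder_records:
    # first byte is the length; on Python 2 indexing gives a 1-character str,
    # hence ord() (on Python 3 indexing already gives the integer)
    name_size = ord(file_records_block[0])
    folder_path = file_records_block[1:name_size].tobytes()
    file_records_block = file_records_block[name_size + 1:]  # skip the null

Because the memoryview is sourced from a bytes object, indexing gives you a single byte: as an integer on Python 3, but as a one-character str on Python 2, which is why the loop wraps it in ord(). Calling .tobytes() on a slice gives you a new bytes string for that section, and you can then slice again to leave the remainder for the next loop iteration.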
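The slicing loop can be tried out in isolation. The following is a sketch under assumptions: the helper name iter_length_prefixed and the sample bytes are invented for illustration, and the record layout (one length byte that counts the terminator, then the characters, then a NUL) is inferred from the question's code. It reads the length with ord() on a one-byte slice so that the same code runs on both Python 2 and Python 3.

```python
def iter_length_prefixed(block):
    """Yield each string from a block of <length><chars><NUL> records."""
    view = memoryview(block)
    while len(view):
        # A one-byte slice converted with tobytes() is a length-1 string
        # on both Python 2 and Python 3, so ord() works on either.
        name_size = ord(view[0:1].tobytes())
        # The length byte counts the terminator, so stop one short of it.
        yield view[1:name_size].tobytes()
        # Skip the length byte, the characters and the terminator.
        view = view[name_size + 1:]

# Hypothetical block holding two records: "abc" and "de".
sample = b'\x04abc\x00\x03de\x00'
paths = list(iter_length_prefixed(sample))
```

Each iteration only rebinds view to a smaller view of the same buffer, so the remaining bytes are never copied.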

This question about reading a binary file using memoryview is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/40110306/
