python - Problem reading a PDF into memory with PDFMiner.Six

Consider the following code snippet:

import io
from pdfminer.high_level import extract_text_to_fp

result = io.StringIO()
with open("file.pdf") as fp:
    extract_text_to_fp(fp, result, output_type='xml')

data = result.getvalue()

This raises the following error:

ValueError: Codec is required for a binary I/O output

If I omit output_type, I get this error instead:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 3804: character maps to <undefined>

I don't understand why this happens and would appreciate help fixing it.

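Best answer

I worked out how to fix this. First, open "file.pdf" in binary mode, because a PDF is binary data and text mode tries to decode it as characters. Then, if you want to read the output into memory, use BytesIO instead of StringIO and decode the result afterwards. For example: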

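import io
from pdfminer.high_level import extract_text_to_fp

# Collect the converter's output as bytes in memory
result = io.BytesIO()
# 'rb' opens the PDF in binary mode, so no text decoding is attempted
with open("file.pdf", 'rb') as fp:
    extract_text_to_fp(fp, result, output_type='xml')

# Decode the collected bytes into a str
data = result.getvalue().decode("utf-8")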
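
The UnicodeDecodeError from the question can be reproduced without any PDF library. The sketch below assumes the text-mode default encoding was cp1252 (common on Windows; the question does not state the platform) and uses a made-up byte string, raw, in place of real PDF content. It illustrates that decoding such bytes as text fails, while a binary buffer stores them unchanged.

```python
import io

# Hypothetical stand-in for PDF content: a header followed by a byte (0x8f)
# that has no mapping in cp1252.
raw = b"%PDF-1.4\n\x8f\n"

# Text mode decodes bytes with the platform's default encoding. Assuming
# that encoding is cp1252, the decode step fails on 0x8f.
try:
    raw.decode("cp1252")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Binary mode skips decoding entirely, so the bytes pass through untouched.
buffer = io.BytesIO()
buffer.write(raw)
roundtrip = buffer.getvalue()

print(error_message)
print(roundtrip == raw)
```

This is why the answer opens the file with 'rb': the PDF parser needs the raw bytes, and any decoding to text should happen only on the converter's output.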

Regarding "python - Problem reading a PDF into memory with PDFMiner.Six", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/72137273/
