python - Reading multiple PDF files with Python

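I have a directory containing a bunch of PDF files with text in them. My idea is to read all of them in one go and store the results in a dictionary. Right now I can only process one file at a time, using the textract library as follows: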

import textract

text = textract.process('/Users/user/Documents/Data/CLAR.pdf', 
                        method='tesseract', 
                        language='eng')

How can I read all of them at once? Do I need a for loop that walks the directory, or is there some other way?

Best answer

One solution is to combine the os library with a for loop:

import os
import textract

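# Absolute paths of every entry in the current working directory
files_path = [os.path.abspath(x) for x in os.listdir()]

# Keep only names that end in .pdf (a plain substring test would also match e.g. "a.pdf.bak")
files_path = [pdf for pdf in files_path if pdf.lower().endswith('.pdf')]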

pdfs = []
for file in files_path:
    text = textract.process(file,
                            method='tesseract',
                            language='eng')

    pdfs.append(text)

Get all files in the current directory
Exclude files that are not .pdf
Save the text in a list (a different data structure would work too)
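
Since the question asks for a dictionary rather than a list, the steps above could be adapted roughly as follows. This is only a sketch: `read_pdfs` and its `extract` parameter are hypothetical names introduced here, and it assumes the extractor returns UTF-8 encoded bytes. When no extractor is passed, it falls back to the same `textract.process` call used in the question.

```python
from pathlib import Path
from typing import Callable, Dict, Optional


def read_pdfs(directory: str,
              extract: Optional[Callable[[str], bytes]] = None) -> Dict[str, str]:
    """Return {file name: extracted text} for every .pdf file in `directory`."""
    if extract is None:
        # Imported lazily so the helper can be tried out without textract installed.
        import textract

        def extract(path: str) -> bytes:
            # Same call as in the question.
            return textract.process(path, method='tesseract', language='eng')

    texts = {}
    for pdf in sorted(Path(directory).iterdir()):
        # Match the extension case-insensitively and skip sub-directories.
        if pdf.is_file() and pdf.suffix.lower() == '.pdf':
            raw = extract(str(pdf))
            # Assumption: the extractor returns UTF-8 bytes; decode them if so.
            texts[pdf.name] = raw.decode('utf-8') if isinstance(raw, bytes) else raw
    return texts
```

Calling `read_pdfs('/Users/user/Documents/Data')` would then give a dictionary keyed by file name, for example `'CLAR.pdf'`. Taking the directory as an argument, instead of relying on the current working directory, means the script does not have to be started from inside the PDF folder.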

Regarding python - reading multiple PDF files with Python, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/50678665/
